HDT size in memory > 2.4 GB, details:

```
Hdt {
    dict: FourSectDict {
        shared: total size 17.5 MB, sequence 217.6 KB with 69619 entries, 25 bits per entry, packed data 17.2 MB,
        subjects: total size 229.9 MB, sequence 1.2 MB with 333197 entries, 28 bits per entry, packed data 228.7 MB,
        predicates: total size 19.7 KB, sequence 192 B with 102 entries, 15 bits per entry, packed data 19.5 KB,
        objects: total size 443.2 MB, sequence 4.2 MB with 1147322 entries, 29 bits per entry, packed data 439.0 MB,
    },
    triple_sect: Bitmap(
        total size 1.7 GB
        adjlist_y AdjList {
            sequence: 156.6 MB with 113873721 entries, 11 bits per entry,
            bitmap: Bitmap size unknown,
        }
        adjlist_z AdjList {
            sequence: 407.1 MB with 120635813 entries, 27 bits per entry,
            bitmap: Bitmap size unknown,
        }
        op_index AdjList {
            sequence: 965.1 MB with 120635813 entries, 64 bits per entry,
            bitmap: Bitmap size unknown,
        }
        wavelet_y 205.5 MB,
    ),
}
```
Why are 102 predicates encoded with 15 bits each? They should only need 7, since ceil(log2(102)) = 7. The objects should likewise take around 21 bits rather than 29: object IDs span the shared and objects sections, i.e. 69619 + 1147322 = 1216941 values, and ceil(log2(1216941)) = 21. Are 8 bits too many being used everywhere? But it is loaded like this from the hdt-cpp output...
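As a sanity check, a minimal Rust sketch of the expected widths, using only the entry counts from the dump above:

```rust
/// Minimal number of bits needed to represent IDs up to and including `max_id`.
fn bits_needed(max_id: u64) -> u32 {
    u64::BITS - max_id.leading_zeros()
}

fn main() {
    // 102 predicates: ceil(log2(102)) = 7 bits, not 15.
    println!("predicates: {} bits", bits_needed(102));
    // Object IDs span the shared + objects sections:
    // 69619 + 1147322 = 1216941 IDs -> 21 bits, not 29.
    println!("objects: {} bits", bits_needed(69_619 + 1_147_322));
}
```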
HDT size in memory > 1.9 GB, details:

```
Hdt {
    dict: FourSectDict {
        shared: total size 17.5 MB, sequence 217.6 KB with 69619 entries, 25 bits per entry, packed data 17.2 MB,
        subjects: total size 229.9 MB, sequence 1.2 MB with 333197 entries, 28 bits per entry, packed data 228.7 MB,
        predicates: total size 19.7 KB, sequence 192 B with 102 entries, 15 bits per entry, packed data 19.5 KB,
        objects: total size 443.2 MB, sequence 4.2 MB with 1147322 entries, 29 bits per entry, packed data 439.0 MB,
    },
    triple_sect: Bitmap(
        total size 1.3 GB
        adjlist_y AdjList {
            sequence: 156.6 MB with 113873721 entries, 11 bits per entry,
            bitmap: Bitmap size unknown,
        }
        adjlist_z AdjList {
            sequence: 407.1 MB with 120635813 entries, 27 bits per entry,
            bitmap: Bitmap size unknown,
        }
        op_index total size 482.5 MB
            sequence 482.5 MB
            bitmap: Bitmap size unknown
        wavelet_y 205.5 MB,
    ),
}
```
HDT size in memory > 1.9 GB, details:

```
Hdt {
    dict: FourSectDict {
        shared: total size 17.5 MB, sequence 217.6 KB with 69619 entries, 25 bits per entry, packed data 17.2 MB,
        subjects: total size 229.9 MB, sequence 1.2 MB with 333197 entries, 28 bits per entry, packed data 228.7 MB,
        predicates: total size 19.7 KB, sequence 192 B with 102 entries, 15 bits per entry, packed data 19.5 KB,
        objects: total size 443.2 MB, sequence 4.2 MB with 1147322 entries, 29 bits per entry, packed data 439.0 MB,
    },
    triple_sect: Bitmap(
        total size 1.2 GB
        adjlist_y AdjList {
            sequence: 156.6 MB with 113873721 entries, 11 bits per entry,
            bitmap: Bitmap size unknown,
        }
        adjlist_z AdjList {
            sequence: 407.1 MB with 120635813 entries, 27 bits per entry,
            bitmap: Bitmap size unknown,
        }
        op_index total size 407.1 MB
            sequence 3.3 GB with 27 bits
            bitmap: Bitmap size unknown
        wavelet_y 205.5 MB,
    ),
}
```
At below 2 GB of RAM usage, including indexes, for a 1.3 GB HDT file, I'm now at a point I consider acceptable for now. However, the big spike in temporary memory allocation during loading could still crash a server, so the peak should be brought down to 4 GB or less.
Without the indexes it's much less.
Both CompactVectors are built from slices; would it be possible to save space by building from a consuming iterator instead? Is that supported?
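A rough sketch of the idea, assuming the builder API (`CompactVector::new` plus `push_int`) from a recent sucds release; exact module paths and signatures may differ between versions:

```rust
use sucds::int_vectors::CompactVector;

/// Sketch: fill a CompactVector from a consuming iterator instead of first
/// materializing a temporary Vec and passing a slice. The bit width has to
/// be known up front.
fn build_from_iter(width: usize, values: impl Iterator<Item = usize>) -> CompactVector {
    let mut cv = CompactVector::new(width).expect("width must be <= 64");
    for v in values {
        cv.push_int(v).expect("value exceeds the chosen width");
    }
    cv
}
```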
Down to 5.5 GB peak, more next week.
Test whether a HashMap needs less space than a BTreeMap. Update: won't work, Vecs aren't hashable. Another update: the key space is dense, so just use a Vec of Vecs instead of a map.
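A minimal sketch of that replacement with hypothetical types (not the actual hdt-rs code): since the IDs are dense, a plain Vec indexed by ID replaces the map and avoids its per-entry overhead.

```rust
use std::collections::BTreeMap;

/// Hypothetical: turn a map with dense usize keys into direct indexing;
/// empty Vecs fill any unused slots.
fn to_dense(map: BTreeMap<usize, Vec<u32>>, max_id: usize) -> Vec<Vec<u32>> {
    let mut dense = vec![Vec::new(); max_id + 1];
    for (id, values) in map {
        dense[id] = values;
    }
    dense
}
```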
We could reorder the generation statements so that the matrix is generated as early as possible, while the resident baseline is still small.
Much lower peak, only around 800 MB to go!
The big spike comes from sucds::wavelet_matrix::WaveletMatrix::from_ints; I opened an issue at https://github.com/kampersanda/sucds/issues/44. I will close this issue and continue when there is a response there, or otherwise investigate another library later.
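For context, a sketch of the call shape (the module path is taken from the line above; the exact from_ints signature and the `y_values` stand-in are assumptions): the caller's input stays alive for the whole construction, so the peak is the input plus everything from_ints allocates internally.

```rust
use sucds::wavelet_matrix::WaveletMatrix;

// `y_values` is a hypothetical stand-in for the sequence hdt-rs feeds in;
// the argument type of from_ints may differ between sucds versions.
fn build_wavelet(y_values: Vec<usize>) -> WaveletMatrix {
    // The input buffer must stay alive during construction, so peak memory
    // is the input plus the builder's internal per-level buffers.
    let wm = WaveletMatrix::from_ints(&y_values).expect("construction failed");
    drop(y_values); // the temporary input is only freed after construction
    wm
}
```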
Brought significantly below 4 GB in #16.
[heaptrack screenshot: lscomplete20143.hdt]
[heaptrack screenshot: without wavelet matrix]