KonradHoeffner / hdt

Library for the Header Dictionary Triples (HDT) compression file format for RDF data.
https://crates.io/crates/hdt
MIT License

measure and optimize RAM consumption of the different components #11

Closed: KonradHoeffner closed this issue 1 year ago

KonradHoeffner commented 1 year ago

heaptrack lscomplete20143.hdt

[screenshot: heaptrack memory profile]

without wavelet matrix:

[screenshot: heaptrack memory profile without wavelet matrix]

KonradHoeffner commented 1 year ago

qbench2

without wavelet matrix:

[two screenshots: qbench2 memory profiles]

KonradHoeffner commented 1 year ago

HDT size in memory > 2.4 GB, details:
Hdt {
    dict: FourSectDict {
        shared: total size 17.5 MB, sequence 217.6 KB with 69619 entries, 25 bits per entry, packed data 17.2 MB,
        subjects: total size 229.9 MB, sequence 1.2 MB with 333197 entries, 28 bits per entry, packed data 228.7 MB,
        predicates: total size 19.7 KB, sequence 192 B with 102 entries, 15 bits per entry, packed data 19.5 KB,
        objects: total size 443.2 MB, sequence 4.2 MB with 1147322 entries, 29 bits per entry, packed data 439.0 MB,
    },
    triple_sect: Bitmap(
        total size 1.7 GB
        adjlist_y AdjList {
            sequence: 156.6 MB with 113873721 entries, 11 bits per entry,
            bitmap: Bitmap size unknown,
        }
        adjlist_z AdjList {
            sequence: 407.1 MB with 120635813 entries, 27 bits per entry,
            bitmap: Bitmap size unknown,
        }
        op_index AdjList {
            sequence: 965.1 MB with 120635813 entries, 64 bits per entry,
            bitmap: Bitmap size unknown,
        }
        wavelet_y 205.5 MB,
    ),
}

Why are 102 predicates encoded with 15 bits? They should only need 7. Objects should likewise take around 21 bits, not 29. Are 8 bits too many used everywhere? But this is how the data is loaded from the hdt-cpp output...
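
For reference, the minimal width is just the bit length of the largest ID, i.e. floor(log2(n)) + 1; a standalone sketch of that arithmetic (not library code):

```rust
// Bit width needed to store the largest ID n, i.e. floor(log2(n)) + 1.
fn bits_needed(n: u64) -> u32 {
    64 - n.leading_zeros()
}

fn main() {
    assert_eq!(bits_needed(102), 7); // 102 predicates: 7 bits, not 15
    assert_eq!(bits_needed(1_147_322), 21); // 1147322 objects: 21 bits, not 29
}
```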

KonradHoeffner commented 1 year ago

Optimization Candidates

KonradHoeffner commented 1 year ago

32 bit OP index sequence

HDT size in memory > 1.9 GB, details:
Hdt {
    dict: FourSectDict {
        shared: total size 17.5 MB, sequence 217.6 KB with 69619 entries, 25 bits per entry, packed data 17.2 MB,
        subjects: total size 229.9 MB, sequence 1.2 MB with 333197 entries, 28 bits per entry, packed data 228.7 MB,
        predicates: total size 19.7 KB, sequence 192 B with 102 entries, 15 bits per entry, packed data 19.5 KB,
        objects: total size 443.2 MB, sequence 4.2 MB with 1147322 entries, 29 bits per entry, packed data 439.0 MB,
    },
    triple_sect: Bitmap(
        total size 1.3 GB
        adjlist_y AdjList {
            sequence: 156.6 MB with 113873721 entries, 11 bits per entry,
            bitmap: Bitmap size unknown,
        }
        adjlist_z AdjList {
            sequence: 407.1 MB with 120635813 entries, 27 bits per entry,
            bitmap: Bitmap size unknown,
        }
        op_index total size 482.5 MB
            sequence 482.5 MB
            bitmap: Bitmap size unknown
        wavelet_y 205.5 MB,
    ),
}

KonradHoeffner commented 1 year ago

variable bit OP index sequence

HDT size in memory > 1.9 GB, details:
Hdt {
    dict: FourSectDict {
        shared: total size 17.5 MB, sequence 217.6 KB with 69619 entries, 25 bits per entry, packed data 17.2 MB,
        subjects: total size 229.9 MB, sequence 1.2 MB with 333197 entries, 28 bits per entry, packed data 228.7 MB,
        predicates: total size 19.7 KB, sequence 192 B with 102 entries, 15 bits per entry, packed data 19.5 KB,
        objects: total size 443.2 MB, sequence 4.2 MB with 1147322 entries, 29 bits per entry, packed data 439.0 MB,
    },
    triple_sect: Bitmap(
        total size 1.2 GB
        adjlist_y AdjList {
            sequence: 156.6 MB with 113873721 entries, 11 bits per entry,
            bitmap: Bitmap size unknown,
        }
        adjlist_z AdjList {
            sequence: 407.1 MB with 120635813 entries, 27 bits per entry,
            bitmap: Bitmap size unknown,
        }
        op_index total size 407.1 MB
            sequence 3.3 GB with 27 bits
            bitmap: Bitmap size unknown
        wavelet_y 205.5 MB,
    ),
}

[screenshot: heaptrack memory profile]
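
As a sanity check on the three op_index variants, the packed sequence size is just entries × bits per entry / 8; a quick standalone calculation (sketch only) reproducing the numbers above:

```rust
// Packed sequence size in MB: entries * bits_per_entry bits, 8 bits per byte.
fn sequence_mb(entries: u64, bits_per_entry: u64) -> f64 {
    (entries * bits_per_entry) as f64 / 8.0 / 1e6
}

fn main() {
    let entries = 120_635_813; // op_index entries from the dumps above
    println!("{:.1} MB", sequence_mb(entries, 64)); // 965.1 MB (original)
    println!("{:.1} MB", sequence_mb(entries, 32)); // 482.5 MB (32 bit)
    println!("{:.1} MB", sequence_mb(entries, 27)); // 407.1 MB (variable bit)
}
```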

KonradHoeffner commented 1 year ago

With RAM usage below 2 GB including indexes for a 1.3 GB HDT file, I'm now at a point I consider acceptable for the moment. However, the big spike in temporary memory allocation during loading may crash a server, so that peak should be brought down to 4 GB or less.

KonradHoeffner commented 1 year ago

Without the indexes it's much less: [screenshot: heaptrack memory profile without indexes]

KonradHoeffner commented 1 year ago

Both CompactVectors are built from slices; would it be possible to save space by using a consuming iterator instead? Is that supported?
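
For illustration, a hypothetical CompactBuilder (not the actual sucds API) showing the difference: with a slice, the full uncompressed input must be alive next to the compact structure, while a consuming iterator lets the input be produced and dropped element by element.

```rust
// Hypothetical builder, for illustration only; not the actual sucds API.
struct CompactBuilder {
    bits_per_entry: u32,
    packed: Vec<u64>, // bit-packed entries
    len: usize,
}

impl CompactBuilder {
    fn new(bits_per_entry: u32) -> Self {
        Self { bits_per_entry, packed: Vec::new(), len: 0 }
    }

    fn push(&mut self, _value: u64) {
        // Bit packing elided; a real implementation would append
        // `bits_per_entry` bits of `_value` to `packed`.
        self.len += 1;
    }

    // Slice-based: the whole uncompressed input exists in memory at once.
    fn from_slice(bits: u32, input: &[u64]) -> Self {
        let mut b = Self::new(bits);
        input.iter().for_each(|&v| b.push(v));
        b
    }

    // Iterator-based: entries can be generated lazily, so the peak is
    // roughly the size of the compact structure alone.
    fn from_iter(bits: u32, input: impl IntoIterator<Item = u64>) -> Self {
        let mut b = Self::new(bits);
        input.into_iter().for_each(|v| b.push(v));
        b
    }
}

fn main() {
    let slice_based = CompactBuilder::from_slice(27, &[1, 2, 3]);
    let iter_based = CompactBuilder::from_iter(27, (0..3).map(|i| i * 2));
    assert_eq!(slice_based.len, iter_based.len);
}
```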

KonradHoeffner commented 1 year ago

Down to 5.5 GB peak, more next week. [screenshot: heaptrack memory profile]

KonradHoeffner commented 1 year ago

Test whether a HashMap needs less space than a BTreeMap. Update: won't work, Vecs aren't hashable. Another update: the key space is dense, so just use a Vec of Vecs instead of a map.
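
A minimal sketch of that replacement (names and types are illustrative, not the library's): since the IDs are dense, the map key becomes the outer Vec index, with no per-entry tree or hash overhead.

```rust
use std::collections::BTreeMap;

// Illustrative: map from ID to positions, with per-node tree overhead.
fn build_map(pairs: &[(usize, u32)]) -> BTreeMap<usize, Vec<u32>> {
    let mut map: BTreeMap<usize, Vec<u32>> = BTreeMap::new();
    for &(id, pos) in pairs {
        map.entry(id).or_default().push(pos);
    }
    map
}

// Dense IDs 0..n: the ID is the index, no map structure needed at all.
fn build_vecs(pairs: &[(usize, u32)], n: usize) -> Vec<Vec<u32>> {
    let mut index = vec![Vec::new(); n];
    for &(id, pos) in pairs {
        index[id].push(pos);
    }
    index
}

fn main() {
    let pairs = [(0, 5), (2, 7), (0, 9)];
    assert_eq!(build_map(&pairs)[&0], vec![5, 9]);
    assert_eq!(build_vecs(&pairs, 3)[0], vec![5, 9]);
}
```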

KonradHoeffner commented 1 year ago

We could reorder the generation statements so that the wavelet matrix is generated as early as possible.
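
A sketch of the idea (hypothetical load order, not the actual code): peak RSS is the sum of everything alive at the worst moment, so the build with the biggest temporary spike should run while the fewest other structures exist.

```rust
// Hypothetical sketch of the scheduling idea; not the actual loading code.
// Building the wavelet matrix needs a large temporary buffer that is
// dropped afterwards, so peak memory depends on what else is alive then.

fn build_wavelet_matrix() -> Vec<u64> {
    let temp: Vec<u64> = (0..1 << 20).collect(); // big temporary allocation
    vec![temp.iter().sum()] // small persistent result; `temp` dropped here
}

fn build_other_indexes() -> Vec<u64> {
    vec![0; 1 << 20] // stays alive for the lifetime of the Hdt instance
}

fn main() {
    // Matrix first: its temporary spike overlaps with almost nothing.
    let wavelet = build_wavelet_matrix();
    // Indexes afterwards; had they come first, the spike would stack on top.
    let indexes = build_other_indexes();
    println!("{} {}", wavelet.len(), indexes.len());
}
```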

KonradHoeffner commented 1 year ago

Much lower peak, only around 800 MB to go! [screenshot: heaptrack memory profile]

KonradHoeffner commented 1 year ago

The big spike comes from sucds::wavelet_matrix::WaveletMatrix::from_ints; I opened an issue at https://github.com/kampersanda/sucds/issues/44. I will close this issue and continue when there is a response there, or otherwise investigate another library later.

KonradHoeffner commented 1 year ago

Brought significantly below 4 GB in #16.