eclipse-rdf4j / rdf4j

Eclipse RDF4J: scalable RDF for Java
https://rdf4j.org/
BSD 3-Clause "New" or "Revised" License
352 stars 160 forks

LMDB OOM's for a Large Dataset #4967

Open benherber opened 2 months ago

benherber commented 2 months ago

Current Behavior

Currently exploring whether we can get one of the SAIL implementations to scale to the use cases we have (>= single-digit billions of triples in some cases). The LMDB SAIL seems like it may be able to handle this (https://github.com/eclipse-rdf4j/rdf4j/discussions/3706#discussioncomment-2285945); however, I am getting an OOM error on some (but not all) queries.

More specifically, we are using the SP2B benchmark to test this, using the bundled generator to populate the store: https://dbis.informatik.uni-freiburg.de/forschung/projekte/SP2B/

The query that we first ran into was Q2:

PREFIX rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs:    <http://www.w3.org/2000/01/rdf-schema#>
PREFIX swrc:    <http://swrc.ontoware.org/ontology#>
PREFIX foaf:    <http://xmlns.com/foaf/0.1/>
PREFIX bench:   <http://localhost/vocabulary/bench/>
PREFIX dc:      <http://purl.org/dc/elements/1.1/>
PREFIX dcterms: <http://purl.org/dc/terms/>

SELECT ?inproc ?author ?booktitle ?title ?proc ?ee ?page ?url ?yr ?abstract
WHERE {
    ?inproc rdf:type bench:Inproceedings .
    ?inproc dc:creator ?author .
    ?inproc bench:booktitle ?booktitle .
    ?inproc dc:title ?title .
    ?inproc dcterms:partOf ?proc .
    ?inproc rdfs:seeAlso ?ee .
    ?inproc swrc:pages ?page .
    ?inproc foaf:homepage ?url .
    ?inproc dcterms:issued ?yr
    OPTIONAL {
         ?inproc bench:abstract ?abstract
    }
}
ORDER BY ?yr

Initially I ran this with the default JVM heap etc., and it OOM'd after a period of time. I whacked the heap space up to 48G on my 96G machine and it hasn't OOM'd so far.

Expected Behavior

Given the iterator design, I would've expected that the query might be slow but shouldn't OOM during evaluation. Is that understanding not correct?

Steps To Reproduce

  1. Generate and load an LMDB store with the 1 billion triple SP2B dataset: https://dbis.informatik.uni-freiburg.de/forschung/projekte/SP2B/
  2. Run SP2B Q2 over the dataset using default JVM settings

Version

4.3.11

Are you interested in contributing a solution yourself?

Perhaps?

Anything else?

The store was able to load 1 billion triples on my machine in ~5.5-6 hrs using write batches of around 1,000 triples, which was really nice!

kenwenzel commented 2 months ago

It could be due to the ORDER BY clause, which needs to materialize all values in a sorted set.
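
To illustrate why ORDER BY undermines a streaming design, here is a toy sketch in plain Java (not RDF4J's actual evaluation code): a lazy pipeline can hand results downstream one at a time, but a sort must buffer every element before it can emit the first one, so memory grows with the result size instead of staying constant.

```java
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class OrderByDemo {
    public static void main(String[] args) {
        // A lazy "iterator-style" pipeline: nothing is buffered yet.
        Stream<Integer> lazy = Stream.iterate(0, i -> i + 1).limit(1_000_000);

        // sorted() is the analogue of ORDER BY: it must collect all
        // 1,000,000 elements into an internal buffer before emitting
        // even the first result. That buffering is where heap grows.
        List<Integer> firstThree = lazy.sorted(Comparator.reverseOrder())
                .limit(3)
                .collect(Collectors.toList());

        System.out.println(firstThree); // [999999, 999998, 999997]
    }
}
```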

@JervenBolleman I think you have worked on the persistent sets?

@benherber BTW, you should/could use write batches with 100k triples. It would also be better to use the 5.0.0-SNAPSHOT, as the 4.x.x version has several bugs in the LmdbStore.
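
For reference, the commit-every-100k pattern might look roughly like the sketch below. `Store` is a stand-in interface invented for this example, not an RDF4J type (RDF4J's `RepositoryConnection` has `begin`/`add`/`commit` methods of the same general shape):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

public class BatchLoader {
    // Minimal stand-in for a transactional store connection;
    // purely illustrative, not an RDF4J interface.
    interface Store {
        void begin();
        void add(String triple);
        void commit();
    }

    static long load(Iterator<String> triples, Store store, int batchSize) {
        long count = 0;
        store.begin();
        while (triples.hasNext()) {
            store.add(triples.next());
            if (++count % batchSize == 0) {
                store.commit();   // flush a full batch to disk
                store.begin();    // start the next batch
            }
        }
        store.commit();           // flush the final partial batch
        return count;
    }

    public static void main(String[] args) {
        List<String> log = new ArrayList<>();
        Store fake = new Store() {
            public void begin() { log.add("begin"); }
            public void add(String t) { }
            public void commit() { log.add("commit"); }
        };
        long n = load(Collections.nCopies(250_000, "t").iterator(), fake, 100_000);
        long commits = log.stream().filter("commit"::equals).count();
        System.out.println(n + " triples, " + commits + " commits");
    }
}
```

With 250,000 triples and a batch size of 100,000 this performs three commits: two full batches and one final partial batch.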

benherber commented 2 months ago

> It could be due to the ORDER BY clause, which needs to materialize all values in a sorted set.
>
> @JervenBolleman I think you have worked on the persistent sets?
>
> @benherber BTW, you should/could use write batches with 100k triples. It would also be better to use the 5.0.0-SNAPSHOT, as the 4.x.x version has several bugs in the LmdbStore.

Ah, that would make sense. And good to know! Will try it out, thanks. Just to confirm: since 5.0.0 is coming down the pipe rather soon, does that mean that the LMDB implementation in 4.x.x will not get future bug fixes?

kenwenzel commented 2 months ago

The current implementation in 4.x.x is experimental, and I've made some fixes and enhancements in the develop branch. Those could be backported to 4.x.x, but I don't know the correct procedure for doing this.

@hmottestad Could you give some advice here?

hmottestad commented 2 months ago

You can create a PR with the fixes you want to backport and we can merge them into main. If the code ends up being identical between main and develop then there shouldn't be any problems. If not then we will need to be a bit more careful when merging main into develop later.

kenwenzel commented 2 months ago

@benherber How do you execute the queries? Are all results materialized?

benherber commented 2 months ago

> @benherber How do you execute the queries? Are all results materialized?

I just iterate through the result set, counting the individual bindings:

try (final TupleQueryResult res = query.evaluate()) {
    for (final BindingSet set : res) {
        for (final var ignore : set) {
            count++;
        }
    }
}

kenwenzel commented 2 months ago

OK, that looks good. Could you investigate the memory usage with VisualVM while running the query?
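
If attaching VisualVM is awkward for a long-running query, a cruder stdlib-only alternative is to sample heap usage from inside the iteration loop via `MemoryMXBean`. This is a hypothetical sketch; in the real run the sampling line would sit inside the result-iteration loop next to `count++`:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

public class HeapSampler {
    public static void main(String[] args) {
        MemoryMXBean mem = ManagementFactory.getMemoryMXBean();

        // Stand-in for the result-iteration loop: in practice, log
        // every N bindings to see whether used heap climbs steadily.
        for (int i = 0; i < 3; i++) {
            MemoryUsage heap = mem.getHeapMemoryUsage();
            System.out.printf("heap used: %,d bytes (max %,d)%n",
                    heap.getUsed(), heap.getMax());
        }
    }
}
```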

benherber commented 2 months ago

> OK, that looks good. Could you investigate the memory usage with VisualVM while running the query?

Yeah, I'll probably get to doing that later this week if I get the chance. Will update once I do.