RDFLib / rdflib

RDFLib is a Python library for working with RDF, a simple yet powerful language for representing information.
https://rdflib.readthedocs.org
BSD 3-Clause "New" or "Revised" License

Unusually high memory consumption in SPARQL query #2147

Closed karlb closed 3 weeks ago

karlb commented 1 year ago

My queries are being killed because they exceed the available memory on my machine. I'm using rdflib-hdt to work on a 70 MB HDT file and don't see how querying that file could reasonably use more than 10 GB of RAM. Is there a way to deal with this other than writing simpler queries? Is there any way for a user to find out where the problem comes from (in an SQL query plan I would look for things like large sort buffers)?

The query I'm using is

    SELECT ?lexentry ?other_form
    WHERE {
        ?lexentry a ontolex:LexicalEntry ;
                  ontolex:otherForm ?other_form .

        ?other_form ontolex:writtenRep ?other_written .
        OPTIONAL { ?lexentry lexinfo:partOfSpeech ?pos }

        OPTIONAL { ?other_form olia:hasMood ?mood }
        OPTIONAL { ?other_form olia:hasNumber ?number }
        OPTIONAL { ?other_form olia:hasPerson ?person }
        OPTIONAL { ?other_form olia:hasTense ?tense }
        OPTIONAL { ?other_form olia:hasVoice ?voice }

        OPTIONAL { ?other_form olia:hasCase ?case }
        OPTIONAL { ?other_form olia:hasInflectionType ?inflection }
        OPTIONAL { ?other_form olia:hasDefiniteness ?definiteness }
        OPTIONAL { ?other_form olia:hasGender ?gender }
    }

I can provide the exact code and data to reproduce this if that helps. I have executed the same query successfully on the same dataset in Virtuoso.
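One thing that may be worth checking with a query of this shape is how many rows the chain of OPTIONAL patterns can produce. Each OPTIONAL behaves like a left join, so a form with several values for several of these properties yields the product of those counts as result rows. The sketch below only illustrates that arithmetic; the property values are hypothetical and not taken from the dataset, and it does not show that this is what happens inside rdflib.

```python
from math import prod
from typing import Dict, List


def optional_row_count(values_per_property: Dict[str, List[str]]) -> int:
    """Rows produced for ONE base solution by a chain of OPTIONAL patterns.

    Each OPTIONAL acts as a left join: a property with k >= 1 matching values
    multiplies the row count by k, and a property with no value keeps the
    single row and leaves its variable unbound.
    """
    return prod(max(1, len(values)) for values in values_per_property.values())


# Hypothetical values for a single ?other_form (illustration only).
form = {
    "hasMood": ["indicative"],
    "hasNumber": ["singular", "plural"],
    "hasPerson": ["first", "third"],
    "hasTense": ["present", "past"],
    "hasVoice": [],
    "hasCase": ["nominative", "accusative", "dative"],
}

rows = optional_row_count(form)
print(rows)  # 1 * 2 * 2 * 2 * 1 * 3 = 24
```

If most forms carry at most one value per property, this factor stays at 1 and the OPTIONAL chain alone would not explain the memory use.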

Versions used:

Python 3.9.2
rdflib        6.2.0
rdflib-hdt    3.0

According to https://github.com/RDFLib/rdflib-hdt/issues/17#issuecomment-1296772401, the HDT-Store is unlikely to be the culprit.
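For the question above about finding where the memory goes, one option that needs nothing beyond the standard library is `tracemalloc`. The sketch below wraps an arbitrary callable and reports the peak traced memory together with the largest allocation sites. The helper name `peak_memory_mib` and the stand-in workload are assumptions made for illustration; with rdflib, the callable would be something that runs the query and consumes its result. Note that `tracemalloc` only sees allocations made through Python's allocator, so memory allocated inside native extension code may not show up.

```python
import tracemalloc
from typing import Any, Callable, List, Tuple


def peak_memory_mib(func: Callable[[], Any], top: int = 5) -> Tuple[Any, float, List[Any]]:
    """Run func() and return (result, peak traced MiB, top allocation sites)."""
    tracemalloc.start()
    try:
        result = func()
        # Take the snapshot while the result is still alive.
        snapshot = tracemalloc.take_snapshot()
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    top_stats = snapshot.statistics("lineno")[:top]
    return result, peak / (1024 * 1024), top_stats


# Stand-in workload. With rdflib one would pass something like
# `lambda: list(graph.query(QUERY))`, where `graph` and `QUERY` are
# placeholders for the loaded graph and the SPARQL query string.
result, peak, sites = peak_memory_mib(lambda: [str(i) for i in range(100_000)])

print(f"peak traced memory: {peak:.1f} MiB")
for stat in sites:
    print(stat)  # file, line number, size and count of allocations
```

The listed sites point at source lines rather than query operators, so this is a coarser tool than an SQL query plan, but it can show whether the growth comes from the query evaluation or from the store.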

karlb commented 1 year ago

Any feedback on this would be welcome, even if it is something like "what you do is a bad idea, because ...", "queries with many optionals are expected to use much memory in our current implementation" or "looks like a bug, but I don't have the time to look into it".

aucampia commented 1 year ago

@karlb we likely won't get time to look into performance any time soon, but if you have time to look at it, please do. We are always happy for pull requests.

Stores other than the default one may work better.

alessio-locatelli commented 1 month ago

@karlb Is this still a problem? Can you show pip freeze output and show a small reproducible example that I can copy and run? Otherwise, please close the issue.

karlb commented 3 weeks ago

I'm currently lacking the right input files to reproduce my case. I've deleted them locally to save disk space and they are not offered for download at the moment. I'll give an update once that changes and reopen if the problem still persists.