Queries with more than 20 triple patterns are never solved

tarcisiotmf commented 1 month ago

When I execute the queries below with QLever, they never complete, and the error below is returned after 5 minutes. The same queries were tested with GraphDB, where they complete in less than 2 seconds. You can replicate the issue with the following links:

{
    "exception": "Query timed out. Last operation: Query planning",
    "query": "# Among Melochia umbellata LCMS features in PI mode,\n# get the ones that are annoatted as [M+H]+ by SIRIUS and for which\n# a LCMS feature in NI mode with the corresponding [M-H]- m/z is found.\n\nPREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>\nPREFIX wdt: <http://www.wikidata.org/prop/direct/>\nPREFIX xsd: <http://www.w3.org/2001/XMLSchema#>\nPREFIX emi: <https://purl.org/emi#>\nPREFIX sosa: <http://www.w3.org/ns/sosa/>\nPREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\nPREFIX prov: <http://www.w3.org/ns/prov#>\nSELECT DISTINCT ?lcms_opp ?feature ?rt ?pm ?feature_opp ?rt_opp ?pm_opp\nWHERE\n    { \n    VALUES ?ppm {\n        \"5\"^^xsd:decimal # m/z tolerance\n        }\n    VALUES ?rt_tol {\n        \"0.05\"^^xsd:decimal # RT tolerance (minute)\n        }\n    ?sample rdf:type emi:ExtractSample.\n    ?sample sosa:isSampleOf* ?organe .\n    ?organe emi:inTaxon ?taxon . \n    ?taxon rdfs:label \"melochia umbellata\" .\n    ?sample sosa:isFeatureOfInterestOf ?lcms .\n    ?lcms sosa:hasResult ?feature_list .  \n    ?lcms rdf:type emi:LCMSAnalysisPos .\n    ?feature_list emi:hasLCMSFeature ?feature .                    
\n    ?feature emi:hasParentMass ?pm .\n    ?feature emi:hasRetentionTime  ?rt .\n\t?feature emi:hasAnnotation ?sirius .\n\t?sirius rdf:type emi:StructuralAnnotation .\n    ?sirius prov:wasGeneratedBy ?activiy .\n    ?activiy prov:wasAssociatedWith <https://bio.informatik.uni-jena.de/software/sirius> .\n    ?sirius emi:hasAdduct ?adduct .\n \tFILTER(regex(str(?adduct), \"[M+H]+\"))       \n    ?sample sosa:isFeatureOfInterestOf ?lcms_opp .\n    ?lcms_opp rdf:type emi:LCMSAnalysisNeg .\n    ?lcms_opp sosa:hasResult ?feature_list_opp .\n    ?feature_list_opp emi:hasLCMSFeature ?feature_opp .\n\t?feature_opp emi:hasParentMass ?pm_opp .\n    ?feature_opp emi:hasRetentionTime ?rt_opp .\n    FILTER(((?rt - ?rt_tol) < ?rt_opp) && ((?rt + ?rt_tol) > ?rt_opp))\n    FILTER((?pm_opp > ((?pm - 2.014) - ((?ppm * 0.000001) * (?pm - 2.014)))) && (?pm_opp < ((?pm - 2.014) + ((?ppm * 0.000001) * (?pm - 2.014)))))\n    }\n",
    "resultsize": 0,
    "status": "ERROR",
    "time": {
        "computeResult": 300403,
        "total": 300403
    }
}

Executing query with qlever

Executing query with graphdb, select emi-dbgi repository

The dataset used in our test is available here.

For your information, I also tried to simplify Query 2 by removing the property path, the VALUES clause, and the filter, but it still did not work (see Query 3 below).

Query 1:

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX emi: <https://purl.org/emi#>
PREFIX sosa: <http://www.w3.org/ns/sosa/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX prov: <http://www.w3.org/ns/prov#>
SELECT DISTINCT ?lcms_opp ?feature ?rt ?pm ?feature_opp ?rt_opp ?pm_opp
WHERE
    { 
    VALUES ?ppm {
        "5"^^xsd:decimal # m/z tolerance
        }
    VALUES ?rt_tol {
        "0.05"^^xsd:decimal # RT tolerance (minute)
        }
    ?sample rdf:type emi:ExtractSample.
    ?sample sosa:isSampleOf* ?organe .
    ?organe emi:inTaxon ?taxon . 
    ?taxon rdfs:label "melochia umbellata" .
    ?sample sosa:isFeatureOfInterestOf ?lcms .
    ?lcms sosa:hasResult ?feature_list .  
    ?lcms rdf:type emi:LCMSAnalysisPos .
    ?feature_list emi:hasLCMSFeature ?feature .                    
    ?feature emi:hasParentMass ?pm .
    ?feature emi:hasRetentionTime  ?rt .
    ?feature emi:hasAnnotation ?sirius .
    ?sirius rdf:type emi:StructuralAnnotation .
    ?sirius prov:wasGeneratedBy ?activiy .
    ?activiy prov:wasAssociatedWith <https://bio.informatik.uni-jena.de/software/sirius> .
    ?sirius emi:hasAdduct ?adduct .
    FILTER(regex(str(?adduct), "[M+H]+"))       
    ?sample sosa:isFeatureOfInterestOf ?lcms_opp .
    ?lcms_opp rdf:type emi:LCMSAnalysisNeg .
    ?lcms_opp sosa:hasResult ?feature_list_opp .
    ?feature_list_opp emi:hasLCMSFeature ?feature_opp .
    ?feature_opp emi:hasParentMass ?pm_opp .
    ?feature_opp emi:hasRetentionTime ?rt_opp .
    FILTER(((?rt - ?rt_tol) < ?rt_opp) && ((?rt + ?rt_tol) > ?rt_opp))
    FILTER((?pm_opp > ((?pm - 2.014) - ((?ppm * 0.000001) * (?pm - 2.014)))) && (?pm_opp < ((?pm - 2.014) + ((?ppm * 0.000001) * (?pm - 2.014)))))
    }

Query 2:

# Get the PI mode LCMS features with SIRIUS annotation for which
# a LCMS feature in NI mode of the same extract is annotated with
# the same IK2D and has the same RT.

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>          
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX emi: <https://purl.org/emi#>
PREFIX sosa: <http://www.w3.org/ns/sosa/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX prov: <http://www.w3.org/ns/prov#>
SELECT DISTINCT ?feature ?feature_opp ?ik2d ?rt ?rt_opp
WHERE
    {
    VALUES ?rt_tol {
        "0.05"^^xsd:decimal # RT tolerance (minute)
        }
    ?sample rdf:type emi:ExtractSample .
    ?sample sosa:isSampleOf* ?organe .
    ?organe emi:inTaxon ?taxon . 
    ?taxon rdfs:label "melochia umbellata" .
    ?sample sosa:isFeatureOfInterestOf ?lcms .
    ?lcms sosa:hasResult ?feature_list .  
    ?lcms rdf:type emi:LCMSAnalysisPos .
    ?feature_list emi:hasLCMSFeature ?feature .
    ?feature emi:hasRetentionTime  ?rt .
    ?feature emi:hasAnnotation ?sirius .
    ?sirius rdf:type emi:StructuralAnnotation .
    ?sirius prov:wasGeneratedBy ?activiy .
    ?activiy prov:wasAssociatedWith <https://bio.informatik.uni-jena.de/software/sirius> .
    ?sirius emi:hasChemicalStructure ?ik2d .
    ?sample sosa:isFeatureOfInterestOf ?lcms_opp .
    ?lcms_opp rdf:type emi:LCMSAnalysisNeg .
    ?lcms_opp sosa:hasResult ?feature_list_opp .
    ?feature_list_opp emi:hasLCMSFeature ?feature_opp .
    ?feature_opp emi:hasRetentionTime ?rt_opp .                
    ?feature_opp emi:hasAnnotation ?sirius_opp .
    ?sirius_opp rdf:type emi:StructuralAnnotation .
    ?sirius_opp prov:wasGeneratedBy ?activiy_opp .
    ?activiy_opp prov:wasAssociatedWith <https://bio.informatik.uni-jena.de/software/sirius> .
    ?sirius_opp emi:hasChemicalStructure ?ik2d

    FILTER(((?rt - ?rt_tol) < ?rt_opp) && ((?rt + ?rt_tol) > ?rt_opp))
    }

Query 3:

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>          
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX emi: <https://purl.org/emi#>
PREFIX sosa: <http://www.w3.org/ns/sosa/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX prov: <http://www.w3.org/ns/prov#>
SELECT DISTINCT ?feature ?feature_opp ?ik2d ?rt ?rt_opp
WHERE
    {
    ?sample rdf:type emi:ExtractSample .
    ?sample sosa:isSampleOf ?a .
    ?a       sosa:isSampleOf ?organe .
    ?organe emi:inTaxon ?taxon . 
    ?taxon rdfs:label "melochia umbellata" .
    ?sample sosa:isFeatureOfInterestOf ?lcms .
    ?lcms sosa:hasResult ?feature_list .  
    ?lcms rdf:type emi:LCMSAnalysisPos .
    ?feature_list emi:hasLCMSFeature ?feature .
    ?feature emi:hasRetentionTime  ?rt .
    ?feature emi:hasAnnotation ?sirius .
    ?sirius rdf:type emi:StructuralAnnotation .
    ?sirius prov:wasGeneratedBy ?activiy .
    ?activiy prov:wasAssociatedWith <https://bio.informatik.uni-jena.de/software/sirius> .
    ?sirius emi:hasChemicalStructure ?ik2d .
    ?sample sosa:isFeatureOfInterestOf ?lcms_opp .
    ?lcms_opp rdf:type emi:LCMSAnalysisNeg .
    ?lcms_opp sosa:hasResult ?feature_list_opp .
    ?feature_list_opp emi:hasLCMSFeature ?feature_opp .
    ?feature_opp emi:hasRetentionTime ?rt_opp .                
    ?feature_opp emi:hasAnnotation ?sirius_opp .
    ?sirius_opp rdf:type emi:StructuralAnnotation .
    ?sirius_opp prov:wasGeneratedBy ?activiy_opp .
    ?activiy_opp prov:wasAssociatedWith <https://bio.informatik.uni-jena.de/software/sirius> .
    ?sirius_opp emi:hasChemicalStructure ?ik2d

    #FILTER(((?rt - ?rt_tol) < ?rt_opp) && ((?rt + ?rt_tol) > ?rt_opp))
    }
LIMIT 100

hannahbast commented 1 month ago

@tarcisiotmf The problem is that QLever's query planner currently generates all possible query plans (and then chooses the best). For large queries like yours, a heuristic is needed to limit the number of possible query plans that QLever evaluates.
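
As a rough illustration of how fast the plan space can grow (an assumption for illustration only: this counts left-deep join orders without any pruning, which is not necessarily how QLever's planner enumerates plans), the number of join orders for n triple patterns is n!. Query 3 above has 25 triple patterns.

```latex
\[
10! = 3\,628\,800, \qquad
20! \approx 2.43 \times 10^{18}, \qquad
25! \approx 1.55 \times 10^{25}
\]
```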

It is on our TODO list to do this automatically. Until then, you can do it manually by grouping parts of the query with { ... }. For each such group, all query plans are generated, and the best query plans for the groups are then combined. Choose the groups so that each group makes sense by itself. The smaller the final result for each group, the better.
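
As a sketch of what such a grouping could look like for Query 3 above: the split into three groups (taxon samples, positive-mode features, negative-mode features) is an assumption chosen for illustration and has not been tested against this dataset. The triple patterns themselves are unchanged from Query 3. Only the { ... } groups are new.

```sparql
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX emi: <https://purl.org/emi#>
PREFIX sosa: <http://www.w3.org/ns/sosa/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX prov: <http://www.w3.org/ns/prov#>
SELECT DISTINCT ?feature ?feature_opp ?ik2d ?rt ?rt_opp
WHERE
    {
    # Group 1 (assumed split): extract samples of the taxon
    {
    ?sample rdf:type emi:ExtractSample .
    ?sample sosa:isSampleOf ?a .
    ?a sosa:isSampleOf ?organe .
    ?organe emi:inTaxon ?taxon .
    ?taxon rdfs:label "melochia umbellata" .
    }
    # Group 2 (assumed split): PI mode features with a SIRIUS annotation
    {
    ?sample sosa:isFeatureOfInterestOf ?lcms .
    ?lcms sosa:hasResult ?feature_list .
    ?lcms rdf:type emi:LCMSAnalysisPos .
    ?feature_list emi:hasLCMSFeature ?feature .
    ?feature emi:hasRetentionTime ?rt .
    ?feature emi:hasAnnotation ?sirius .
    ?sirius rdf:type emi:StructuralAnnotation .
    ?sirius prov:wasGeneratedBy ?activiy .
    ?activiy prov:wasAssociatedWith <https://bio.informatik.uni-jena.de/software/sirius> .
    ?sirius emi:hasChemicalStructure ?ik2d .
    }
    # Group 3 (assumed split): NI mode features with a SIRIUS annotation
    {
    ?sample sosa:isFeatureOfInterestOf ?lcms_opp .
    ?lcms_opp rdf:type emi:LCMSAnalysisNeg .
    ?lcms_opp sosa:hasResult ?feature_list_opp .
    ?feature_list_opp emi:hasLCMSFeature ?feature_opp .
    ?feature_opp emi:hasRetentionTime ?rt_opp .
    ?feature_opp emi:hasAnnotation ?sirius_opp .
    ?sirius_opp rdf:type emi:StructuralAnnotation .
    ?sirius_opp prov:wasGeneratedBy ?activiy_opp .
    ?activiy_opp prov:wasAssociatedWith <https://bio.informatik.uni-jena.de/software/sirius> .
    ?sirius_opp emi:hasChemicalStructure ?ik2d .
    }
    }
LIMIT 100
```

The groups are joined on their shared variables (?sample and ?ik2d), so the result should be the same as for the ungrouped query. Note that groups 2 and 3 are not restricted to the taxon on their own, so their intermediate results may be large. If that turns out to be a problem, a different split may work better.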

Please try it and let us know if it worked for you.

tarcisiotmf commented 1 month ago

Thanks for the clarification! Currently, we are mostly evaluating available RDF stores for the different types of data and use cases we have at the SIB Swiss Institute of Bioinformatics, including example queries from real-world applications.

Actually, the majority of our use cases involve queries that are indeed large. We also develop question answering (QA) systems over different independent SPARQL endpoints (with different RDF technologies). Since the queries are generated automatically by the QA system, tailoring them case by case to a specific RDF store would be considerably complex for us.

I am really impressed by the latest QLever developments. Once this issue is solved, QLever will be highly relevant for us.

Thanks again for your quick reply and support!

ad-freiburg / qlever

Queries with more than 20 triple patterns are never solved #1428