ad-freiburg / qlever

Very fast SPARQL Engine, which can handle very large knowledge graphs like the complete Wikidata, offers context-sensitive autocompletion for SPARQL queries, and allows combination with text search. It's faster than engines like Blazegraph or Virtuoso, especially for queries involving large result sets.
Apache License 2.0

Failed to parse the Service result as JSON #1427

Open tarcisiotmf opened 3 months ago

tarcisiotmf commented 3 months ago

Executing the query below with QLever raises the error below. The same query works as expected in GraphDB. You can replicate the error with the following links:

Executing query with qlever [link]

Executing query with graphdb, select emi-dbgi repository [link]

The dataset used in our test is available here.

"exception": "Failed to parse the Service result as JSON. First 100 bytes: SPARQL-QUERY: queryStr=...

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX emi: <https://purl.org/emi#>
PREFIX sosa: <http://www.w3.org/ns/sosa/>
PREFIX prov: <http://www.w3.org/ns/prov#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?extract ?organe ?species_name ?genus_name ?family_name ?count_of_selected_class
WHERE
    {  
        ?material sosa:hasSample ?extract .
        ?material sosa:isSampleOf ?organe .
        ?organe emi:inTaxon ?wd_sp .
        OPTIONAL
        {
            SERVICE <https://query.wikidata.org/sparql> {
            ?wd_sp wdt:P225 ?species_name .
            ?family wdt:P31 wd:Q16521 ;
                wdt:P105 wd:Q35409 ;
                wdt:P225 ?family_name ;
                ^wdt:P171* ?wd_sp .
            ?genus wdt:P31 wd:Q16521 ;
                wdt:P105 wd:Q34740 ;
                wdt:P225 ?genus_name ;
                ^wdt:P171* ?wd_sp 
            }
        }
        {
            SELECT ?extract (COUNT(DISTINCT ?feature) AS ?count_of_selected_class)
            WHERE
            {   
                ?extract rdf:type emi:ExtractSample .
                ?extract sosa:isFeatureOfInterestOf ?lcms .
                ?lcms rdf:type emi:LCMSAnalysis .
                ?lcms emi:hasLCMSFeatureSet ?feature_list .
                ?feature_list emi:hasLCMSFeature ?feature .
                ?feature emi:hasAnnotation ?canopus .
                ?canopus rdf:type emi:ChemicalTaxonAnnotation . 
                ?canopus emi:hasClass ?np_class .
                ?np_class rdfs:label "Aspidosperma type" .
                ?canopus emi:hasClassProbability ?class_prob .
                FILTER((?class_prob > 0.5)) .
            } GROUP BY ?extract ORDER BY DESC(?count_of_selected_class)
        }
    }

tuukka commented 3 months ago

"the Service result" probably means what query.wikidata.org returns to QLever. The result clearly is not valid JSON as it starts with SPARQL-QUERY (and not with a JSON object). The query is quite slow, so it could be a timeout in query.wikidata.org.

You could use QLever's Wikidata endpoint instead, if you change the SERVICE to the following: SERVICE <https://qlever.cs.uni-freiburg.de/api/wikidata>
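To make the failure mode concrete, here is a minimal Python sketch of a client that expects SPARQL JSON results and reports the first 100 bytes of the body when it is not JSON. This is only an illustration of the idea, not QLever's actual code (QLever is written in C++); the function name `parse_service_result` and the sample bodies are made up.

```python
import json


def parse_service_result(body: bytes) -> dict:
    """Parse a SERVICE response body that is expected to be SPARQL JSON results.

    Hypothetical helper that loosely mirrors the error message in this issue.
    """
    try:
        parsed = json.loads(body.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError) as exc:
        # Show the start of the body so the real (non-JSON) error is visible.
        preview = body[:100].decode("utf-8", errors="replace")
        raise ValueError(
            "Failed to parse the Service result as JSON. "
            f"First 100 bytes: {preview}"
        ) from exc
    if not isinstance(parsed, dict):
        raise ValueError("Service result is JSON, but not a JSON object.")
    return parsed


if __name__ == "__main__":
    # A body like the one WDQS returned here: plain text, not a JSON object.
    try:
        parse_service_result(b"SPARQL-QUERY: queryStr=select ?wd_sp {")
    except ValueError as err:
        print(err)
```

The point is that the message says nothing about why the remote endpoint failed; the preview of the body is the only hint.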

tuukka commented 3 months ago

If I switch from WDQS to QLever Wikidata, the error is different: Blank nodes in the result of a SERVICE are currently not supported. For now, consider filtering them out using the ISBLANK function or converting them via the STR function.

Finally, after I added the ISBLANK filters, it timed out after 120 seconds: https://qlever.cs.uni-freiburg.de/wikidata/1gsnSL
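For readers wondering how blank nodes show up in a result: in the SPARQL 1.1 Query Results JSON Format, a blank node binding has "type": "bnode". The Python sketch below is a client-side analogue of filtering with ISBLANK, assuming the results were fetched as JSON. It is not what QLever does internally, and the function name and sample data are made up.

```python
def drop_rows_with_blank_nodes(results: dict) -> dict:
    """Return a copy of SPARQL JSON results without rows that bind a blank node."""
    kept = [
        row
        for row in results["results"]["bindings"]
        if all(term["type"] != "bnode" for term in row.values())
    ]
    return {"head": results["head"], "results": {"bindings": kept}}


# Illustrative data only: the second row binds ?species_name to a blank node.
sample = {
    "head": {"vars": ["wd_sp", "species_name"]},
    "results": {"bindings": [
        {"wd_sp": {"type": "uri", "value": "http://www.wikidata.org/entity/Q1"},
         "species_name": {"type": "literal", "value": "Example name"}},
        {"wd_sp": {"type": "uri", "value": "http://www.wikidata.org/entity/Q2"},
         "species_name": {"type": "bnode", "value": "b0"}},
    ]},
}

if __name__ == "__main__":
    filtered = drop_rows_with_blank_nodes(sample)
    print(len(filtered["results"]["bindings"]))
```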

tarcisiotmf commented 3 months ago

Thanks for the quick reply and support! I am really impressed with the performance of Qlever demos.

I changed the SERVICE request to your SPARQL endpoint and got the following error after more than one minute of execution. However, the subquery in the service query has only projections without blank nodes as values.

{
    "exception": "Blank nodes in the result of a SERVICE are currently not supported. For now, consider filtering them out using the ISBLANK function or converting them via the STR function.",
    "query": "PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>\nPREFIX wdt: <http://www.wikidata.org/prop/direct/>\nPREFIX wd: <http://www.wikidata.org/entity/>\nPREFIX emi: <https://purl.org/emi#>\nPREFIX sosa: <http://www.w3.org/ns/sosa/>\nPREFIX prov: <http://www.w3.org/ns/prov#>\nPREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n\nSELECT ?extract ?organe ?species_name ?genus_name ?family_name ?count_of_selected_class\nWHERE\n    {  \n    ?material sosa:hasSample ?extract .\n        ?material sosa:isSampleOf ?organe .\n        ?organe emi:inTaxon ?wd_sp .\n        OPTIONAL\n        {\n            SERVICE  <https://qlever.cs.uni-freiburg.de/api/wikidata> {\n            ?wd_sp wdt:P225 ?species_name .\n            ?family wdt:P31 wd:Q16521 ;\n                wdt:P105 wd:Q35409 ;\n                wdt:P225 ?family_name ;\n                ^wdt:P171* ?wd_sp .\n            ?genus wdt:P31 wd:Q16521 ;\n                wdt:P105 wd:Q34740 ;\n                wdt:P225 ?genus_name ;\n                ^wdt:P171* ?wd_sp \n            }\n        }\n        {\n            SELECT ?extract (COUNT(DISTINCT ?feature) AS ?count_of_selected_class)\n            WHERE\n            {   \n                ?extract rdf:type emi:ExtractSample .\n                ?extract sosa:isFeatureOfInterestOf ?lcms .\n                ?lcms rdf:type emi:LCMSAnalysis .\n                ?lcms emi:hasLCMSFeatureSet ?feature_list .\n                ?feature_list emi:hasLCMSFeature ?feature .\n                ?feature emi:hasAnnotation ?canopus .\n            \t?canopus rdf:type emi:ChemicalTaxonAnnotation . \n                ?canopus emi:hasClass ?np_class .\n            \t?np_class rdfs:label \"Aspidosperma type\" .\n                ?canopus emi:hasClassProbability ?class_prob .\n                FILTER((?class_prob > 0.5)) .\n            } GROUP BY ?extract ORDER BY DESC(?count_of_selected_class)\n        }\n    }  \n",
    "resultsize": 0,

The second issue with relying on your Wikidata endpoint is that it may not have access to the latest Wikidata data. Moreover, the non-JSON response from Wikidata is due to an error in Wikidata (out of memory). After a quick test, if we run the Wikidata part of the query without considering the results from the outer query (QLever BGP processing), we get either a timeout or an out-of-memory error, because QLever is not using the results of the outer query to filter the results of the inner query (the SERVICE subquery; a query-plan-related issue?).

SPARQL-QUERY: queryStr=select ?wd_sp ?species_name ?genus_name ?family_name {
            ?wd_sp wdt:P225 ?species_name .
            ?family wdt:P31 wd:Q16521 ;
                wdt:P105 wd:Q35409 ;
                wdt:P225 ?family_name ;
                ^wdt:P171* ?wd_sp .
            ?genus wdt:P31 wd:Q16521 ;
                wdt:P105 wd:Q34740 ;
                wdt:P225 ?genus_name ;
                ^wdt:P171* ?wd_sp  } #limit 100
java.util.concurrent.ExecutionException: java.util.concurrent.ExecutionException: org.openrdf.query.QueryEvaluationException: java.lang.RuntimeException: java.util.concurrent.ExecutionException: com.bigdata.rwstore.sector.MemoryManagerOutOfMemory
    at java.util.concurrent.FutureTask.report(FutureTask.java:122)
    at java.util.concurrent.FutureTask.get(FutureTask.java:206)

tuukka commented 3 months ago

However, the subquery in the service query has only projections without blank nodes as values.

Here's an example of a taxon whose taxon name is an "unknown value" in Wikidata, and this is represented as a blank node in RDF: https://www.wikidata.org/wiki/Q21362983

If the top-level query result shouldn't have any such results, it may be a signal that the query plan is indeed non-optimal regarding this subquery.

The second issue to rely on your Wikidata endpoint, it may result on not having access to the latest Wikidata data.

I believe there's work in progress to implement rolling updates in QLever's Wikidata endpoint. [I'm not a member of the QLever developer team but a Wikidata contributor.]

Moreover, the no JSON response from Wikidata is due to an error in Wikidata (out of memory).

Right - WDQS is known to have scaling issues, but this could also be because of QLever making a non-optimal query plan?

we get either a timeout or out of memory error because Qlever is not considering the results of the outer query to filter the results from the inner query (service subquery - query plan related issue?).

Just to clarify, are you saying that QLever is not sending the values of ?wd_sp (count 747) from the outer query to WDQS for the subquery? (It would be nice to have a way to see the exact subquery that is being sent to WDQS.)

If you have a QLever UI for your endpoint, you can click the button "Analysis" to view the query plan. (Otherwise, it's included in the JSON response which may be difficult to read.)

I don't know if the heuristics can be tweaked, but by reorganising the subqueries, I'm indeed getting the query to complete (with one result - is that the correct result?): https://qlever.cs.uni-freiburg.de/wikidata/uAxDnP

tarcisiotmf commented 3 months ago

Thanks, please see below my replies:

Right - WDQS is known to have scaling issues, but this could also be because of QLever making a non-optimal query plan?

Yes, I think so.

Just to clarify, are you saying that QLever is not sending the values of ?wd_sp (count 747) from the outer query to WDQS for the subquery?

It looks like it, since we are getting an out-of-memory error from Wikidata. The query works with a limit, or when ?wd_sp is explicitly assigned in the subquery.

If you have a QLever UI for your endpoint, you can click the button "Analysis" to view the query plan. (Otherwise, it's included in the JSON response which may be difficult to read.)

Sorry, I don't have it but I have provided the query and dataset to replicate the issue.

I don't know if the heuristics can be tweaked, but by reorganising the subqueries, I'm indeed getting the query to complete (with one result - is that the correct result?): https://qlever.cs.uni-freiburg.de/wikidata/uAxDnP

Yes, thanks. You can verify it with the link I provided when I opened this issue. Here is the same query executed in GraphDB (which is also faster for this query) at the SPARQL endpoint https://biosoda.unil.ch/graphdb/repositories/emi-dbgi:

querying with graphdb [link]

tuukka commented 3 months ago

Finally, after I added the ISBLANK filters, it timed out after 120 seconds: https://qlever.cs.uni-freiburg.de/wikidata/1gsnSL

@hannahbast Could you have a brief look at this issue? Basically, are the statements within a SERVICE clause always sent to the remote endpoint as is and the join done afterwards, or is it somehow possible to get QLever to evaluate local triple patterns first and send the resulting bindings to the remote endpoint (by adding VALUES statements, I imagine)?
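To spell out what I mean by adding VALUES statements, here is a rough Python sketch that takes bindings computed from the local triple patterns and prepends them to the SERVICE pattern as a VALUES clause. This is just my mental model, not a claim about how QLever implements it; the function names and the two entity IRIs are placeholders.

```python
def values_clause(variable: str, iris: list[str]) -> str:
    """Render a single-variable VALUES clause listing the given IRIs."""
    terms = " ".join(f"<{iri}>" for iri in iris)
    return f"VALUES ?{variable} {{ {terms} }}"


def constrain_service(endpoint: str, pattern: str, variable: str, iris: list[str]) -> str:
    """Prefix a SERVICE pattern with bindings computed from the local triples."""
    return (
        f"SERVICE <{endpoint}> {{\n"
        f"  {values_clause(variable, iris)}\n"
        f"  {pattern}\n"
        f"}}"
    )


if __name__ == "__main__":
    # Placeholder bindings; in this issue, ?wd_sp has 747 values.
    local_bindings = [
        "http://www.wikidata.org/entity/Q1",
        "http://www.wikidata.org/entity/Q2",
    ]
    print(constrain_service(
        "https://query.wikidata.org/sparql",
        "?wd_sp wdt:P225 ?species_name .",
        "wd_sp",
        local_bindings,
    ))
```

With the bindings in place, the remote endpoint only has to evaluate the pattern for the listed taxa instead of for every taxon in Wikidata.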

joka921 commented 3 months ago

Hi @tuukka and @tarcisiotmf ,

Thanks for your interest in QLever. I found the time to look at your issue, and I can say the following:

  1. The limitation concerning blank nodes in SERVICE queries is unfortunate, but it will be fixed eventually (I have to assign someone or myself to it :)), and it can typically be worked around.
  2. Since rather recently, QLever in principle supports constraining SERVICE queries by enriching them with VALUES clauses from the enclosing query. This mechanism currently has two limitations:
     2.1. It only sends the VALUES clause if it has at most 100 entries. This default is too low; it can be changed by issuing a GET request to <urlOfTheSparqlServer>/?service-max-value-rows=<newIntegerValue>&access-token=<yourAccessToken> . However, this is not a problem for your concrete query, where the outer context boils down to a single result.
     2.2. (Very relevant for you): QLever does NOT know how to constrain a SERVICE if the service is inside an OPTIONAL clause and the constraining triples are outside the OPTIONAL. In general, QLever is currently not good at optimizing OPTIONALs: we always fully evaluate the content of the OPTIONAL and then join it with everything that stands before it in the query.
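As a small illustration of the row limit and the settings request described above, here is a Python sketch. The parameter names `service-max-value-rows` and `access-token` and the default of 100 come from this comment; the helper names, the server URL and the token in the example are placeholders, and the check is only a model of the rule, not QLever's code.

```python
from urllib.parse import urlencode

DEFAULT_MAX_VALUE_ROWS = 100  # default mentioned in this thread


def would_send_values(num_rows: int, max_rows: int = DEFAULT_MAX_VALUE_ROWS) -> bool:
    """Model of the rule: the VALUES clause is sent only if it has at most max_rows entries."""
    return num_rows <= max_rows


def max_value_rows_url(server_url: str, new_value: int, access_token: str) -> str:
    """Build the GET URL that changes the limit on a running server."""
    params = urlencode({
        "service-max-value-rows": new_value,
        "access-token": access_token,
    })
    return f"{server_url.rstrip('/')}/?{params}"


if __name__ == "__main__":
    print(would_send_values(1))    # the single-result case of this query
    print(would_send_values(747))  # more rows than the default allows
    # Placeholder server URL and token.
    print(max_value_rows_url("http://localhost:7001", 1000, "my-token"))
```

The resulting URL can then be requested with any HTTP client, for example curl or a browser.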

Your current workarounds are thus:

  1. Drop the OPTIONAL around the SERVICE (then your query works in a reasonable time).
  2. Duplicate the constraining context INSIDE the OPTIONAL (ugly, but it preserves the exact semantics).
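The two workarounds can be sketched as query fragments. The Python snippet below only assembles the text of the group graph pattern for each variant. The SERVICE body is abbreviated to a single triple, the PREFIX declarations and the counting subquery are omitted, and the choice to repeat all three context triples inside the OPTIONAL is one possible reading of "constraining context", so treat it as an illustration rather than the exact query.

```python
from textwrap import indent

# Local triples that bind ?wd_sp (taken from the query in this issue).
CONTEXT = """\
?material sosa:hasSample ?extract .
?material sosa:isSampleOf ?organe .
?organe emi:inTaxon ?wd_sp ."""

# Abbreviated SERVICE body; the original also fetches the genus and family names.
SERVICE = """\
SERVICE <https://qlever.cs.uni-freiburg.de/api/wikidata> {
  ?wd_sp wdt:P225 ?species_name .
}"""


def workaround_1() -> str:
    """Drop the OPTIONAL: the SERVICE is joined directly with the context.

    Rows without a match in the SERVICE are no longer returned.
    """
    return f"{CONTEXT}\n{SERVICE}"


def workaround_2() -> str:
    """Keep the OPTIONAL, but repeat the constraining context inside it."""
    inner = indent(f"{CONTEXT}\n{SERVICE}", "  ")
    return f"{CONTEXT}\nOPTIONAL {{\n{inner}\n}}"


if __name__ == "__main__":
    print(workaround_1())
    print()
    print(workaround_2())
```

The first variant turns the left join into a plain join, which is why it does not preserve the semantics exactly; the second keeps unmatched rows at the cost of stating the context twice.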

In general: nice that you have set up a local SPARQL endpoint for your data. I would highly recommend also setting up the QLever UI, as its analysis capabilities are really useful, especially when sharing results.

tuukka commented 3 months ago

  1. Drop the OPTIONAL around the SERVICE (then your query works in a reasonable time).

Nice! With this approach, I'm seeing times a bit below 3s when the cache is clean.

With the better query plan, you can also go back to federating to query.wikidata.org if you wish: https://qlever.cs.uni-freiburg.de/wikidata/BkGzpj

tuukka commented 3 months ago

2. In General, QLever is currently not good at optimizing OPTIONALs, we always fully evaluate the content of the OPTIONAL and then join it with Everything that stands before it in the query.

By the way, I think this is something that a lot of queries currently suffer from and it seems it's not easy to work around.

@joka921 Could a solution to this be prioritized? For example, we add optional labels and other supplementary information as an afterthought with the intuition that for a reasonable number of results, it shouldn't be a lot of additional computation. I see #314 is related but this one should be much easier - more or less the same as sending VALUES to a SERVICE clause? Should we open a separate issue?

tarcisiotmf commented 3 months ago

@joka921 Thank you very much for the clarification about the current QLever limitations. Thanks, @tuukka, for your interest in this issue; it has been really helpful!

Please do as you think most appropriate (closing this issue and opening new issues). In my opinion, three different issues resulted from this one:

joka921 commented 4 weeks ago

Hi, there is an update on your various issues: QLever has supported the SERVICE/OPTIONAL optimization required for your query for a while now. And as of two minutes ago, QLever also supports blank nodes in the result of SERVICE requests. Note that this requires rebuilding your index. Please try out whether your original query works with the current master (the Docker image will need some hours until it is updated; the latest relevant PR that was merged was #1504).