SuLab / WikiGenomesBase

A configurable codebase for launching organism-specific WikiGenomes spinoff applications (e.g. ChlamBase.org). This is a web application framework for creating a model organism database, leveraging the taxonomic, genetic and functional data that has been loaded to Wikidata.org by the Gene Wiki Project.
https://chlambase.org/
MIT License

error retrieving GO terms from Wikidata #221

Closed andrewsu closed 4 years ago

andrewsu commented 4 years ago

on this Chlambase page: https://chlambase.org/organism/471472/gene/CTL0784

we are executing this query, which after formatting corresponds to this Wikidata SPARQL query:

SELECT ?gotermValueLabel ?goID ?gotermValue ?goclass ?reference_retrievedLabel (GROUP_CONCAT(DISTINCT ?reference_stated_label; SEPARATOR = '; ') AS ?reference_stated_label) (GROUP_CONCAT(DISTINCT ?determination; SEPARATOR = ';') AS ?determinationLabel) 
WHERE {
  ?protein wdt:P352 'Q9PJM0'.
  ?protein (p:P680|p:P681|p:P682)+ ?goterm.
  ?goterm pq:P459/rdfs:label ?determination. 
  FILTER(LANG(?determination) = 'en').
  OPTIONAL { 
    ?goterm (prov:wasDerivedFrom/pr:P248)/rdfs:label ?reference_stated_label. 
    FILTER(LANG(?reference_stated_label) = 'en').
  }
  OPTIONAL { 
    ?goterm (prov:wasDerivedFrom/pr:P813) ?reference_retrieved. 
  }
  ?goterm (ps:P680|ps:P681|ps:P682)+ ?gotermValue.
  ?gotermValue wdt:P31 ?goclass.
  ?gotermValue wdt:P686 ?goID.
  SERVICE wikibase:label { bd:serviceParam wikibase:language 'en'. }
} GROUP BY ?gotermValueLabel ?goID ?gotermValue ?goclass ?determinationLabel ?reference_retrievedLabel

which returns the error "Unknown error: java.lang.StackOverflowError".

Removing the two GROUP_CONCAT clauses from the SELECT statement (link to query) results in successful execution for this particular UniProt ID (Q9PJM0). What are those GROUP_CONCAT clauses supposed to do, and should we debug/fix them or remove them?

andrewsu commented 4 years ago

FWIW, similar issue with this query that is also failing:

SELECT (GROUP_CONCAT(?eb) AS ?eb) (GROUP_CONCAT(?rb) AS ?rb) ?pmid (GROUP_CONCAT(?increased) AS ?increased) 
WHERE {
  ?protein wdt:P352 'B0B896'.
  ?protein p:P5572+ ?claim.
  ?claim ps:P5572 ?form.
  ?claim prov:wasDerivedFrom/pr:P248/wdt:P698 ?pmid.
  BIND(IF(?form = wd:Q51955212, '+', '') AS ?eb).
  BIND(IF(?form = wd:Q51955198, '+', '') AS ?rb).
  OPTIONAL {
    ?protein wdt:P1911 ?form.BIND(IF(?form = wd:Q51955212, 'eb', 'rb') AS ?increased).
  }
} GROUP BY ?pmid

andrawaag commented 4 years ago

As per a similar issue on Phabricator, the problem apparently relates to reusing the same variable name for an aggregate's input and its alias. If (GROUP_CONCAT(DISTINCT ?reference_stated_label; SEPARATOR = '; ') AS ?reference_stated_label) is changed to (GROUP_CONCAT(DISTINCT ?reference_stated_label; SEPARATOR = '; ') AS ?reference_stated_label_), the query returns results:

SELECT ?gotermValueLabel ?goID ?gotermValue ?goclass ?reference_retrievedLabel (GROUP_CONCAT(DISTINCT ?reference_stated_label; SEPARATOR = '; ') AS ?reference_stated_label_) (GROUP_CONCAT(DISTINCT ?determination; SEPARATOR = ';') AS ?determinationLabel) 
WHERE {
  ?protein wdt:P352 'Q9PJM0'.
  ?protein (p:P680|p:P681|p:P682)+ ?goterm.
  ?goterm pq:P459/rdfs:label ?determination. 
  FILTER(LANG(?determination) = 'en').
  OPTIONAL { 
    ?goterm (prov:wasDerivedFrom/pr:P248)/rdfs:label ?reference_stated_label. 
    FILTER(LANG(?reference_stated_label) = 'en').
  }
  OPTIONAL { 
    ?goterm (prov:wasDerivedFrom/pr:P813) ?reference_retrieved. 
  }
  ?goterm (ps:P680|ps:P681|ps:P682)+ ?gotermValue.
  ?gotermValue wdt:P31 ?goclass.
  ?gotermValue wdt:P686 ?goID.
  SERVICE wikibase:label { bd:serviceParam wikibase:language 'en'. }
} GROUP BY ?gotermValueLabel ?goID ?gotermValue ?goclass ?determinationLabel ?reference_retrievedLabel

GROUP_CONCAT aggregates multiple values into one cell when the other fields are identical across records. Without GROUP_CONCAT, the query would return multiple records that differ only in the field that would otherwise be concatenated.
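
To make the grouping behaviour concrete, here is a minimal Python sketch of what GROUP_CONCAT(DISTINCT ...) does to a result set. It is only an illustration, not how the Wikidata endpoint implements it; the helper name group_concat, the "_concat" suffix, and the sample rows are all hypothetical.

```python
def group_concat(rows, group_keys, concat_key, separator="; "):
    """Collapse rows sharing the same group_keys into one row, joining the
    distinct concat_key values, roughly like GROUP_CONCAT(DISTINCT ...)."""
    groups = {}
    for row in rows:
        key = tuple(row[k] for k in group_keys)
        values = groups.setdefault(key, [])
        value = row.get(concat_key)
        # Skip unbound values (e.g. from OPTIONAL) and duplicates (DISTINCT).
        if value is not None and value not in values:
            values.append(value)
    # Note: SPARQL does not guarantee the order of concatenated values;
    # this sketch simply keeps first-seen order.
    return [
        {**dict(zip(group_keys, key)), concat_key + "_concat": separator.join(values)}
        for key, values in groups.items()
    ]


# Hypothetical result rows: one GO term with two references, one with a single reference.
rows = [
    {"goID": "GO:0000001", "reference": "Reference A"},
    {"goID": "GO:0000001", "reference": "Reference B"},
    {"goID": "GO:0000002", "reference": "Reference A"},
]
result = group_concat(rows, ["goID"], "reference")
```

With these sample rows, the two records for the first GO term collapse into a single record whose concatenated field lists both references. The output field is deliberately given a name different from the input field.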

andrawaag commented 4 years ago

Likewise, if the GROUP_CONCAT aliases are given names that differ from the variables they aggregate, the query returns results.

SELECT (GROUP_CONCAT(?eb) AS ?ebs) (GROUP_CONCAT(?rb) AS ?rbs) ?pmid (GROUP_CONCAT(?increased) AS ?increased_values) 
WHERE {
  ?protein wdt:P352 'B0B896'.
  ?protein p:P5572+ ?claim.
  ?claim ps:P5572 ?form.
  ?claim prov:wasDerivedFrom/pr:P248/wdt:P698 ?pmid.
  BIND(IF(?form = wd:Q51955212, '+', '') AS ?eb).
  BIND(IF(?form = wd:Q51955198, '+', '') AS ?rb).
  OPTIONAL {
    ?protein wdt:P1911 ?form.BIND(IF(?form = wd:Q51955212, 'eb', 'rb') AS ?increased).
  }
} GROUP BY ?pmid

results

Apparently, it used to be possible to reuse the input variable's name as the name of the aggregated variable, but an update seems to have ended that behavior.

The related Phabricator task discusses reintroducing it.
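
Since the failure comes from an aggregate alias that repeats the name of the variable being aggregated, a small check could flag such queries before they are sent. The following is a rough Python sketch under that assumption: reused_aliases is a hypothetical helper, it relies on a naive regular expression rather than a real SPARQL parser, and the two sample strings are shortened SELECT clauses modelled on the queries in this thread.

```python
import re

# Matches "(AGG(args) AS ?alias)" for common SPARQL aggregates.
# Naive: assumes the aggregate's arguments contain no nested parentheses.
AGGREGATE_ALIAS = re.compile(
    r"\(\s*(?:GROUP_CONCAT|COUNT|SUM|MIN|MAX|AVG|SAMPLE)\s*"
    r"\((?P<args>[^()]*)\)\s+AS\s+\?(?P<alias>\w+)\s*\)",
    re.IGNORECASE,
)


def reused_aliases(query):
    """Return aliases that also appear as a variable inside their own aggregate."""
    reused = []
    for match in AGGREGATE_ALIAS.finditer(query):
        inner_vars = set(re.findall(r"\?(\w+)", match.group("args")))
        if match.group("alias") in inner_vars:
            reused.append(match.group("alias"))
    return reused


# Shortened SELECT clauses based on the failing and working queries in this thread.
failing = "SELECT (GROUP_CONCAT(?eb) AS ?eb) (GROUP_CONCAT(?rb) AS ?rb) ?pmid WHERE { }"
working = "SELECT (GROUP_CONCAT(?eb) AS ?ebs) (GROUP_CONCAT(?rb) AS ?rbs) ?pmid WHERE { }"

failing_hits = reused_aliases(failing)
working_hits = reused_aliases(working)
```

The check reports the eb and rb aliases for the first string and nothing for the second, which matches the renaming that made the queries work here.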

andrewsu commented 4 years ago

fixed in d46100c3c88445365d843ca5586144dccc8fbe90 and 56cb6815201266d2590212beab957737adfc1290