WDscholia / scholia

Wikidata-based scholarly profiles
https://scholia.toolforge.org

Look into QLever as a potential query engine for Scholia #1774

Open Daniel-Mietchen opened 2 years ago

Daniel-Mietchen commented 2 years ago

Is your feature request related to a problem? Please describe.

Scholia currently queries the Wikidata Query Service, which relies on Blazegraph, which is suspected to fail within the next few years, as per

Describe the solution you'd like

One of the options to address this is to use another query engine, e.g. QLever.

Describe alternatives you've considered

Daniel-Mietchen commented 2 years ago

Over in

we had a brief discussion about QLever, including an example query https://qlever.cs.uni-freiburg.de/wikidata/J8PSek for which I am pasting a screenshot below:

[Screenshot 2022-01-27 at 23-10-35: The QLever SPARQL engine, fast, scalable, with autocompletion and text search]

WolfgangFahl commented 2 years ago

The corresponding query on the Wikidata Query Service takes 2.7 s to run as of 2022-01-28.

With a PDF filter it is slightly slower.

WolfgangFahl commented 2 years ago

The more elaborate query, which should show the author list, times out on Wikidata and fails on QLever.

WolfgangFahl commented 2 years ago

#
# Example Query for 
# https://github.com/WDscholia/scholia/issues/1774
#
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
# Scholarly articles with full text
SELECT ?paper ?paperLabel ?publishedIn ?publishedInLabel ?event ?eventLabel ?fullText ?authors
WHERE 
{
  ?paper wdt:P31 wd:Q13442814.
  ?paper rdfs:label ?paperLabel. 
  filter(lang(?paperLabel) = "en").
  ?paper wdt:P953 ?fullText. 
  ?paper wdt:P953 ?fullText filter (strends(str(?fullText), ".pdf" )). 
  #filter(regex(?fullText, "\\.pdf\\>$" )). 
  ?paper wdt:P1433 ?publishedIn.
  ?publishedIn rdfs:label ?publishedInLabel. 
  filter(lang(?publishedInLabel) = "en" ).
  ?publishedIn wdt:P4745 ?event. 
  ?event rdfs:label ?eventLabel. 
  filter(lang(?eventLabel) = "en"). 
  OPTIONAL
  {    
     SELECT (GROUP_CONCAT(?authorLabel) as ?authors) WHERE {
       ?paper wdt:P50 ?author.
       ?author rdfs:label ?authorLabel filter(lang(?authorLabel) = 'en').
     } GROUP BY ?paper
  }
}

WolfgangFahl commented 2 years ago

The authors query:

#
# test Query for https://github.com/WDscholia/scholia/issues/1774
# WF 2022-01-28
#
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX wikibase: <http://wikiba.se/ontology#>
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
# Authors
SELECT ?work ?workLabel ?author ?authorLabel
WHERE 
{
  ?work wdt:P50 ?author. 
  ?work rdfs:label ?workLabel .
  ?author rdfs:label ?authorLabel. 
}
LIMIT 10

This runs on QLever in about 30 s, with more than 300 million results (limited to 10), while the Wikidata Query Service takes only 0.2 s.

WolfgangFahl commented 2 years ago

Looks like we need a proper set of queries to do a fair comparison and check for compatibility.
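A minimal sketch of what such a comparison could look like, in Python and using only the standard library. The function names (`run_query`, `benchmark`) and the User-Agent string are hypothetical, and the endpoint URLs you pass in are up to you; this is not an existing Scholia or QLever tool. It times each query on each endpoint and records either the number of result rows or the name of the error, so that timeouts and incompatibilities show up in the same table.

```python
import json
import time
import urllib.parse
import urllib.request


def run_query(endpoint, query, timeout=60):
    """Send a SPARQL query via HTTP GET and return the parsed JSON result."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    request = urllib.request.Request(
        url,
        headers={
            "Accept": "application/sparql-results+json",
            # Hypothetical User-Agent; public endpoints usually expect one.
            "User-Agent": "scholia-qlever-benchmark-sketch/0.1",
        },
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def benchmark(queries, endpoints, runner=run_query):
    """Time every query on every endpoint.

    Returns a list of (query name, endpoint name, outcome, seconds), where
    outcome is the number of result rows or the name of the raised exception.
    """
    rows = []
    for query_name, query in queries.items():
        for endpoint_name, endpoint in endpoints.items():
            start = time.perf_counter()
            try:
                result = runner(endpoint, query)
                outcome = len(result["results"]["bindings"])
            except Exception as error:  # failures are part of the comparison
                outcome = type(error).__name__
            elapsed = time.perf_counter() - start
            rows.append((query_name, endpoint_name, outcome, round(elapsed, 3)))
    return rows
```

Passing `runner` as a parameter keeps the timing loop testable without network access; a single cold run per query is of course not a fair benchmark on its own, so repeated runs and cache effects would still need to be considered.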

dpriskorn commented 2 years ago

If we find that QLever is a good alternative for Scholia, I would like to help set it up in the WMC Toolforge Kubernetes cluster :)

Daniel-Mietchen commented 2 years ago

Just for the record, there is a Phabricator ticket, "Evaluate QLever as a time lagging SPARQL backend to offload the BlazeGraph cluster", with overlapping threads and participants.

Daniel-Mietchen commented 2 years ago

> Looks like we need a proper set of queries to do a fair comparison and check for compatibility.

Perhaps some Scholia queries could be part of that benchmarking set.

Daniel-Mietchen commented 2 years ago

> If we find that QLever is a good alternative for Scholia, I would like to help set it up in the WMC Toolforge Kubernetes cluster :)

Perhaps we won't find out whether it can be that alternative if we do not have such test instances to play around with and to test the workflows (e.g. including exports/ dumps and database refreshes).

egonw commented 4 months ago

Testing is easier now, and we should be able to just change this line to test with QLever:

https://github.com/egonw/scholia/blob/8ff64dee13940ad28fd5d7dd97ad4bdc2d2628b4/scholia/query.py#L64
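The content of the linked line is not reproduced in this thread. Assuming it holds the SPARQL endpoint URL, one way to switch engines without editing the source each time would be to read the endpoint from the environment. The sketch below is hypothetical: the variable name `SCHOLIA_SPARQL_ENDPOINT` and the function `get_sparql_endpoint` are made up for illustration and are not part of Scholia.

```python
import os

# Assumed default: the public Wikidata Query Service SPARQL endpoint.
DEFAULT_ENDPOINT = "https://query.wikidata.org/sparql"


def get_sparql_endpoint(environ=os.environ):
    """Return the endpoint set in the (hypothetical) environment variable,
    falling back to the default when it is not set."""
    return environ.get("SCHOLIA_SPARQL_ENDPOINT", DEFAULT_ENDPOINT)
```

With something like this in place, a test instance could be pointed at a QLever endpoint by setting the variable before starting the app, while the default behaviour stays unchanged.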

dpriskorn commented 4 months ago

I found a small bug in the example query above (a missing str() in the regex filter); see the fixed version here: https://qlever.cs.uni-freiburg.de/wikidata/KBDY3n

My guess: the author part times out because of a missing index in QLever; we are doing something the designers have not tested.

WolfgangFahl commented 4 months ago

@dpriskorn Note that #2412 is intended to provide a viable migration path. And yes, we'd like to run a bunch of Wikidata endpoints with different technologies in the Wikimedia Foundation's data center. Who would be our contact for this?

fnielsen commented 4 months ago

Do we have a place to set up a Synia webapp? It could point to that endpoint, and we could test queries more easily.