ad-freiburg / qlever

Very fast SPARQL engine that can handle very large knowledge graphs like the complete Wikidata, offers context-sensitive autocompletion for SPARQL queries, and allows combination with text search. It is faster than engines like Blazegraph or Virtuoso, especially for queries involving large result sets.

Augment Wikidata graph with ranking information better than the sitelinks count #676

Open tuukka opened 2 years ago

tuukka commented 2 years ago

Would it be easy to add some ranking information (triples) to the Wikidata endpoint? This has been discussed for years elsewhere (T143424, T174981), but I'm not aware of a query endpoint that would provide this yet. Here are two open-source rankings that I could find:

QRank (pageviews): https://qrank.wmcloud.org/
Danker (PageRank): https://danker.s3.amazonaws.com/index.html
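
For illustration, such a ranking could be attached as one triple per entity. Here's a minimal sketch as a SPARQL update; the predicate is hypothetical and the scores are made up:

PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX ex: <http://example.org/rank/>

INSERT DATA {
  wd:Q42   ex:qrank 1000000 .  # made-up QRank score for Douglas Adams
  wd:Q5582 ex:qrank 2000000 .  # made-up QRank score for Vincent van Gogh
}

A query could then ORDER BY DESC on ex:qrank, just like on the sitelinks count.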

hannahbast commented 2 years ago

Adding triples for ranking would be rather easy, but I have a question:

We always use ^schema:about/wikibase:sitelinks for ranking. This counts the number of Wikimedia pages of an entity and is a very good proxy for popularity (and a much better proxy than, for example, the number of triples an entity is involved in). For example, here is a list of all people in Wikidata ranked by the number of sitelinks: https://qlever.cs.uni-freiburg.de/wikidata/kfJfrG
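
The linked query is presumably along these lines (a sketch based on the description above, not necessarily the exact saved query):

PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX schema: <http://schema.org/>
PREFIX wikibase: <http://wikiba.se/ontology#>

SELECT ?person ?sitelinks WHERE {
  ?person wdt:P31 wd:Q5 .                                # instance of: human
  ?person ^schema:about/wikibase:sitelinks ?sitelinks .  # sitelinks count
}
ORDER BY DESC(?sitelinks)
LIMIT 100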

Have you tried ^schema:about/wikibase:sitelinks or is there anything that you don't like about it?

tuukka commented 2 years ago

I am using the sitelinks count, but I see it as just one metric.

My current use case is reimplementing wikitrivia-generator, which is currently heavy and slow:

See more on the pain here: https://github.com/tom-james-watson/wikitrivia/issues/26#issuecomment-1138336378

hannahbast commented 2 years ago

@tuukka Do you have a demo of what the wikitrivia-generator does? Without yet fully understanding what you want, a viable approach might be:

  1. Get the appropriate subset from Wikidata via a CONSTRUCT query
  2. Build a QLever instance for that subset
  3. Send queries to that instance

Don't be afraid of building and running a QLever instance. In a directory with a TTL file (which could be obtained via a CONSTRUCT query), it's as simple as this, using the qlever script:

. qlever      # Configure
qlever index  # Build index
qlever start  # Start the server
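
For step 1, the CONSTRUCT query could look something like this (a sketch; the properties are just an example of what a trivia game might need):

PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>

CONSTRUCT {
  ?person wdt:P569 ?birthDate .
} WHERE {
  ?person wdt:P31 wd:Q5 ;        # instance of: human
          wdt:P569 ?birthDate .  # date of birth
}

Saving the result as a TTL file gives the kind of input the index step expects.
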
tuukka commented 2 years ago

Here's the original game: https://wikitrivia.tomjwatson.com/

Here's the game data file as produced by wikitrivia-generator (in English, with items that were once generated and never updated, as it's too much hassle): https://wikitrivia-data.tomjwatson.com/items.json

So far, some people seem to have been able to fork the script and run it in their own language with more or less success: Basque, Romanian.

Ideally, the player could pick any language supported by Wikidata, and the game would make a suitable SPARQL query to get a fresh set of up-to-date items for that language, so that no other backend infrastructure would be needed.

You are right, it is also possible to implement this query without using the official QLever instance for now, and this issue could be tagged wishlist :-)

hannahbast commented 2 years ago

Thanks for the explanation, now I understand. For this kind of application, querying a Wikidata SPARQL endpoint from time to time seems to be the method of choice.

But isn't a query like https://qlever.cs.uni-freiburg.de/wikidata/m76Lrg then exactly what you need? It works for any language and takes 20-30 seconds.
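
The query is essentially of this shape (a sketch; the saved query may differ in details):

PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX schema: <http://schema.org/>
PREFIX wikibase: <http://wikiba.se/ontology#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?person ?label ?birthDate ?sitelinks WHERE {
  ?person wdt:P31 wd:Q5 ;                                # instance of: human
          wdt:P569 ?birthDate ;                          # date of birth
          rdfs:label ?label .
  ?person ^schema:about/wikibase:sitelinks ?sitelinks .  # popularity proxy
  FILTER (LANG(?label) = "en")   # swap in any Wikidata language tag
}
ORDER BY DESC(?sitelinks)
LIMIT 1000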