ad-freiburg / qlever

Very fast SPARQL Engine, which can handle very large knowledge graphs like the complete Wikidata, offers context-sensitive autocompletion for SPARQL queries, and allows combination with text search. It's faster than engines like Blazegraph or Virtuoso, especially for queries involving large result sets.

Load SDC dump to Wikidata endpoint #985

Open tuukka opened 1 year ago

tuukka commented 1 year ago

Have you considered supporting SDC (Structured Data on Commons) yet? It is the Wikibase instance holding the metadata about images in Wikimedia Commons, and it uses Wikidata items and properties as its vocabulary. In practice, it extends Wikidata with more images and depiction information.

The support might be as easy as loading the SDC dump into the Wikidata endpoint. Alternatively, there could be a separate SDC endpoint, but it would also need to contain (a subset of) Wikidata.

The RDF dumps are available here: https://dumps.wikimedia.org/other/wikibase/commonswiki/

More on SDC: https://commons.wikimedia.org/wiki/Commons:Structured_data

EDIT: Documentation of the triples in the dump: https://www.mediawiki.org/wiki/Wikibase/Indexing/RDF_Dump_Format#MediaInfo and https://www.mediawiki.org/wiki/Extension:WikibaseMediaInfo/RDF_mapping

hannahbast commented 1 year ago

@tuukka Thanks for the suggestion. I am downloading it right now (that takes a few hours) and will build a QLever instance for it (that will take a few more hours). Looking forward to seeing what's in there, especially since it appears to be quite big (37 GB bz2-compressed).

Do you know why the WCQS does not have unauthenticated access like the WDQS does?

And can you provide one or two useful example queries?

tuukka commented 1 year ago

Do you know why the WCQS does not have unauthenticated access like the WDQS does?

As I understand it: performance reasons - WMF is unwilling to provide more endpoints while there is no solution to the performance needs of the regular Wikidata Query Service either.

tuukka commented 1 year ago

And can you provide one or two useful example queries?

Here are some WCQS example queries from the community: https://commons.wikimedia.org/wiki/Commons:SPARQL_query_service/queries/examples

To start with a comparison, in Wikidata, you get image(s) of Douglas Adams like this:

SELECT ?image {
    wd:Q42 wdt:P18 ?image . # Douglas Adams
}

The result is e.g. image http://commons.wikimedia.org/wiki/Special:FilePath/Douglas%20adams%20portrait%20cropped.jpg

In SDC, you can get the image above and all other images depicting Douglas Adams like this:

SELECT ?file ?image {
    ?file wdt:P180 wd:Q42 . # depicts: Douglas Adams
    ?file schema:url ?image .
}

And the result is e.g. file https://commons.wikimedia.org/entity/M10031710 and image http://commons.wikimedia.org/wiki/Special:FilePath/Douglas%20adams%20portrait%20cropped.jpg - the same image URL as above.

Combining ontology information from Wikidata, you can query e.g. all quality images depicting any hummingbird species: [original source]

SELECT ?file ?image {
    ?species wdt:P171/wdt:P171* wd:Q43624. # parent taxon: hummingbird
    ?file wdt:P180 ?species . # depicts
    ?file wdt:P6731 wd:Q63348069 . # Commons quality assessment: Commons quality image
    ?file schema:url ?image .
}
hannahbast commented 1 year ago

The instance is up and running now (it took < 2 h to build it). Here are links to your two example queries:

https://qlever.cs.uni-freiburg.de/wikimedia-commons/4TOZwl

https://qlever.cs.uni-freiburg.de/wikimedia-commons/MyAdzj

tuukka commented 1 year ago

Wow, thank you! I didn't realise you had already implemented federated queries with the SERVICE keyword too.

I've now let some people at the Wikimedia Hackathon know about this (unfortunately I couldn't attend myself); this can be very valuable to everyone building tools for SDC.

The holy grail application of this would be faceted search - do you have any tips regarding that? I found ql:has-predicate; is that what we should build on top of? And you wouldn't happen to have a UI similar to this already? :grin: https://github.com/joseignm/GraFa

What I have so far:

    Property counts: https://qlever.cs.uni-freiburg.de/wikidata/XBe4M8
    Object counts: https://qlever.cs.uni-freiburg.de/wikidata/zu5gUm
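
For illustration, a minimal sketch of the kind of property-counts query I have in mind on top of ql:has-predicate (the ql: prefix IRI and the painting facet base wd:Q3305213 are just assumptions for the example):

PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX ql: <http://qlever.cs.uni-freiburg.de/builtin-functions/>
SELECT ?predicate (COUNT(?item) AS ?count) WHERE {
    ?item wdt:P31 wd:Q3305213 .         # current facet selection: paintings
    ?item ql:has-predicate ?predicate . # all predicates occurring on ?item
}
GROUP BY ?predicate
ORDER BY DESC(?count)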

hannahbast commented 1 year ago

Isn't the context-sensitive autocompletion of the QLever UI doing this (and much more)?

For example, if you go to https://qlever.cs.uni-freiburg.de/wikidata you can

  1. Type S and hit Return to get the SELECT * WHERE { ... } query template.
  2. Type a variable name, for example subject
  3. Type any prefix of instance of (or any other alias of wdt:P31) and select wdt:P31/wdt:P279* from the list of suggestions
  4. Type the prefix of any class (for example, per for Person) and select from the list of suggestions
  5. Execute the query

You can incrementally construct arbitrary queries that way.
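
For example, following the steps above with human as the class yields a query along these lines (wd:Q5 chosen just as an example):

PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT * WHERE {
    ?subject wdt:P31/wdt:P279* wd:Q5 . # instance of (a subclass of): human
}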

hannahbast commented 1 year ago

PS: You can also take your query and extend it by a prefix filter, like so (prefix filters are very efficient in QLever):

https://qlever.cs.uni-freiburg.de/wikidata/nN9IDv

PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX wikibase: <http://wikiba.se/ontology#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?object (SAMPLE(?object_label) AS ?label) (COUNT(?object) as ?count) WHERE {
  ?item wdt:P18 ?image .
  ?item wdt:P31/wdt:P279* wd:Q838948 .
  ?item wdt:P180 ?object .
  ?object rdfs:label ?object_label .
  FILTER (LANG(?object_label) = "en") .
  FILTER REGEX(STR(?object_label), "^per")
}
GROUP BY ?object ?object_label 
ORDER BY DESC(?count)
tuukka commented 1 year ago

Isn't the context-sensitive autocompletion of the QLever UI doing this (and much more)?

We mostly have in mind end users who don't understand SPARQL :grin: But yes, the functionality is more or less there in the current editor - that's how I figured it should be feasible. Although I don't see the counts for the autocompletion candidates in the UI :thinking:

PS: You can also take your query and extend it by a prefix filter

Good point. I think I need to add a LIMIT so that I don't get too many options client-side, and use a server-side prefix filter like that instead.

I think I'll first add simple, non-faceted depictions to Wikidocumentaries though, e.g. here (earlier based only on Wikidata and text search of Commons): https://wikidocumentaries-demo.wmcloud.org/wikipedia/en/Hummingbird?language=en

hannahbast commented 1 year ago

Can you explain the use case for the faceted search a bit more? What is it that users ultimately want when, for example, they type per in order to find human (Q5), or hum to find hummingbird (Q43624)?

Is the goal just to find the right QID? You can also do that with the search box at the top right of https://www.wikidata.org , right? In our Information Retrieval lecture, we have an exercise where the goal is to build a version of this with fuzzy search (that is, you can make mistakes). Here is a demo: https://qlever.cs.uni-freiburg.de/wikidata-entity-search (for example, type huming).

If that is not the primary goal, what are the subsequent steps?

tuukka commented 1 year ago

In addition to making better query builders, I have in mind using faceted search as a powerful tool for exploring big collections (museums, archives, shops) with potentially spotty metadata.

My example queries above come from the hackathon participants' test case of exploring works of art that have photos available. The property counts show there are 800k such works in Wikidata alone, so I can't go through them one by one. But the counts also give me the idea that I could filter by e.g. the collection, location, author, material, or what is depicted. Or also by e.g. color, but then I wouldn't get that many results. Say I want to choose by what is depicted: next, I can see that I could add a filter to see e.g. the 14k portraits or the 4k horses. I can continue until I (hopefully) see what I want, perhaps after backtracking a few times.

There will be some complications: for example, in the hummingbird case I want to filter by a property path instead of a direct property. But e.g. the Wikidata Query Builder seems to have some knowledge of typical property paths to use ("Include related values in the search").
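
To make that concrete, a rough sketch (prefixes as in the queries above): the direct facet triple would have to be replaced by a path, so that files depicting any subtaxon also match:

SELECT ?file WHERE {
    # direct facet would be: ?file wdt:P180 wd:Q43624 . (only files tagged with the family itself)
    ?file wdt:P180/wdt:P171* wd:Q43624 . # "include related values": any subtaxon of hummingbird
}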

hannahbast commented 1 year ago

Interesting, thanks.

As a matter of fact, our older UIs were all faceted-search UIs, for example: https://broccoli.cs.uni-freiburg.de . You can start with a class (for example, Person or Written Work) and then refine from there using the facets (select a subclass, select an instance, add a property, refine via co-occurring words). Is that the kind of thing you imagine?

Such UIs are easier to use, but limited in the kinds of queries you can ask on the data. That's why we eventually developed QLever. UI-wise, the idea was that it can be useful in two ways:

  1. You can use it to incrementally construct arbitrary SPARQL queries, as done in the QLever UI. That is very powerful, but asks too much of some users.

  2. You can, with little effort, build a special-purpose UI on top of the API. This requires that the suggestions can be computed very efficiently via SPARQL queries themselves, which is the case for QLever (other SPARQL engines are not good at these kinds of queries).

tuukka commented 1 year ago

I knew you were working on the cutting edge, but Broccoli must've been 10 years ahead of its time!

Regarding your second point, do you happen to have an example/code of such a special-purpose UI?

We have now a first very limited but testable implementation of faceted browsing in Wikidocumentaries, see "Depictions from Wikimedia Commons" e.g. here: https://wikidocumentaries-demo.wmcloud.org/wikipedia/en/Birds?language=en

Our code is available here, and any feedback is welcome, especially on how to improve the SPARQL queries: https://github.com/Wikidocumentaries/wikidocumentaries-ui/blob/master/src/components/topic_page/DepictingImages.vue

hannahbast commented 1 year ago

Quick question: Queries like https://qlever.cs.uni-freiburg.de/wikimedia-commons/MdUKbU are coming from you, right?

I am asking because for some reason the contained SERVICE queries all take 5 + epsilon seconds and we don't yet know why (they should be much faster, since the respective queries to the Wikidata instance are fast).

If you want to ask more such queries, I would for now (until we resolve that problem) simply build a joint index for Wikidata and Wikimedia Commons.

And out of curiosity: who is generating the traffic, is it you via tests or is it actual users?

tuukka commented 1 year ago

Sorry for the delay; for some reason I didn't get a notification about your message.

Quick question: Queries like https://qlever.cs.uni-freiburg.de/wikimedia-commons/MdUKbU are coming from you, right?

Right.

I am asking because for some reason the contained SERVICE queries all take 5 + epsilon seconds and we don't yet know why (they should be much faster, since the respective queries to the Wikidata instance are fast).

Good to know. Something that might matter is that I currently make multiple parallel requests; perhaps they interact badly?

If you want to ask more such queries, I would for now (until we resolve that problem) simply build a joint index for Wikidata and Wikimedia Commons.

I wanted to ask about that anyway, since the need for SERVICE makes some query features more difficult to write and some perhaps impossible: if I ask for ql:has-predicate of something inside SERVICE, can it take into account the restrictions imposed by triples outside of the SERVICE clause?
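
For illustration, the kind of query I mean would look roughly like this (the Wikidata endpoint URL is my guess, by analogy with the Commons one):

PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX ql: <http://qlever.cs.uni-freiburg.de/builtin-functions/>
SELECT ?predicate (COUNT(?item) AS ?count) WHERE {
    ?file wdt:P180 ?item .                  # restriction outside SERVICE, on the SDC side
    SERVICE <https://qlever.cs.uni-freiburg.de/api/wikidata> {
        ?item ql:has-predicate ?predicate . # facet over the depicted items' Wikidata predicates
    }
}
GROUP BY ?predicate
ORDER BY DESC(?count)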

Also, if preferable and the necessary scripts / instructions are available, I may be able to set up an instance on a Wikimedia Cloud VPS.

And out of curiosity: who is generating the traffic, is it you via tests or is it actual users?

Probably both, and also bots. If you have User-Agent logs, you should be able to tell apart Googlebot, my dev environment (on Linux Firefox), and actual users.

tuukka commented 1 year ago


Also, if you can log the Origin header of the requests, you will see which queries come from the deployed version at https://wikidocumentaries-demo.wmcloud.org/

And let me know if I should limit the number and/or complexity of the requests I'm sending.

hannahbast commented 1 year ago

Quick update: We found (already yesterday) the reason why the SERVICE queries always took "5 + epsilon" seconds. The respective QLever backend ran inside a Docker container, and it so happened that Docker containers on that particular machine had a five-second latency for any network-related request (probably due to problems with the DNS lookup). As a quick fix, the backend now runs outside of Docker, and the SERVICE queries are as fast as they should be, for example: https://qlever.cs.uni-freiburg.de/wikimedia-commons/fwdZ1M

tuukka commented 1 year ago

I'm trying to finish a "version 0.9" of the UI, but I'm getting a lot of nondeterministic 400 out-of-memory responses. I had a look at the query analyzer, and two things popped out:

  1. Even if there are few items (and files), ?file (wdt:P180|wdt:P921|wdt:P6243|wdt:P195|wdt:P608) ?item is slow. It seems it would be faster to do ?file ?p ?item and filter afterwards (see the sketch below). Is this expected?

  2. Even if there are few files (and images), ?file schema:url ?image is slow:

    INDEX SCAN ?file <url> ?image
    Cols: ?file, ?image
    Size: 87,160,263 x 2 [~ 87,160,263]
    Time: 137ms [~ 87,160,263]

Full query: https://qlever.cs.uni-freiburg.de/wikimedia-commons/dGYCdH
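
The rewrite I have in mind for the first point is roughly this, i.e. one scan restricted by VALUES instead of a union of five scans (prefixes as above):

SELECT ?file ?item WHERE {
    VALUES ?p { wdt:P180 wdt:P921 wdt:P6243 wdt:P195 wdt:P608 }
    ?file ?p ?item .
}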

And more in general, are there any new thoughts regarding this topic (faceted browsing of SDC+Wikidata, and my implementation of it) from your side?

hannahbast commented 1 year ago

I am currently traveling and will try to look at it tonight or tomorrow. Maybe @joka921 can say something about the out of memory responses?

tuukka commented 1 year ago

Thank you @hannahbast!

After I wrote my previous message, the SPARQL endpoint went down and now only responds with 503 Service Unavailable: https://qlever.cs.uni-freiburg.de/api/wikimedia-commons

hannahbast commented 1 year ago

I am sorry, I don't know what happened, but the endpoint is now up again. More tomorrow.

joka921 commented 1 year ago

@tuukka Thanks for your feedback. It seems like your faceted system issues a lot of queries that follow a similar template and only have a small variable part (that is very typical for applications where a frontend internally issues SPARQL queries). The easiest solution would be to identify the building blocks of your queries that

  1. Are part of (almost) every query your system issues and
  2. Are comparatively expensive to compute.

Given the example queries earlier in this thread, it seems like this could be the case for the complete schema:url predicate and for the union (wdt:P180|wdt:P921|wdt:P6243|wdt:P195|wdt:P608). We could then precompute these building blocks and pin them to our subtree cache, so that they don't have to be computed from scratch for every query that uses them. By default we pin, for example, the predicates for the English labels, as they occur in almost every query.
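
As standalone queries (with the usual prefixes), the two candidate building blocks would look like this; this is just a sketch of what we could pin:

# Candidate 1: the complete schema:url predicate
SELECT ?file ?image WHERE {
    ?file schema:url ?image .
}

# Candidate 2: the union over the depicts-like properties
SELECT ?file ?item WHERE {
    ?file (wdt:P180|wdt:P921|wdt:P6243|wdt:P195|wdt:P608) ?item .
}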

Additionally, we could in general try to perform some query engineering (reformulating queries in a way that is equivalent but cheaper to compute; query planning is a hard problem, and sometimes we can help the system).

If you can identify such parts that occur in many of your requests and point them out, then we can try to pin some of them to the cache and see whether this helps your system.