ad-freiburg / qlever

Very fast SPARQL Engine, which can handle very large knowledge graphs like the complete Wikidata, offers context-sensitive autocompletion for SPARQL queries, and allows combination with text search. It's faster than engines like Blazegraph or Virtuoso, especially for queries involving large result sets.
Apache License 2.0

handling the mul "language" #1475

Open pfps opened 4 weeks ago

pfps commented 4 weeks ago

Wikidata is adding a "mul" language, to be used for labels, in particular, when many languages have the same label for an item.

I think that this means that using @en@rdfs:label will not work well, as many items will not have an en label, but instead a mul label.

Is it possible to string this construct together, so that @en@mul@rdfs:label will get the en label if there is one and the mul label otherwise? Or is there some other construct that would work (aside from the three-line construct that gets both and does a COALESCE)?

hannahbast commented 3 weeks ago

@pfps Can you give an example? What is the semantics of FILTER(LANG(?literal) = "en") when ?literal has the language tag @mul? And what is the motivation for this?

pfps commented 3 weeks ago

Wikidata is trying to cut down on the number of triples in the RDF dump. One thing that contributes to the large number of triples is repeated labels, e.g., for https://www.wikidata.org/wiki/Q892 where you can see the repeated labels. At https://www.wikidata.org/wiki/Q42 you can see the new way, with a mul "language" label (showing up under "default for all languages") and many of the other languages just using that. (The grey ones.)

What this means is that to get the English label for an item, one has to do something like

    OPTIONAL { ?x rdfs:label ?xLabelm . FILTER (lang(?xLabelm) = "mul") }
    OPTIONAL { ?x rdfs:label ?xLabele . FILTER (lang(?xLabele) = "en") }
    BIND (COALESCE(?xLabele, ?xLabelm) AS ?xLabel)

The issue is that there will often not be an "en" label if there is a "mul" label.

I'm not saying that this is a good thing at all.
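
For readers less familiar with COALESCE, the intended fallback (take the "en" label if one exists, otherwise the "mul" label) can be sketched outside SPARQL. This is only an illustration of the semantics being asked for: the function name `pick_label` and the dictionary representation of labels are assumptions made for the example, not anything QLever or Wikidata provides.

```python
from typing import Dict, Optional


def pick_label(
    labels: Dict[str, str],
    preferred: str = "en",
    fallback: str = "mul",
) -> Optional[str]:
    """Return the label for `preferred` if present, else the one for `fallback`.

    `labels` maps a language tag to a label string (hypothetical in-memory
    representation of one item's rdfs:label values). Returns None when
    neither tag is present, mirroring an unbound ?xLabel in SPARQL.
    """
    if preferred in labels:
        return labels[preferred]
    return labels.get(fallback)


# Made-up labels, purely for illustration.
only_mul = {"mul": "Example"}
both = {"mul": "Example", "en": "Example (English)"}
neither = {"de": "Beispiel"}

print(pick_label(only_mul))
print(pick_label(both))
print(pick_label(neither))
```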

tuukka commented 3 weeks ago

> I'm not saying that this is a good thing at all.

As I understand it, "mul" is being introduced because of Wikidata's internal reasons, with no other way forward found regarding Wikidata's scalability. I don't think anyone wanted to break compatibility with all the existing queries and tools, but this is the current situation. Also, my current impression is that the semantics haven't been fully figured out and it will depend on how Wikidata's editors will start to use this new feature in the software.

One way to handle this in QLever is to preprocess the dumps by copying "mul" labels to "en" labels where there isn't one already. This would restore compatibility with existing queries. (The weird thing is that you won't know which of "mul" labels will work in English and which won't, but apparently this is as designed. There may be some useful heuristics such as "copy the labels only if the writing system is Latin or the item is an instance of Q5 (human).")
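
A minimal sketch of that preprocessing idea, assuming the dump is available as N-Triples lines and that only `rdfs:label` triples need to be handled. The function name `add_en_from_mul` and the regular expression are illustrative assumptions, not part of QLever. The sketch also copies every "mul" label unconditionally and does not implement the heuristics mentioned above.

```python
import re
from typing import Iterable, List

RDFS_LABEL = "<http://www.w3.org/2000/01/rdf-schema#label>"

# Matches lines of the form: <subject> <rdfs:label> "literal"@lang .
LABEL_RE = re.compile(
    r'^(<[^>]+>)\s+'
    + re.escape(RDFS_LABEL)
    + r'\s+("(?:[^"\\]|\\.)*")@([A-Za-z0-9-]+)\s*\.\s*$'
)


def add_en_from_mul(lines: Iterable[str]) -> List[str]:
    """Copy each "mul" label to an "en" label where the subject has no "en" label.

    Holds all lines in memory for simplicity; a full Wikidata dump would
    need a streaming two-pass variant instead.
    """
    lines = list(lines)

    # Pass 1: collect subjects that already have an "en" label.
    has_en = set()
    for line in lines:
        m = LABEL_RE.match(line)
        if m and m.group(3) == "en":
            has_en.add(m.group(1))

    # Pass 2: keep every line, and add an "en" copy after each "mul" label
    # whose subject has no "en" label of its own.
    out = []
    for line in lines:
        out.append(line)
        m = LABEL_RE.match(line)
        if m and m.group(3) == "mul" and m.group(1) not in has_en:
            subject, literal = m.group(1), m.group(2)
            out.append(f"{subject} {RDFS_LABEL} {literal}@en .")
    return out


# Made-up triples, purely for illustration.
example = [
    f'<http://www.wikidata.org/entity/Q1> {RDFS_LABEL} "Foo"@mul .',
    f'<http://www.wikidata.org/entity/Q2> {RDFS_LABEL} "Bar"@mul .',
    f'<http://www.wikidata.org/entity/Q2> {RDFS_LABEL} "Bar (en)"@en .',
]
for triple in add_en_from_mul(example):
    print(triple)
```

Here only Q1 gets a copied "en" label, since Q2 already has one. Existing queries using `@en@rdfs:label` would then keep working against the preprocessed data, at the cost of no longer distinguishing original "en" labels from copied ones.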

The other way is to try to do the same as WDQS and keep a representation of Wikidata's internal model. This is useful for "maintenance" queries by people and tools that edit Wikidata ("I want to see all the items where the en label and mul label disagree."). To keep query writing practical, QLever would need some new syntax. It might make sense to support the same label service as WDQS (for compatibility and for query performance).

(Now that I think of it, you could combine these two approaches by providing yet another language code "en without mul" for those queries that want to keep en and mul separate.)