ad-freiburg / aqqu-frontend

An easy to use frontend for Aqqu
Apache License 2.0

Images for tooltips from wikidata and wikipedia #1

Open graue70 opened 4 years ago

graue70 commented 4 years ago

The image in the wikipedia infobox is not always from wikidata. See https://www.wikidata.org/wiki/Q16742294 and https://www.wikidata.org/wiki/Q16742291, which might be helpful in determining differences.

As explained here, the wikipedia image can be queried in the following way: https://en.wikipedia.org/w/api.php?action=query&prop=pageimages&titles=Jaguar&pithumbsize=500&format=json&formatversion=2.

By default, it returns only images with a free license. For Lord of the Rings, the image is not free, so it is not returned. However, it is possible to get any image (including non-free ones) with the additional argument pilicense=any, as in https://en.wikipedia.org/w/api.php?action=query&prop=pageimages&titles=The_Lord_of_the_Rings:_The_Fellowship_of_the_Ring&pithumbsize=500&format=json&formatversion=2&pilicense=any. I don't know what the licensing means for Aqqu tooltips, but there is more info on that here.
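
A minimal Python sketch of building such a request and reading the thumbnail out of the reply. The function names are made up for illustration, and the sample reply is a hand-written stand-in shaped like a formatversion=2 response (the image URL is a placeholder), not real API output.

```python
from urllib.parse import urlencode

API_URL = "https://en.wikipedia.org/w/api.php"


def build_pageimages_url(title, thumb_size=500, include_non_free=False):
    """Build a pageimages query URL for a single Wikipedia page title."""
    params = {
        "action": "query",
        "prop": "pageimages",
        "titles": title,
        "pithumbsize": thumb_size,
        "format": "json",
        "formatversion": 2,
    }
    if include_non_free:
        # Without this argument only freely licensed images are returned.
        params["pilicense"] = "any"
    return API_URL + "?" + urlencode(params)


def thumbnail_url(response):
    """Return the thumbnail URL of the first page, or None if it has none."""
    pages = response.get("query", {}).get("pages", [])
    if not pages:
        return None
    return pages[0].get("thumbnail", {}).get("source")


# Hand-written stand-in for a reply; the image URL is a placeholder.
sample_response = {
    "batchcomplete": True,
    "query": {
        "pages": [
            {
                "title": "Jaguar",
                "thumbnail": {
                    "source": "https://example.org/jaguar.jpg",
                    "width": 500,
                    "height": 333,
                },
            }
        ]
    },
}

print(build_pageimages_url("Jaguar"))
print(thumbnail_url(sample_response))
```

A page whose only image is non-free simply has no thumbnail entry unless pilicense=any is set, which is why thumbnail_url returns None instead of raising.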

It is possible to query multiple images with one query: https://en.wikipedia.org/w/api.php?action=query&prop=pageimages&titles=The_Lord_of_the_Rings:_The_Fellowship_of_the_Ring|Sun|Jaguar&pithumbsize=500&format=json&formatversion=2&pilicense=any.
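
For a whole mapping file, the titles would have to be sent in batches. The sketch below assumes a limit of 50 titles per request for ordinary clients and uses made-up helper names; it only builds the URLs and parses a reply, and does not send anything.

```python
from urllib.parse import urlencode

API_URL = "https://en.wikipedia.org/w/api.php"
BATCH_SIZE = 50  # assumed per-request title limit for ordinary clients


def batched_urls(titles, batch_size=BATCH_SIZE, include_non_free=True):
    """Build one pageimages URL per batch of titles, joined with '|'."""
    urls = []
    for start in range(0, len(titles), batch_size):
        batch = titles[start:start + batch_size]
        params = {
            "action": "query",
            "prop": "pageimages",
            "titles": "|".join(batch),
            "pithumbsize": 500,
            "format": "json",
            "formatversion": 2,
        }
        if include_non_free:
            params["pilicense"] = "any"
        # urlencode percent-encodes the '|' separator as %7C.
        urls.append(API_URL + "?" + urlencode(params))
    return urls


def thumbnails_by_title(response):
    """Map each returned page title to its thumbnail URL.

    Pages without a thumbnail are skipped.
    """
    pages = response.get("query", {}).get("pages", [])
    return {
        page["title"]: page["thumbnail"]["source"]
        for page in pages
        if "thumbnail" in page
    }
```

The titles in the reply may come back in normalized form (for example with spaces instead of underscores), so matching them to the requested titles probably needs the normalization info from the reply as well.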

Maybe one option would be to use the following query:

PREFIX wikibase: <http://wikiba.se/ontology#>
PREFIX schema: <http://schema.org/>
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT ?x ?m ?image ?sitelinks WHERE {
  ?m schema:about ?x .
  ?m @en@schema:abstract ?abstract .
  OPTIONAL { ?x wdt:P18 ?image . }
  ?m schema:isPartOf <https://en.wikipedia.org/> .
  ?article schema:about ?x .
  ?article wikibase:sitelinks ?sitelinks .
  FILTER (?sitelinks >= "15"^^<http://www.w3.org/2001/XMLSchema#int>)
} ORDER BY DESC(?sitelinks)

and then loop over the results without an image and use the wikipedia image only for those.
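
That loop could look roughly like the Python sketch below. The row layout (QID, Wikipedia title, optional Wikidata image) and the function name are assumptions for illustration, and the Wikipedia lookup is a plain dict here instead of an API call.

```python
def fill_missing_images(rows, wikipedia_images):
    """Pick one image per entity, falling back to Wikipedia when needed.

    rows: iterable of (qid, wikipedia_title, wikidata_image_or_None),
          e.g. parsed from the SPARQL result (assumed layout).
    wikipedia_images: dict mapping Wikipedia titles to image URLs.
    """
    images = {}
    for qid, title, wikidata_image in rows:
        if images.get(qid):
            # The entity already has an image; later rows for it are ignored.
            continue
        images[qid] = wikidata_image or wikipedia_images.get(title)
    return images


rows = [
    ("Q1", "A", "wikidata_a.jpg"),
    ("Q2", "B", None),
    ("Q3", "C", None),
]
wikipedia_images = {"B": "wikipedia_b.jpg"}
print(fill_missing_images(rows, wikipedia_images))
```

Entities with neither a Wikidata nor a Wikipedia image stay in the result with None, so they can be counted or dropped later.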

On the other hand, maybe one should prefer the wikipedia image over the wikidata image. For the example of Mexico, wdt:P18 yields a bunch of images, but an image of the flag (P41) would probably be more useful. Wikipedia uses the flag in this case.

In either case, the script or command to produce the file qid_to_wikipedia.tsv should be included in the repo for better reproducibility, especially regarding entities with more than one image.

graue70 commented 4 years ago

Ignoring wikipedia completely for the moment, these are four possible ways to express the image in the SPARQL query from above:

?x wdt:P18 ?image .
?x wdt:P18|wdt:P109|wdt:P14|wdt:P1442|wdt:P154|wdt:P1543|wdt:P158|wdt:P1766|wdt:P1801|wdt:P2096|wdt:P2713|wdt:P2716|wdt:P2910|wdt:P3311|wdt:P3383|wdt:P3451|wdt:P367|wdt:P41|wdt:P4291|wdt:P4640|wdt:P5252|wdt:P5775|wdt:P7407|wdt:P7415|wdt:P94|wdt:P996 ?image .
OPTIONAL { ?x wdt:P18 ?image . }
OPTIONAL { ?x wdt:P18|wdt:P109|wdt:P14|wdt:P1442|wdt:P154|wdt:P1543|wdt:P158|wdt:P1766|wdt:P1801|wdt:P2096|wdt:P2713|wdt:P2716|wdt:P2910|wdt:P3311|wdt:P3383|wdt:P3451|wdt:P367|wdt:P41|wdt:P4291|wdt:P4640|wdt:P5252|wdt:P5775|wdt:P7407|wdt:P7415|wdt:P94|wdt:P996 ?image . }

One still needs to deal with duplicates, because one entity can have multiple images. Some kind of preference order would be good. That would be possible with the BIND(IF(BOUND())) construct from https://www.wikidata.org/wiki/Wikidata:SPARQL_tutorial#BIND,_BOUND,_IF, but qlever does not support it at the moment.
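
Until then, the preference could be applied after the query, in a post-processing step. The Python sketch below assumes the query is changed so that each row also says which property the image came from; the row layout, the function name and the preference order are all illustrative and not taken from the repo.

```python
# Illustrative order only: flag, logo, coat of arms, then the generic image.
PREFERENCE = ["P41", "P154", "P94", "P18"]


def pick_preferred(rows, preference=PREFERENCE):
    """Keep one image per entity, chosen by property preference.

    rows: iterable of (qid, property_id, image) tuples (assumed layout).
    Properties missing from the preference list rank last; among equally
    ranked images the first one seen wins.
    """
    rank = {prop: position for position, prop in enumerate(preference)}
    best = {}
    for qid, prop, image in rows:
        position = rank.get(prop, len(preference))
        if qid not in best or position < best[qid][0]:
            best[qid] = (position, image)
    return {qid: image for qid, (_, image) in best.items()}


rows = [
    ("Q1", "P18", "city.jpg"),
    ("Q1", "P41", "flag.svg"),
    ("Q1", "P18", "landscape.jpg"),
    ("Q2", "P996", "scan.jpg"),
]
print(pick_preferred(rows))
```

This also removes the duplicates, since every entity ends up with exactly one image.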

PS: The list of predicates was generated with this query:

PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX p: <http://www.wikidata.org/prop/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT ?image ?label WHERE {
  wd:P18 wdt:P1659 ?image .
  ?image rdfs:label ?label .
  FILTER langMatches(lang(?label), "en") .
}

flackbash commented 4 years ago

Thanks for the detailed analysis! The current solution is to query the Wikipedia API for images and prefer these over the images retrieved using a SPARQL query with the properties wdt:P18|wdt:P109|wdt:P14|... as listed in your comment. What I have not implemented is a preference order among the images retrieved using the SPARQL query. However, if I grepped correctly, only 21,871 of the 404,840 images (about 5%) in the current qid_to_wikipedia_info.tsv file stem from Wikidata anyway. All other images were retrieved using the Wikipedia API, so this should not be a big problem.

graue70 commented 4 years ago

How did you deal with the license question for wikipedia images?

flackbash commented 4 years ago

Not at all. I skillfully overlooked that part.

So right now, all images are included in the mapping, i.e. the pilicense=any parameter is set. Without this parameter, the final mapping contains 384,879 images instead of 404,840, i.e. 19,961 fewer. This is probably good enough if it saves us the hassle.

From what I understand, Wikipedia can use this non-free content under the fair use policy, which exists in the US but not in the EU (which is probably why the English Wikipedia contains theatrical release posters for films and the German Wikipedia does not). Too bad...