beehind / beehind.github.io

Beehind: pilot workflows to capture prominent bee specimen and their historic and ecological associates
https://beehind.org
Creative Commons Zero v1.0 Universal

search wikidata images by their checksums (content hashes) #14

Open jhpoelen opened 1 year ago

jhpoelen commented 1 year ago

Internally, Wikimedia Commons uses SHA-1 hashes to alert users when duplicate digital content is already available on Wikimedia Commons.

However, as far as I can tell, these sha1 hashes are not yet exposed via structured data by default.

And, methods already exist to annotate digital content with their checksums.

For example, see https://www.wikidata.org/wiki/Q34852 where https://www.wikidata.org/wiki/Property:P4092 is used to document the SHA-2 hash 8de979cbb1db728ef99debac8a516405a2088e4fa2816fda2769856a54029bcd49913a45494ce1cae4096413c49ae7da36f7bc2d20899fb216195b9eb365e55c associated with digital content.

image

jhpoelen commented 1 year ago

Accordingly, I've manually annotated a wikimedia commons entry

https://commons.wikimedia.org/wiki/File:Agapostemon_texanus_killed_by_Peucetia_viridans_-_iNaturalist_56389401.jpg

with its associated checksums in SHA-1, SHA-256, and MD5.
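For anyone wanting to reproduce such annotations, the three checksums can be computed locally before entering them by hand. The following is a minimal sketch using Python's standard `hashlib`; the function name `file_checksums` and its parameters are made up for illustration and are not part of any Wikimedia tooling.

```python
import hashlib


def file_checksums(path, algorithms=("sha1", "sha256", "md5"), chunk_size=1 << 20):
    """Return {algorithm name: hex digest} for the file at `path`.

    The file is read once, in chunks, and every hasher is fed the same bytes,
    so large images do not need to fit in memory.
    """
    hashers = {name: hashlib.new(name) for name in algorithms}
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            for hasher in hashers.values():
                hasher.update(chunk)
    return {name: hasher.hexdigest() for name, hasher in hashers.items()}
```

Calling `file_checksums("some_downloaded_image.jpg")` (the file name is a placeholder) returns a dictionary with one hex digest per algorithm, which can then be pasted into the corresponding checksum statements.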

Screenshot from 2023-05-08 09-23-58

jhpoelen commented 1 year ago

A sample query:

SELECT ?item ?image WHERE {
  ?item wdt:P4092 "85379b346e61c06033a12720155f3bf13d2c6f5946625600f34edace55cb159d693a15aefab9e15691ff2402887985d559951327974206ccf85495e27b9ee56d";
        wdt:P18|wdt:P117 ?image .
}
LIMIT 10

with results obtained via https://query.wikidata.org/#SELECT%20%3Fitem%20%3Fimage%20WHERE%20%7B%0A%20%20%3Fitem%20wdt%3AP4092%20%2285379b346e61c06033a12720155f3bf13d2c6f5946625600f34edace55cb159d693a15aefab9e15691ff2402887985d559951327974206ccf85495e27b9ee56d%22%3B%0A%20%20%20%20%20%20%20%20wdt%3AP18%7Cwdt%3AP117%20%3Fimage%20.%0A%7D%0ALIMIT%2010
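Links of this form can be generated for any checksum by filling the hash into the query text and percent-encoding it into the URL fragment. A minimal sketch, assuming the same properties as the query above (P4092, P18, P117); the helper names `checksum_query` and `query_link` are hypothetical:

```python
import re
from urllib.parse import quote

WDQS_UI = "https://query.wikidata.org/"


def checksum_query(checksum, limit=10):
    """Build the SPARQL text for looking up items by a P4092 checksum value."""
    # Only accept lowercase hex so nothing unexpected ends up inside the query.
    if not re.fullmatch(r"[0-9a-f]+", checksum):
        raise ValueError("checksum must be a lowercase hex string")
    return (
        "SELECT ?item ?image WHERE {\n"
        f'  ?item wdt:P4092 "{checksum}";\n'
        "        wdt:P18|wdt:P117 ?image .\n"
        "}\n"
        f"LIMIT {int(limit)}"
    )


def query_link(sparql):
    """Percent-encode the query and append it as the fragment of the query UI URL."""
    return WDQS_UI + "#" + quote(sparql, safe="")
```

Passing the checksum from the sample query to `query_link(checksum_query(...))` yields a shareable link in the same style as the one above.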

Screenshot from 2023-05-08 09-44-59

jhpoelen commented 1 year ago

Note that structured queries against objects in Wikimedia Commons are still under development. See, for instance, https://diff.wikimedia.org/2020/10/29/sparql-in-the-shadow-of-structured-data-on-commons/ and the referenced https://commons.wikimedia.org/wiki/Commons:Structured_data .

Also, note that annotating checksum properties (see https://www.wikidata.org/wiki/Property:P4092 ) on image properties in Wikidata objects doesn't seem to come naturally, because qualifiers on qualifiers appear to be too much nesting for the Wikidata model.

For instance, adding a checksum (or content hash) for an image that supports a physical interaction ( https://www.wikidata.org/wiki/Q2747101#P129 ) for a specific taxon ( https://www.wikidata.org/wiki/Q2747101 ) appears to be tricky with existing UI editing tools. E.g., it is currently hard to add a "determined by" qualifier (SHA-1 algorithm) to the checksum qualifier for the image related to the physical interaction property.

image

image

jhpoelen commented 1 year ago

It appears that the Wikimedia Commons entities are a more natural fit . . . and some patience is needed before being able to access this structured Commons data, for the reasons stated earlier.

image

jhpoelen commented 1 year ago

So, as far as I can tell, querying Wikimedia Commons images by their checksums is possible, and a dedicated service / data product would have to be created to help answer questions like:

What are the checksums (or content hashes) associated with this Wikimedia Commons entity?

and

Please provide content associated with this content id (or checksum) if you have it. Otherwise, say "mweh, don't have it."
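To make the two questions concrete, here is a minimal in-memory sketch of what such a service might answer. Everything in it is an assumption for illustration: the class name `ChecksumIndex`, its methods, and the idea of holding content in memory are hypothetical, and a real data product would more likely be backed by a Commons dump or an API.

```python
import hashlib

NOT_FOUND = "mweh, don't have it."


class ChecksumIndex:
    """Toy two-way index between entity names and content checksums."""

    def __init__(self, algorithms=("sha1", "sha256", "md5")):
        self.algorithms = tuple(algorithms)
        self._checksums_by_entity = {}
        self._entity_by_checksum = {}
        self._content_by_entity = {}

    def add(self, entity, content):
        """Register `content` (bytes) under `entity` and return its digests."""
        digests = {
            name: hashlib.new(name, content).hexdigest()
            for name in self.algorithms
        }
        self._checksums_by_entity[entity] = digests
        self._content_by_entity[entity] = content
        for digest in digests.values():
            self._entity_by_checksum[digest] = entity
        return digests

    def checksums_for(self, entity):
        """Question 1: which checksums are associated with this entity?"""
        return self._checksums_by_entity.get(entity, NOT_FOUND)

    def content_for(self, checksum):
        """Question 2: return (entity, content) for this checksum, if known."""
        entity = self._entity_by_checksum.get(checksum.lower())
        if entity is None:
            return NOT_FOUND
        return entity, self._content_by_entity[entity]
```

Digests from different algorithms share one lookup table here, which works because their hex lengths differ (32, 40, and 64 characters for MD5, SHA-1, and SHA-256), so a caller does not need to say which algorithm produced a given checksum.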