digipres / sentinel

The Sentinel watches various data sources and updates digipres.org
Apache License 2.0

Review Wikidata aggregation to check the format count is accurate #13

Closed anjackson closed 2 years ago

anjackson commented 2 years ago

The Wikidata aggregation appears to generate a denormalised listing, i.e. if a given format has multiple values for some property (extensions? signatures?) then there is a separate record for each value, all sharing the same ID. For example, if you look at the query in question:

https://query.wikidata.org/#%23%20Return%20all%20file%20format%20records%20from%20Wikidata.%0A%23%0Aselect%20distinct%20%3Furi%20%3FuriLabel%20%3Fpuid%20%3Fextension%20%3Fmimetype%20%3FencodingLabel%20%3FreferenceLabel%20%3Fdate%20%3FrelativityLabel%20%3Foffset%20%3Fsig%0Awhere%0A%7B%0A%20%20%3Furi%20wdt%3AP31%2Fwdt%3AP279%2a%20wd%3AQ235557.%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%23%20Return%20records%20of%20type%20File%20Format.%0A%20%20optional%20%7B%20%3Furi%20wdt%3AP2748%20%3Fpuid.%20%20%20%20%20%20%7D%20%20%20%20%20%20%20%20%20%20%23%20PUID%20is%20used%20to%20map%20to%20PRONOM%20signatures%20proper.%0A%20%20optional%20%7B%20%3Furi%20wdt%3AP1195%20%3Fextension.%20%7D%0A%20%20optional%20%7B%20%3Furi%20wdt%3AP1163%20%3Fmimetype.%20%20%7D%0A%20%20optional%20%7B%20%3Furi%20p%3AP4152%20%3Fobject%3B%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%23%20Format%20identification%20pattern%20statement.%0A%20%20%20%20optional%20%7B%20%3Fobject%20pq%3AP3294%20%3Fencoding.%20%20%20%7D%20%20%20%20%20%23%20We%20don%27t%20always%20have%20an%20encoding.%0A%20%20%20%20optional%20%7B%20%3Fobject%20ps%3AP4152%20%3Fsig.%20%20%20%20%20%20%20%20%7D%20%20%20%20%20%23%20We%20always%20have%20a%20signature.%0A%20%20%20%20optional%20%7B%20%3Fobject%20pq%3AP2210%20%3Frelativity.%20%7D%20%20%20%20%20%23%20Relativity%20to%20beginning%20or%20end%20of%20file.%0A%20%20%20%20optional%20%7B%20%3Fobject%20pq%3AP4153%20%3Foffset.%20%20%20%20%20%7D%20%20%20%20%20%23%20Offset%20relatve%20to%20the%20relativity.%0A%20%20%20%20optional%20%7B%20%3Fobject%20prov%3AwasDerivedFrom%20%3Fprovenance%3B%0A%20%20%20%20%20%20%20optional%20%7B%20%3Fprovenance%20pr%3AP248%20%3Freference%3B%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20pr%3AP813%20%3Fdate.%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%7D%0A%20%20%20%20%7D%0A%20%20%7D%0A%20%20service%20wikibase%3Alabel%20%7B%20bd%3AserviceParam%20wikibase%3Alanguage%20%22%5BAUTO_LANGUAGE%5D%2C%20%3C%3Clang%3E%3E%22.%20%7D%0A%7D%0Aorder%20by%20%3Furi

Then the same Q######## identifiers appear on multiple lines. The current importer may not be handling this correctly. It should gather records by ID and assemble a list of extensions/mimetypes for each ID.
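The gathering step described above could look roughly like the sketch below. It is not the Sentinel's actual importer code: it assumes the query results have already been flattened into plain dicts keyed by the query's variable names (`uri`, `uriLabel`, `puid`, `extension`, `mimetype`), and the function name `aggregate_by_id` and the output field names are made up for illustration.

```python
# Hypothetical sketch: collapse denormalised query rows into one record per ID.
# Assumes each row is a plain dict keyed by the SPARQL variable names.

# Query column -> name of the multi-valued field in the aggregated record.
MULTI_VALUED = {
    "extension": "extensions",
    "mimetype": "mimetypes",
    "puid": "puids",
}


def aggregate_by_id(rows):
    """Return a dict mapping each Q-identifier to a single merged record."""
    formats = {}
    for row in rows:
        # The entity URI ends with the Q-identifier, e.g. ".../entity/Q123".
        qid = row["uri"].rsplit("/", 1)[-1]
        if qid not in formats:
            formats[qid] = {
                "id": qid,
                "name": row.get("uriLabel"),
                "extensions": set(),
                "mimetypes": set(),
                "puids": set(),
            }
        record = formats[qid]
        # Every row may contribute a new value, so collect into sets
        # rather than keeping only the first (or last) row seen.
        for column, field in MULTI_VALUED.items():
            value = row.get(column)
            if value:
                record[field].add(value)
    return formats
```

With rows like these (invented values, not real Wikidata entries), two lines for the same ID end up as one record carrying both extensions:

```python
rows = [
    {"uri": "http://www.wikidata.org/entity/Q1", "uriLabel": "Example", "extension": "ex1"},
    {"uri": "http://www.wikidata.org/entity/Q1", "uriLabel": "Example", "extension": "ex2"},
]
merged = aggregate_by_id(rows)
print(sorted(merged["Q1"]["extensions"]))
```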

anjackson commented 2 years ago

I think this is actually manifesting as file extensions (and possibly MIME types) getting dropped, because we end up with one record per ID.

anjackson commented 2 years ago

Seems to be more accurate now, with only minor discrepancies where file extensions are malformed.

ross-spencer commented 2 years ago

I was worried this was the SPARQL query for a moment! 😉

Testing with Q100243790 and Q1023647 in Siegfried, the extensions and MIME types look okay.

anjackson commented 2 years ago

I stole the query from Siegfried so it should work!

The issue is in my post-processing. I should perhaps use roy directly instead of having my own fetcher/normaliser, but it's not trivial to switch (see #15).

ross-spencer commented 2 years ago

I know, I wrote the query (hence the concern!).

The issue certainly looks complicated. There are a handful of reasons I don't think you're going in the wrong direction working with the WDQS output in Python, but perhaps it's a useful feature in Siegfried. Keep an eye out for linting getting in the way of stats: https://github.com/richardlehane/siegfried/wiki/Wikidata-identifier#linting and perhaps inspect the Wikidata module in more detail: https://pkg.go.dev/github.com/richardlehane/siegfried@v1.9.4/pkg/wikidata (it can theoretically be used independently, or more functions/structures can be exposed to any potential callers since it has done most of the work). Also, Fido has it on its roadmap, so something in Python is going to appear at some point.

Happy to talk more next week if you're interested.

One concern I have about your issue #15 is this item: "modify the wikidata.sig build so the Archivematica extensions can be omitted (like -pronom)". Those extensions should be omitted anyway, so is that a bug in Siegfried that we need to correct?

anjackson commented 2 years ago

Ah right, thanks for that. I have just been filtering out results that don't have at least a file extension or MIME type, rather than doing proper linting of records. I'd appreciate talking over some of this with you next week if we get the chance!
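For reference, the filtering described above amounts to something like the following minimal sketch. It assumes aggregated records are dicts with `extensions` and `mimetypes` collections; those field names and both function names are hypothetical, and this is not the linting that Siegfried performs.

```python
# Hypothetical sketch: keep only records that carry at least one file
# extension or MIME type. This is a coarse filter, not proper linting.

def has_identifying_metadata(record):
    """True if the record has at least one extension or MIME type."""
    return bool(record.get("extensions") or record.get("mimetypes"))


def filter_records(records):
    """Drop records with neither an extension nor a MIME type."""
    return [record for record in records if has_identifying_metadata(record)]
```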

I don't know if that's a Siegfried bug, but it seems weird so I guess I'll raise it.

EDIT: Also, I added some notes on possible importer improvements in #16.