ffdev-info / wikidp-issues

An issues repository for resolving issues in Wikidata around the records relating to Digital Preservation
GNU General Public License v3.0

ShEx needed for standard signature types in Wikidata #8

Open ross-spencer opened 3 years ago

ross-spencer commented 3 years ago

Description of problem

We need to encode a standard signature as a ShEx (Shape Expressions) schema, and then initiate a discussion with the Wikidata community about what that means for the existing records.

Related issues

A Shape Expression would go some way towards making https://github.com/ross-spencer/WikiDP-Issues/issues/7 more accurate, but it wouldn't improve the overall quality of the signature, i.e. we would still have trouble understanding it as two separate signatures for the same format.

Others

emulatingkat commented 3 years ago

I wrote a schema for file formats with a file format identification pattern. This represents what is described in the Wikidata property in terms of expected qualifiers.

https://www.wikidata.org/wiki/EntitySchema:E237
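To make the idea of "expected qualifiers" concrete, here is a minimal Python sketch of the kind of check such a schema expresses. It is not a transcription of E237: the qualifier labels in `EXPECTED_QUALIFIERS`, the dictionary layout, and the helper name `missing_qualifiers` are all assumptions made for illustration, and the real schema may expect a different set.

```python
# Hypothetical qualifier labels, assumed for illustration only.
# The actual entity schema (E237) may require a different set.
EXPECTED_QUALIFIERS = {"encoding", "relative to", "offset"}


def missing_qualifiers(statement: dict) -> set:
    """Return the expected qualifier labels that one statement lacks."""
    present = set(statement.get("qualifiers", {}))
    return EXPECTED_QUALIFIERS - present


# A simplified, hand-written statement (not a real Wikidata export).
# The value is the well-known PNG magic number in hexadecimal.
statement = {
    "property": "file format identification pattern",
    "value": "89504E470D0A1A0A",
    "qualifiers": {"encoding": "hexadecimal", "offset": 0},
}

print(sorted(missing_qualifiers(statement)))  # ['relative to']
```

A ShEx schema states the same constraint declaratively, so it can be checked against many records at once rather than one dictionary at a time.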

ross-spencer commented 3 years ago

This is great Kat, thank you! I have a related question to follow up with at the weekend (no rush though). Just got to make it to the end of the working week to put it together, nearly there! :slightly_smiling_face:

ross-spencer commented 3 years ago

Hi @emulatingkat, I was wondering about this. Below is a mock-up of a screen in Wikidata. You'll see two format identification pattern statements. Is it possible to write a Wikidata entry like this? And can ShEx be used to mock something up like this for Wikidata records?

It is based on https://www.wikidata.org/w/index.php?title=Q27229608&oldid=784082439

It is one of the clearest examples where the statement alone tries to bring together too much information but doesn't provide any way to interpret it. In comparison to the record proper, I'd love to be able to use two separate statements to say that each statement makes up a different identification pattern for Siegfried (or any format identification tool) to work with.

[image: mock-up of a Wikidata screen showing two format identification pattern statements]

It would help us to delineate sets of sequences better in two cases: where multiple signatures can be used for a format, e.g. here in PNG 1.1, and where sequences are drawn from different sources. For example, Kessler's table and PRONOM might have two different values that don't really belong in the same statement as each other, and should be interpreted as different signatures, each containing 1..* sequences.
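The "signature contains 1..* sequences" idea can be sketched as a small data model. This is only an illustration in Python, not a proposal for how Wikidata should store it: the class names `Sequence`, `Signature` and `FormatRecord`, their fields, and the example byte values are all assumptions, and the values are not copied from PRONOM or Kessler's table.

```python
from dataclasses import dataclass, field


@dataclass
class Sequence:
    """One byte sequence plus where to look for it (hypothetical fields)."""
    pattern: str                      # hexadecimal byte sequence
    offset: int = 0
    relative_to: str = "beginning of file"


@dataclass
class Signature:
    """A set of sequences from one source that must be read together."""
    source: str
    sequences: list

    def __post_init__(self):
        # Enforce the 1..* cardinality: at least one sequence is required.
        if not self.sequences:
            raise ValueError("a signature needs at least one sequence")


@dataclass
class FormatRecord:
    label: str
    signatures: list = field(default_factory=list)

    def by_source(self, source: str) -> list:
        """Return only the signatures attributed to the given source."""
        return [s for s in self.signatures if s.source == source]


# Illustrative values only: the PNG magic number and the IEND chunk tail.
png = FormatRecord(
    label="PNG 1.1",
    signatures=[
        Signature("PRONOM", [
            Sequence("89504E470D0A1A0A"),
            Sequence("49454E44AE426082", relative_to="end of file"),
        ]),
        Signature("Kessler's table", [Sequence("89504E470D0A1A0A")]),
    ],
)

for sig in png.signatures:
    print(sig.source, len(sig.sequences))
```

Keeping the source on the signature rather than on each sequence is what lets a tool such as Siegfried treat each set as one identification pattern.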

Having sets like this would be pretty powerful, though there might be a better solution. Any pointers appreciated. When I started looking at container signatures in Wikidata, I also wondered whether statements within statements would be possible.

emulatingkat commented 3 years ago

Thanks for this example. I understand the complexity here thanks to your mockup.

All statements using a common property are displayed in the same statement block in Wikidata. Thus, as far as I know, the mockup you shared won't be possible in Wikidata.
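A rough Python sketch of why the mock-up cannot be reproduced: if statements are grouped purely by property, two statements sharing a property always end up in one block. The function name `statement_blocks` and the tuple layout are hypothetical, and this only imitates the display behaviour described above rather than Wikidata's actual implementation.

```python
from collections import defaultdict


def statement_blocks(statements):
    """Group (property, value) pairs into one block per property."""
    blocks = defaultdict(list)
    for prop, value in statements:
        blocks[prop].append(value)
    return dict(blocks)


# Two patterns intended as separate signatures (illustrative values).
statements = [
    ("file format identification pattern", "89504E470D0A1A0A"),
    ("file format identification pattern", "49454E44AE426082"),
    ("instance of", "file format"),
]

blocks = statement_blocks(statements)
for prop, values in blocks.items():
    print(prop, values)
```

Both patterns land under the same key, so nothing in the grouping itself records which sequences belong to which signature.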

We could propose a new property for something like a "compound file format identification pattern" or "multi-part format identification pattern" to express what you describe.

Thanks for highlighting this gap in the data model.