ffdev-info / wikidp-issues

An issues repository for resolving issues in Wikidata around the records relating to Digital Preservation
GNU General Public License v3.0

Q26543628 can't return/decode Unicode quantifier '±' to denote variable (maximum) offset #12

Open ross-spencer opened 3 years ago

ross-spencer commented 3 years ago

Description of problem

A handful of records have an offset value of the form X±Y, seemingly to denote a maximum offset, which I believe corresponds to a variable position offset in PRONOM, so this value can be anywhere in the first or last range of 72 bytes.

While this seems like a reasonable shortcut, the value isn't decoding, so we only get one value in the SPARQL result. That is, if we expect 72±72, we receive 72 in the SPARQL result or the WQS UI.

There are other considerations here. If this is data we want to encode, we have to parse it further to decide what type of field we're looking at: if ± is present, this is a maximum offset and not a regular offset. Because we still receive "some value" (i.e. 72), we also don't know there is a problem with the data. I don't think we can know there is a problem without also knowing which Wikidata record we are looking at.

So I am wondering: 1. whether this should first be logged as an issue with Wikibase, which seems sensible; and 2. whether this is another piece of work for me to look at, bringing the concept of maximum offsets into Wikidata in association with the ShEx work needed.

NB. ± is the plus/minus symbol. Comparing with PRONOM, 72±72 is actually trying to denote a maximum offset of 144, which I can work out from that string, but I wouldn't have known how to use it otherwise.
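
The reading of 72±72 as a maximum offset of 144 can be sketched in Python. This assumes the value arrives as an intact string of the form X±Y, which is exactly what the SPARQL result does not currently give us. The function name and the returned fields are hypothetical, and treating X + Y as the maximum offset follows the interpretation in this issue rather than any documented Wikibase or PRONOM rule.

```python
import re

# Matches "72" or "72±72"; the ± part is optional.
_OFFSET_RE = re.compile(r"^\s*(\d+)\s*(?:±\s*(\d+))?\s*$")


def parse_offset(value: str) -> dict:
    """Split an offset string into its parts (hypothetical helper)."""
    match = _OFFSET_RE.match(value)
    if match is None:
        raise ValueError(f"unrecognised offset value: {value!r}")
    offset = int(match.group(1))
    if match.group(2) is None:
        # No ± present: treat as a regular, fixed offset.
        return {"offset": offset, "max_offset": None, "variable": False}
    # ± present: assume maximum offset = X + Y, as with 72±72 -> 144.
    max_offset = offset + int(match.group(2))
    return {"offset": offset, "max_offset": max_offset, "variable": True}


print(parse_offset("72±72"))
print(parse_offset("72"))
```

A bare 72 parses cleanly as a regular offset, which is why the truncated SPARQL value gives no hint that anything is missing.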

Other examples

These all seem to be PDF variants

Notes on auditing

I accessed the PRONOM reports and output a rudimentary subset of the XML:

cat * | grep -e Identifier -e MaxOffset
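
A slightly more structured take on the grep above, as a rough Python sketch: it parses a report and returns the text of elements named Identifier or MaxOffset. The element names are taken from the grep pattern; the sample document, its namespace and the function names are made up for illustration and are not a real PRONOM report. Unlike grep, this matches element names exactly, so it will not pick up other lines that merely contain the word "Identifier".

```python
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Drop any '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def identifiers_and_max_offsets(xml_text: str) -> list[tuple[str, str]]:
    """Return (element name, text) pairs in document order."""
    root = ET.fromstring(xml_text)
    wanted = {"Identifier", "MaxOffset"}
    return [
        (local_name(el.tag), (el.text or "").strip())
        for el in root.iter()
        if local_name(el.tag) in wanted
    ]


# Made-up sample for illustration only, not a real PRONOM report.
SAMPLE = """<Report xmlns="urn:example">
  <Format>
    <Identifier>fmt/0</Identifier>
    <Signature><MaxOffset>72</MaxOffset></Signature>
  </Format>
</Report>"""

print(identifiers_and_max_offsets(SAMPLE))
```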

And then cross-referenced as many of those as possible with:

select distinct ?format ?formatLabel 
where
{
  ?format wdt:P31/wdt:P279* wd:Q235557. 
  ?format wdt:P2748 "<PUID HERE>".
  ?format p:P4152 ?object. 

  # Wikidata's mechanism to return labels from SPARQL parameters.
  service wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE], en". }
}
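
To cross-reference many PUIDs, the query above can be filled in programmatically. A minimal sketch, assuming PUIDs look like fmt/N or x-fmt/N; the function name is hypothetical, and the sketch only builds the query text, it does not send it to the query service.

```python
import re

# The query from this issue, with the same placeholder.
QUERY_TEMPLATE = """\
select distinct ?format ?formatLabel
where
{
  ?format wdt:P31/wdt:P279* wd:Q235557.
  ?format wdt:P2748 "<PUID HERE>".
  ?format p:P4152 ?object.

  service wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE], en". }
}
"""

# Assumed PUID shape: "fmt/<digits>" or "x-fmt/<digits>".
_PUID_RE = re.compile(r"^(?:x-)?fmt/\d+$")


def build_query(puid: str) -> str:
    """Return the query text for one PUID (hypothetical helper)."""
    if not _PUID_RE.match(puid):
        # Refuse anything unexpected rather than pasting it into SPARQL.
        raise ValueError(f"not a PUID-shaped string: {puid!r}")
    return QUERY_TEMPLATE.replace("<PUID HERE>", puid)


# "fmt/14" is just an example value.
print(build_query("fmt/14"))
```

Validating the PUID before substitution keeps stray quotes or braces out of the query string.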