ffdev-info / wikidp-issues

An issues repository for resolving issues in Wikidata around the records relating to Digital Preservation
GNU General Public License v3.0

Cannot retrieve the correct encoding for the format identification pattern for Q162839 #2

Closed ross-spencer closed 3 years ago

ross-spencer commented 4 years ago

Description of problem

I am unable to get the correct encoding out of the SPARQL query for the XZ compression format. At the moment, an encoding of "hexadecimal" is being returned, but according to the Wikidata record it should be "ASCII". Additionally, references are being returned incorrectly. This is likely to affect every result where the source triples have similar complexity.

Query:

SELECT DISTINCT ?format ?formatLabel ?puid ?ldd ?extension ?mimetype ?sig ?referenceLabel ?date ?encodingLabel ?offset ?relativityLabel WHERE
{
  ?format wdt:P31/wdt:P279* wd:Q235557.
  ?format wdt:P2748 "fmt/1098". 
  OPTIONAL { ?format wdt:P3266 ?ldd }
  OPTIONAL { ?format wdt:P1195 ?extension }
  OPTIONAL { ?format wdt:P1163 ?mimetype }
  OPTIONAL { ?format wdt:P4152 ?sig }
  OPTIONAL {
     ?format p:P4152 ?object.
     ?object prov:wasDerivedFrom ?provenance.
     ?provenance pr:P248 ?reference;
        pr:P813 ?date.
  }
  OPTIONAL {
     ?format p:P4152 ?object.
     ?object pq:P3294 ?encoding.
  }
  OPTIONAL {
     ?format p:P4152 ?object.      
     ?object pq:P4153 ?offset.
  }
  OPTIONAL {
     ?format p:P4152 ?object.
     ?object pq:P2210 ?relativity.
  }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE], en". }
}
order by ?format

Result:

{
      "format" : {
        "type" : "uri",
        "value" : "http://www.wikidata.org/entity/Q162839"
      },
      "formatLabel" : {
        "xml:lang" : "en",
        "type" : "literal",
        "value" : "xz"
      },
      "extension" : {
        "type" : "literal",
        "value" : "xz"
      },
      "mimetype" : {
        "type" : "literal",
        "value" : "application/x-xz"
      },
      "puid" : {
        "type" : "literal",
        "value" : "fmt/1098"
      },
      "sig" : {
        "type" : "literal",
        "value" : "ý7zXZ"
      },
      "referenceLabel" : {
        "xml:lang" : "en",
        "type" : "literal",
        "value" : "Gary Kessler's File Signature Table"
      },
      "date" : {
        "datatype" : "http://www.w3.org/2001/XMLSchema#dateTime",
        "type" : "literal",
        "value" : "2017-08-07T00:00:00Z"
      },
      "encodingLabel" : {
        "xml:lang" : "en",
        "type" : "literal",
        "value" : "hexadecimal"
      },
      "offset" : {
        "datatype" : "http://www.w3.org/2001/XMLSchema#decimal",
        "type" : "literal",
        "value" : "0"
      },
      "relativityLabel" : {
        "xml:lang" : "en",
        "type" : "literal",
        "value" : "beginning of file"
      }
    }
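
When comparing many of these results, it can help to collapse each binding to plain values first. The following is a minimal sketch, assuming the standard SPARQL JSON results layout shown above; `flatten_binding` is a hypothetical helper name, and the sample is a trimmed copy of the binding in this issue.

```python
import json

def flatten_binding(binding):
    """Map each SPARQL JSON variable to its plain "value" string."""
    return {name: cell["value"] for name, cell in binding.items()}

# Trimmed copy of the binding shown in this issue.
raw = json.loads("""
{
  "format": {"type": "uri", "value": "http://www.wikidata.org/entity/Q162839"},
  "puid": {"type": "literal", "value": "fmt/1098"},
  "encodingLabel": {"xml:lang": "en", "type": "literal", "value": "hexadecimal"}
}
""")

row = flatten_binding(raw)
print(row["puid"], row["encodingLabel"])
```

Dropping the `type`, `datatype` and `xml:lang` keys loses information, so this is only suitable for quick comparison, not for faithful round-tripping.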

Permalink

https://www.wikidata.org/w/index.php?title=Q162839&oldid=1148885804

ross-spencer commented 4 years ago

Okay, this is going to affect a number of records, e.g. https://www.wikidata.org/w/index.php?title=Q1076355&oldid=1133318648 (Portable Executable), where we have both a hexadecimal and a PRONOM encoding. We're only retrieving the hexadecimal type, even for the PRONOM type, which makes the results more difficult to process.

emulatingkat commented 3 years ago

I don't yet have a complete solution for this, but I have a query that will get us part of the way there. Try it!

SELECT ?format ?formatLabel ?object ?encodingLabel ?referenceLabel
WHERE 
{ ?format wdt:P2748 "fmt/1098".
  ?format p:P4152 ?object;
          p:P4152 [pq:P3294 ?encoding].
     ?object prov:wasDerivedFrom ?provenance.
     ?provenance pr:P248 ?reference;
        pr:P813 ?date.

  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}

This demonstrates how we can get each of the encoding values to return, but it is still missing information we want, such as the PRONOM provenance.
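
For anyone who wants to try a query like this outside the browser, here is a minimal sketch of preparing the HTTP request with the Python standard library. It assumes the public Wikidata Query Service endpoint at `https://query.wikidata.org/sparql`; `build_request` and the User-Agent string are hypothetical and should be replaced with your own details.

```python
from urllib.parse import urlencode, urlsplit, parse_qs
from urllib.request import Request

ENDPOINT = "https://query.wikidata.org/sparql"

def build_request(query, user_agent="wikidp-issues-example/0.1 (hypothetical)"):
    """Prepare a GET request that asks the endpoint for JSON results."""
    url = ENDPOINT + "?" + urlencode({"query": query, "format": "json"})
    headers = {
        "User-Agent": user_agent,
        "Accept": "application/sparql-results+json",
    }
    return Request(url, headers=headers)

query = 'SELECT ?format WHERE { ?format wdt:P2748 "fmt/1098". }'
request = build_request(query)
print(request.full_url)

# Sending the request needs network access, so it is left commented out:
# import json, urllib.request
# with urllib.request.urlopen(request) as response:
#     bindings = json.load(response)["results"]["bindings"]
```

Building the request separately from sending it keeps the query encoding easy to check without touching the network.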

ross-spencer commented 3 years ago

That's great, Kat, thank you. Looking forward to chatting more about it today.

ross-spencer commented 3 years ago

For now I think we can use (without the statement references):

# Return all file format records from Wikidata. 
# 
select distinct ?format ?formatLabel ?puid ?extension ?mimetype ?encodingLabel ?relativityLabel ?sig
where
{
  ?format wdt:P31/wdt:P279* wd:Q235557.          # Return records of type File Format.
  optional { ?format wdt:P2748 ?puid.      }     # PUID is used to map to PRONOM signatures proper.
  optional { ?format wdt:P1195 ?extension  }
  optional { ?format wdt:P1163 ?mimetype   }
  # We don't need to require that there is a format identification pattern because
  # we want to be able to provide results for items without them that are mapped to
  # PRONOM anyway.
  optional { ?format p:P4152 ?object.              # Format identification pattern statement.
    optional { ?object pq:P3294 ?encoding.   }     # We don't always have an encoding.
    optional { ?object ps:P4152 ?sig.        }     # We always have a signature.
    optional { ?object pq:P2210 ?relativity. }     # Relativity to beginning or end of file.
    optional { ?object pq:P4153 ?offset.     }     # Offset relative to the relativity.
  }

  # Wikidata's mechanism to return labels from SPARQL parameters.
  service wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE], <<lang>>". }
}

It's pretty easy to audit the result of this by eye and see that the encoding aligns properly, so we should be able to complete the conversion of the current corpus of Wikidata signatures to give us something to work with.
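
The check by eye could also be approximated in code. This is a rough sketch, assuming the rows have already been reduced to plain dictionaries keyed by the query's variable names; `conflicting_encodings` is a hypothetical helper and the sample rows are invented for illustration, not taken from Wikidata.

```python
from collections import defaultdict

def conflicting_encodings(rows):
    """Return {(format, sig): sorted encodings} for pairs seen with more than one encoding."""
    seen = defaultdict(set)
    for row in rows:
        if "encodingLabel" in row:
            seen[(row["format"], row.get("sig"))].add(row["encodingLabel"])
    return {key: sorted(labels) for key, labels in seen.items() if len(labels) > 1}

# Invented sample rows: the first two disagree about the encoding of one signature.
rows = [
    {"format": "item-1", "sig": "sig-A", "encodingLabel": "hexadecimal"},
    {"format": "item-1", "sig": "sig-A", "encodingLabel": "ASCII"},
    {"format": "item-2", "sig": "sig-B", "encodingLabel": "ASCII"},
    {"format": "item-3"},
]

suspect = conflicting_encodings(rows)
print(suspect)
```

A signature that appears with two different encodings is not necessarily wrong, but it is a cheap signal for which records to inspect manually.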

If we output ?object then we can still access provenance information, but that would require a lot of additional queries. A query to retrieve provenance for a single format identification pattern statement might be as follows:

select ?provenance ?referenceLabel ?date ?formatLabel ?format 
where 
{  
  # We know the reference object associated with the format we might be looking at so
  # let's retrieve information about it. 
  wds:Q10287816-de2bb2b1-4bd1-108f-0694-03249cf5a9e2 prov:wasDerivedFrom ?provenance.
  ?provenance pr:P248 ?reference;
              pr:P813 ?date.
  # Grabbing the format ID and label for reference.
  ?format p:P4152 wds:Q10287816-de2bb2b1-4bd1-108f-0694-03249cf5a9e2. 
  service wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
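
Because this means one query per statement, the query text would need to be generated. Here is a minimal sketch of that step, assuming `?object` comes back as a full URI under `http://www.wikidata.org/entity/statement/` (the namespace behind the `wds:` prefix); `provenance_query` is a hypothetical helper and the template simply mirrors the query above.

```python
STATEMENT_PREFIX = "http://www.wikidata.org/entity/statement/"

PROVENANCE_TEMPLATE = """select ?provenance ?referenceLabel ?date ?formatLabel ?format
where
{{
  wds:{sid} prov:wasDerivedFrom ?provenance.
  ?provenance pr:P248 ?reference;
              pr:P813 ?date.
  ?format p:P4152 wds:{sid}.
  service wikibase:label {{ bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }}
}}"""

def provenance_query(statement_uri):
    """Build the per-statement provenance query from a full statement URI."""
    if not statement_uri.startswith(STATEMENT_PREFIX):
        raise ValueError("not a Wikidata statement URI: " + statement_uri)
    sid = statement_uri[len(STATEMENT_PREFIX):]
    return PROVENANCE_TEMPLATE.format(sid=sid)

# Statement identifier taken from the query in this comment.
query = provenance_query(
    STATEMENT_PREFIX + "Q10287816-de2bb2b1-4bd1-108f-0694-03249cf5a9e2"
)
print(query)
```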

ross-spencer commented 3 years ago

Hi @emulatingkat, I've been looking more deeply at this over the weekend as I wanted to make a decent push towards having everything in place. I think the SPARQL I need might be:

    # Return all file format records from Wikidata.
    #
    select distinct ?format ?formatLabel ?puid ?extension ?mimetype ?encodingLabel ?referenceLabel ?date ?relativityLabel ?offset ?sig
    where
    {
      ?format wdt:P31/wdt:P279* wd:Q235557.            # Return records of type File Format.
      optional { ?format wdt:P2748 ?puid.   }          # PUID is used to map to PRONOM signatures proper.
      optional { ?format wdt:P1195 ?extension  }
      optional { ?format wdt:P1163 ?mimetype   }
      # We don't need to require that there is a format identification pattern because
      # we want to be able to provide results for items without them that are mapped to
      # PRONOM anyway.
      optional { ?format p:P4152 ?object.              # Format identification pattern statement.
        optional { ?object pq:P3294 ?encoding.   }     # We don't always have an encoding.
        optional { ?object ps:P4152 ?sig.        }     # We always have a signature.
        optional { ?object pq:P2210 ?relativity. }     # Relativity to beginning or end of file.
        optional { ?object pq:P4153 ?offset.     }     # Offset relative to the relativity.

        optional { ?object prov:wasDerivedFrom ?provenance.
           optional { ?provenance pr:P248 ?reference;
                                  pr:P813 ?date.
                    }
        }
      }
      # Wikidata's mechanism to return labels from SPARQL parameters.
      service wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE], en". }
    }
    order by ?format

The difference to the original approach seems to be the way the optional statements are nested.
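
One way to picture why this matters: the original query read `?sig` through `wdt:P4152` and `?encoding` through a separate `p:P4152` statement path, so nothing tied a signature to its own encoding. The sketch below is a simplified model of that effect in Python rather than a reproduction of the query engine. It ignores ranks and references, the function names are hypothetical, and the two statements are invented.

```python
from itertools import product

# Invented (signature, encoding) pairs, one per statement on a single item.
statements = [("sig-A", "hexadecimal"), ("sig-B", "ASCII")]

def independent_rows(statements):
    """Model fetching signatures and encodings through unrelated patterns."""
    sigs = {sig for sig, _ in statements}
    encodings = {enc for _, enc in statements}
    return sorted(product(sigs, encodings))

def per_statement_rows(statements):
    """Model reading signature and encoding from the same statement node."""
    return sorted(set(statements))

# Pairings that only appear when the two values are fetched independently.
mismatched = set(independent_rows(statements)) - set(per_statement_rows(statements))
print(sorted(mismatched))
```

In this model the independent version produces every signature paired with every encoding, while the per-statement version only produces the pairs that actually exist.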

I'd still love to know a more elegant approach, but for now, this might be good enough to embed in the Siegfried work. I've performed a significant amount of spot-checking this past weekend and I think all the data aligns as expected.

ross-spencer commented 3 years ago

As per above, this was fixed, and then made available with Siegfried 1.9.0: https://github.com/richardlehane/siegfried/releases/tag/v1.9.0