Open Dclipsham opened 5 years ago
The use case for this would be iWork 2019 documents. In order to distinguish between Pages, Numbers, and Keynote, application-specific files would need to be referenced, but files specific to, say, Keynote have variable filenames.
In one sample we have "Slide1.iwa"; in a later sample we have "Slide-9273.iwa" or "Slide-10181.iwa". The files have the same content, but can't be referenced in a container signature.
The USDZ file format is another ZIP container file format with variable names and folders, making identification difficult. The USDC file inside can have a static name, a UUID, or both nested deep within other variable folder names.
Another container format with variable names is the project file for Autodesk ReCap. .RCP files contain three files: a BMP, a JPG, and an XML. The JPG and XML are named with a 32-character alphanumeric string which is unique to the file. The XML contains the root tag
<Autodesk Version="1.0">
which would be an identifiable string for a signature, but the name of the XML file is not static.
Link to sample RCP file
Proposal: glob patterns could be used to express these names.
Rationale:
Another example of a container format with variable names:
Tableau Packaged Workbook (.twbx) Samples here: https://community.tableau.com/s/topic/0TO4T000000RcA5WAK/workbook-calculation-library
Another example of a container format with a variable named file.
Web Archive Collection Zipped (WACZ) https://specs.webrecorder.net/wacz/1.1.1/#archive
@thorsted with WACZ there appear to be enough mandatory files, e.g. datapackage.json, pages/pages.jsonl etc., to be able to create a reliable container sig pattern. We've got a sig going through internal testing right now that's proving reliable so far. Do you have files that are variant?
No variants. Greg and I did send in the following to PRONOM, looks like it will be released in v110.
<File>
  <Path>datapackage.json</Path>
  <BinarySignatures>
    <InternalSignatureCollection>
      <InternalSignature ID="300">
        <ByteSequence Reference="BOFoffset">
          <SubSequence Position="1" SubSeqMinOffset="0" SubSeqMaxOffset="4096">
            <Sequence>'wacz_version'</Sequence>
          </SubSequence>
        </ByteSequence>
      </InternalSignature>
    </InternalSignatureCollection>
  </BinarySignatures>
</File>
Great, thank you!
I've tried implementing this on siegfried's develop branch. Now, if you use a container path that looks like a glob (contains *, ? or [] chars) and is a valid glob, then it will do glob matching instead of literal string matching.
@thorsted if you'd like some binaries to try let me know what OS and I can build for you. Or, if you can share some files and container signatures I can test for you.
With glob syntax you can use the * (many characters) and ? (single character) wildcards:
Slide*.iwa
????????-????-????-????-????????????.xml
You can also do character sets:
256_[abcdefABCDEF0123456789][abcdefABCDEF0123456789][abcdefABCDEF0123456789][abcdefABCDEF0123456789][abcdefABCDEF0123456789][abcdefABCDEF0123456789][abcdefABCDEF0123456789][abcdefABCDEF0123456789][abcdefABCDEF0123456789][abcdefABCDEF0123456789][abcdefABCDEF0123456789][abcdefABCDEF0123456789][abcdefABCDEF0123456789][abcdefABCDEF0123456789][abcdefABCDEF0123456789][abcdefABCDEF0123456789]
Character sets with a lot of repeats are verbose, but maybe 256_???????????????? is precise enough anyway.
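To make the trade-off between the loose and strict patterns concrete, here is a minimal sketch using Python's standard fnmatch module. It is not siegfried's implementation, and glob dialects differ between languages; the HEX range shorthand and the matching_patterns helper are assumptions made for this illustration only.

```python
from fnmatch import fnmatchcase

# Assumption: the glob dialect supports ranges, so this is shorthand
# for the long explicit character set quoted above.
HEX = "[0-9a-fA-F]"

PATTERNS = {
    "iwork_slide": "Slide*.iwa",         # * matches any run of characters
    "thumbs_loose": "256_" + "?" * 16,   # ? matches exactly one character
    "thumbs_strict": "256_" + HEX * 16,  # 16 hexadecimal digits
}

def matching_patterns(name):
    """Return the sorted keys of every pattern in PATTERNS that matches name."""
    return sorted(k for k, p in PATTERNS.items() if fnmatchcase(name, p))

for name in ["Slide1.iwa", "Slide-9273.iwa",
             "256_0123456789abcdef", "256_" + "z" * 16]:
    print(name, matching_patterns(name))
```

The last name is accepted by the loose pattern but rejected by the strict one, which is the precision given up in exchange for the shorter pattern.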
One issue I hit was that a number of container signatures have "[Content_Types].xml" as the path test. If you interpret this as a glob, then the literal path won't match (instead C.xml, o.xml, n.xml etc. would all match). I ended up just special-casing this, but it is a potential future footgun unless you can distinguish when paths should be interpreted as globs/regex or as literal strings.
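The ambiguity can be reproduced with any glob matcher. The sketch below uses Python's fnmatch and glob.escape purely as an illustration; it does not show how siegfried special-cases the path.

```python
from fnmatch import fnmatchcase
from glob import escape

path = "[Content_Types].xml"

# Read as a glob, the brackets form a character set, so the pattern
# matches any single character from "Content_Types" followed by ".xml".
as_glob = {name: fnmatchcase(name, path)
           for name in [path, "C.xml", "o.xml", "x.xml"]}
print(as_glob)

# Escaping the opening bracket restores literal matching.
escaped = escape(path)
print(escaped, fnmatchcase(path, escaped))
```

Escaping works here, but it changes the text stored in the signature file, so existing tools that match the path literally would no longer find it.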
Nice work Richard!
> I ended up just special casing this but it is a potential future footgun unless you can distinguish when paths should be interpreted as globs/regex or as literal strings.
Depending on how far down the development path recent PRONOM work is, extending the container XML might be an option. An attribute in the XML?
<path type="glob">ACDC</path>
Or a new element?
<globPath>AC*DC</globPath>
There are a number of reasons I feel it would be better if the pattern type were explicit, rather than something a reader of the signature file has to infer.
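A minimal sketch of how a reader might honour an explicit pattern type, assuming the attribute form of the proposal. The type="glob" attribute is only the suggestion made in this thread and is not part of any released container signature schema; path_matcher is a hypothetical helper.

```python
import xml.etree.ElementTree as ET
from fnmatch import fnmatchcase

def path_matcher(path_elem):
    """Build a name-matching function from a <Path> element.

    Assumption: glob matching is used only when the (proposed,
    hypothetical) type="glob" attribute is present; otherwise the
    path is compared as a literal string, as it is today.
    """
    pattern = path_elem.text or ""
    if path_elem.get("type") == "glob":
        return lambda name: fnmatchcase(name, pattern)
    return lambda name: name == pattern

literal = path_matcher(ET.fromstring("<Path>[Content_Types].xml</Path>"))
globbed = path_matcher(ET.fromstring('<Path type="glob">Slide*.iwa</Path>'))

print(literal("[Content_Types].xml"), literal("C.xml"))
print(globbed("Slide-9273.iwa"), globbed("Slide1.txt"))
```

With the type stated in the signature file, "[Content_Types].xml" needs no special case, because an untyped path is never interpreted as a glob.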
> Depending how far down the development path recent PRONOM work is...
Answering my own question a bit as I write, but if DROID didn't complain about either of the two options, i.e. simply ignored them, then the XML could maybe be agreed upon and extended prior to its inclusion in a future PRONOM user interface?
Another option would be to enclose ambiguous paths (just [Content_Types].xml at the moment) within single or double quote marks to force literal matching (the same way you do it in a command line)
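The quoting idea could be sketched as follows. The rules in interpret are assumptions made for illustration (a matching pair of surrounding quotes forces a literal, otherwise any of * ? [ triggers glob matching), and they omit the "is a valid glob" check mentioned earlier in the thread.

```python
from fnmatch import fnmatchcase

GLOB_CHARS = set("*?[")

def interpret(path):
    """Classify a container path as ("literal", text) or ("glob", pattern)."""
    quoted = len(path) >= 2 and path[0] == path[-1] and path[0] in "'\""
    if quoted:
        # Surrounding quotes force literal matching and are stripped.
        return ("literal", path[1:-1])
    if any(c in GLOB_CHARS for c in path):
        return ("glob", path)
    return ("literal", path)

def matches(path, name):
    """Test a file name from inside the container against a path."""
    kind, pattern = interpret(path)
    return fnmatchcase(name, pattern) if kind == "glob" else name == pattern

print(matches('"[Content_Types].xml"', "[Content_Types].xml"))
print(matches("[Content_Types].xml", "[Content_Types].xml"))
print(matches("Slide*.iwa", "Slide1.iwa"))
```

A tool that does not know this convention would treat the quote marks as part of the file name, so a quoted path would not match there.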
Started keeping a list. May be others I need to add. https://docs.google.com/spreadsheets/d/120Xt6oP4QVV3aj_MelvewytjBJNL4RgR-6z0DHWMT_E/edit?usp=sharing
@richardlehane I would love to test, any Mac version will do. I can also put together a test set of formats with these unique structures for all of us to test with.
There are fresh sf and roy binaries in the *mac64.zip file here: https://github.com/richardlehane/siegfried/releases/tag/v1.11.0-rc1
Related issue: https://github.com/digital-preservation/droid/issues/823
> Another option would be to enclose ambiguous paths (just [Content_Types].xml at the moment) within single or double quote marks to force literal matching (the same way you do it in a command line)
Much more elegant.
The only issue may be backward compatibility.
If you zip a file with the name [content-type].xml.
And use this combination of signature files:
Standard sig
<?xml version="1.0" encoding="UTF-8"?>
<FFSignatureFile xmlns="http://www.nationalarchives.gov.uk/pronom/SignatureFile" Version="1" DateCreated="2023-05-31T14:02:16">
  <InternalSignatureCollection>
    <InternalSignature ID="2" Specificity="Specific">
      <ByteSequence Reference="BOFoffset">
        <SubSequence Position="1" MinFragLength="0" SubSeqMinOffset="0" SubSeqMaxOffset="4">
          <Sequence>504B0304</Sequence>
        </SubSequence>
      </ByteSequence>
      <ByteSequence Reference="EOFoffset">
        <SubSequence Position="1" MinFragLength="0" SubSeqMinOffset="61" SubSeqMaxOffset="65565">
          <Sequence>504B01</Sequence>
        </SubSequence>
      </ByteSequence>
      <ByteSequence Reference="EOFoffset">
        <SubSequence Position="1" MinFragLength="0" SubSeqMinOffset="0" SubSeqMaxOffset="65535">
          <Sequence>504B0506</Sequence>
        </SubSequence>
      </ByteSequence>
    </InternalSignature>
    <InternalSignature ID="3" Specificity="Specific">
      <ByteSequence Reference="BOFoffset">
        <SubSequence Position="1" MinFragLength="0" SubSeqMinOffset="0" SubSeqMaxOffset="0">
          <Sequence>504B0304</Sequence>
        </SubSequence>
      </ByteSequence>
      <ByteSequence Reference="BOFoffset">
        <SubSequence Position="1" MinFragLength="0" SubSeqMinOffset="4" SubSeqMaxOffset="30">
          <Sequence>5B436F6E74656E745F54797065735D2E786D6C20A2</Sequence>
        </SubSequence>
      </ByteSequence>
      <ByteSequence Reference="EOFoffset">
        <SubSequence Position="1" MinFragLength="0" SubSeqMinOffset="0" SubSeqMaxOffset="65535">
          <Sequence>504B0102</Sequence>
        </SubSequence>
      </ByteSequence>
      <ByteSequence Reference="EOFoffset">
        <SubSequence Position="1" MinFragLength="0" SubSeqMinOffset="0" SubSeqMaxOffset="65535">
          <Sequence>504B0506</Sequence>
        </SubSequence>
      </ByteSequence>
    </InternalSignature>
    <InternalSignature ID="4" Specificity="Specific">
      <ByteSequence Reference="BOFoffset">
        <SubSequence Position="1" MinFragLength="0" SubSeqMinOffset="0" SubSeqMaxOffset="0">
          <Sequence>D0CF11E0A1B11AE1</Sequence>
        </SubSequence>
      </ByteSequence>
      <ByteSequence Reference="BOFoffset">
        <SubSequence Position="1" MinFragLength="0" SubSeqMinOffset="0" SubSeqMaxOffset="28">
          <Sequence>FEFF</Sequence>
        </SubSequence>
      </ByteSequence>
    </InternalSignature>
  </InternalSignatureCollection>
  <FileFormatCollection>
    <FileFormat ID="1" Name="Development Signature" PUID="dev/1" Version="1.0" MIMEType="application/octet-stream">
      <Extension>ext</Extension>
    </FileFormat>
    <FileFormat ID="2" Name="ZIP Format" PUID="x-fmt/263" Version="" MIMEType="application/zip">
      <InternalSignatureID>2</InternalSignatureID>
      <Extension>zip</Extension>
    </FileFormat>
    <FileFormat ID="3" Name="Microsoft Office Open XML" PUID="fmt/189" Version="" MIMEType="application/octet-stream">
      <InternalSignatureID>3</InternalSignatureID>
    </FileFormat>
    <FileFormat ID="4" Name="OLE2 Compound Document Format" PUID="fmt/111" Version="" MIMEType="application/octet-stream">
      <InternalSignatureID>4</InternalSignatureID>
    </FileFormat>
  </FileFormatCollection>
</FFSignatureFile>
Container sig
<?xml version="1.0" encoding="UTF-8"?>
<ContainerSignatureMapping SchemaVersion="1.0" SignatureVersion="1">
  <ContainerSignatures>
    <ContainerSignature Id="2" ContainerType="ZIP">
      <Description>Development Signature</Description>
      <Files>
        <File>
          <Path>[content-type].xml</Path>
        </File>
      </Files>
    </ContainerSignature>
  </ContainerSignatures>
  <FileFormatMappings>
    <FileFormatMapping signatureId="2" Puid="dev/1"></FileFormatMapping>
  </FileFormatMappings>
  <TriggerPuids>
    <TriggerPuid ContainerType="OLE2" Puid="fmt/111"></TriggerPuid>
    <TriggerPuid ContainerType="ZIP" Puid="fmt/189"></TriggerPuid>
    <TriggerPuid ContainerType="ZIP" Puid="x-fmt/263"></TriggerPuid>
  </TriggerPuids>
</ContainerSignatureMapping>
Then the sequence will be matched in DROID and Siegfried:
---
siegfried   : 1.9.3
scandate    : 2023-05-31T16:10:10+02:00
signature   : default.sig
created     : 2023-05-31T16:10:08+02:00
identifiers :
  - name    : 'pronom'
    details : 'my-standard-sig.xml; mysig.xml; built without reports'
---
filename : 'sample.ext'
filesize : 216
modified : 2023-05-31T15:35:24+02:00
errors   :
matches  :
  - ns      : 'pronom'
    id      : 'dev/1'
    format  : 'Development Signature'
    version : '1.0'
    mime    : 'application/octet-stream'
    basis   : 'extension match ext; container name [content-type].xml with name only'
    warning :
But then if that path is quoted, e.g. "[content-type].xml", neither tool knows what to do with that out of the box, so it doesn't match. Maybe there's an escape I am missing? Otherwise, I'm not sure off the top of my head what the instruction to DROID is here.
So, the question may come down to this: how do Siegfried and DROID start taking advantage of glob-enabled signature files today, while still enabling current versions to use existing container signatures?
Another format with a variable file name is the MXL (Compressed MusicXML) format. It has an XML file inside the container with an identifiable root entry pattern, but the name of the XML file is variable.
Container signatures currently require finding specific files with specific names within the ZIP container. In certain circumstances the file names may be variable. For Gnumeric, for example, and other formats that use GZIP-based compression as standard, the contained file usually shares the name of the GZ container. For Thumbs.db files generated later than Windows XP, the contained file is named with what appears to be a partial checksum, following the pattern '256_xxxxxxxxxxxxxxxx', where each 'x' is a hexadecimal digit.
It would therefore be useful to be able to express container signatures with variably named files. This needs careful consideration because, for example, a full wildcard name would mean that every ZIP file would trigger a scan of any and all files contained within!