digital-preservation / droid

DROID (Digital Record and Object Identification)
BSD 3-Clause "New" or "Revised" License
278 stars 75 forks source link

DROID does not match inverted sequences or multiple-byte endian values. #806

Open nishihatapalmer opened 2 years ago

nishihatapalmer commented 2 years ago

The DROID spec states that:

• [a:b]: wildcard matching any sequence of bytes which lies lexicographically between a and b, inclusive (where both a and b are byte sequences of the same length, containing no wildcards, and where a is less than b). The endian-ness of a and b are the same as the endian-ness of the signature as a whole. e.g. 0xFF [09:0B] FF would match 0xFF 09 FF, 0xFF 0A FF or 0xFF 0B FF.

• [!a]: wildcard matching any sequence of bytes other than a itself (where a is a byte sequence containing no wildcards). e.g. 0xFF [!09] FF would match 0xFF 0A FF, but not 0xFF 09 FF.

However, DROID does not work with sequences of values inside square brackets, only single bytes. It interprets square brackets as a set of values matching at a single position, not a sequence of values, which is, I have to say, standard regular expression behavior.

There is some discussion of this in #805 with various approaches given to it (as it was identified in that separate issue).

nishihatapalmer commented 2 years ago

Note - I believe that making DROID work to the spec would be quite a large and tricky piece of work, and that existing signatures are actually working (mostly, with some potential for false negatives).

It would be substantially easier to update the spec to say it doesn't do this, and modify the few signatures that use this construction to avoid the false negatives.

nishihatapalmer commented 2 years ago

Regardless of whether DROID is eventually updated to support these features, we definitely have a few signatures which as currently defined will occasionally give a false negative. It might be prudent to fix those signatures in the meantime to work the way DROID actually works (the modification required is trivial), even if the longer term plan is to support the spec as originally written.

steve-daly commented 2 years ago

I agree that we should keep the DROID behaviour as-is (i.e. assuming expressions in square brackets are ranges, not sequences) and we'll go and update signatures to use alternative syntax.

It's not universally possible to replace the intent of the previous syntax with the legal DROID syntax as we can't make matches conditional.

For example, for [!FEFF] we could match as ??[!FF] or [!FE]?? but these would inadvertently exclude sequences such as FF01 or 01FE which should theoretically be allowed from the intended meaning of [!FEFF]. For these WAV formats though, the only real-world options have the 2nd byte (actually the first byte, as per the spec, due to endianness) as 0x00, so it's fine to stick to single byte comparison, or just rely on format priorities to deal with the alternative case (i.e. one format matches the specific case, and falls through to a format which catches the intent of the ! case.

Dclipsham commented 2 years ago

Just to note ref https://github.com/digital-preservation/droid/issues/805#issuecomment-1220500644 - the other three signatures that make use of [!xxxx] type patterns are each very obviously trying to express sequences, not ranges

Dclipsham commented 2 years ago

I appreciate it is not trivial to achieve this behaviour, but since 'not-range' can already (and as per spec) be expressed with e.g. [!AA:FF] and 'not-sequence' is self-evidently a useful thing to be able to express, I would be in favour of working towards that as an outcome

Dclipsham commented 2 years ago

see also https://github.com/digital-preservation/droid/issues/805#issuecomment-1220540470 - Siegfried works as per spec

nishihatapalmer commented 2 years ago

Well, it's all possible to do. There are two things though:

1) Multiple byte ranges, both positive and inverted, e.g. [0000:1234] and [!0000:1234]. Has to take endianness into account, and 16 bit, 24 bit and 32 bit values? 64 bit?

2) Not sequences, e.g. [!AABBCCDDEEFF] - will require a different form of search for those sequences.

Both of these forms cannot use the standard search algorithms - they can only be matched or not matched at specific positions. So the search devolves into checking for a match or not match at every position that needs to be checked, rather than the much faster types of search possible if you have no dependencies between different byte positions.

Luckily, both these types of construction aren't forming entire signatures on their own (which would kill performance), they're small fragments of larger signatures, which can use efficient searching to find places to check, so impact on overall performance will probably be negligible.

nishihatapalmer commented 2 years ago

Still think the syntax is horrid though - using square brackets for sequences of things when all other regular expression languages use it to mean a set of bytes at a single position is confusing! Still, it is what it is...

It does mean that if we ever wanted to be able to specify a set of arbitrary bytes in a set to match (which byteseek fully supports), we'd have to use a different syntax to indicate that! Still, I guess the chance of fundamentally adding new capabilities to the PRONOM syntax is probably not something anyone is currently planning to do...

EDIT: in fact, we can already do this by specifying a set of byte alternatives: (AA|BB|CC|DD). So I guess we're covered.

nishihatapalmer commented 2 years ago

We should also check the container signature signatures too - they have a slightly different syntax - would be good to ensure that if those constructions are being used, they aren't being used to actually indicate a set of bytes.

steve-daly commented 2 years ago

Given the differing interpretations currently, we'll see if we can re-write the signatures to avoid the ambiguity. I think the format priority will let this happen in most cases if it can't be resolved within a specific signature (e.g. match the very specific case positively, and by implication fall back to another format covering all the other cases, effectively the inverse of the first). Like dclipsham has suggested for fmt/142 etc

Dclipsham commented 2 years ago

In latest container signature there are positive choices, e.g. [22 27], & ranges [00:FF] but no inversions

Dclipsham commented 2 years ago

there is mixed use of range syntax though, e.g. both [00:FF] and [01-04] are present, as are ASCII ['6'-'7']

steve-daly commented 2 years ago

What does [22 27] mean? Is that intending to be [22:27] or (22|27) or simply 2227?

nishihatapalmer commented 2 years ago

Should probably standardise on using the : syntax which is PRONOM. The range hyphen is a bit of byteseek syntax leaking into the signatures, which isn't necessary. Could possibly put some additional checks in to stop that happening...

But the [22 27] is very definitely a set of bytes a single quote OR a double quote. If we want to get back to standard PRONOM syntax, this should be rewritten as (22|27). I'm very sure it doesn't mean a single quote followed by a double quote!

steve-daly commented 2 years ago

We'll probably devote a release to tidying some of this up. The Container signature XML will be generated programmatically now, rather than by hand, and it might be easier to standardise some of the signature representations rather than cover these variations.

Dclipsham commented 2 years ago

yes that specific one is expressing either 0x22 or 0x27 so either ASCII ' or " - in a binary sig we express that (22|27) and this is most commonly found in XML subset formats, but container sigs can't handle the raw (22|27)

nishihatapalmer commented 2 years ago

are you sure container sigs can't handle (22|27)? that seems odd, would have to check that in more detail. I thought container sigs could support any PRONOM syntax, but also had some minor syntactic sugar (mostly single quoted strings), e.g. 'ABC' rather than 414243.

Dclipsham commented 2 years ago

well for binary sigs that pronom syntax is precompiled into separate subsequences in the XML, but looking again you're absolutely right - container ID 3120 includes (00|01).

Dclipsham commented 2 years ago

its the wildcard syntax it can't deal with e.g {0-4}

nishihatapalmer commented 2 years ago

hmmmm.... well, this has raised all sorts of things to investigate in more detail. Once I've fixed the serialization inverted byte / byte range issue, I'll have to have a deeper look at that. Would be good to prevent the byteseek syntax to leaking into signatures.

The way it currently works, it transforms the few small differences from PRONOM syntax into byteseek syntax - e.g. ?? is replaced by . and then it's parsed by byteseek into a matcher. This means that if any byteseek syntax exists in the signature, it will be parsed by byteseek as valid, even though the original signature wasn't valid PRONOM.

In fact, we may already have a good way to do this, since the code that sigtool uses to transform between PRONOM and container style syntax is already essentially a validating parser for PRONOM. We could use that instead of the clunky "string replace then parse as byteseek" which is currently used. Then invalid syntax will be flagged.

However - need to be careful. If we alter the code so it no longer supports the wrong syntax, then the newer versions of DROID would not be able to process the older incorrect container syntax. So older signature files would not work anymore.

Dclipsham commented 2 years ago

Cool. I'll work through the container sigs as they are then and correct [xx xx] to (xx|xx) and [xx-xx] sequences to [xx:xx] ones for submission to the next release.

cc @thorsted who probably creates more container sigs than anybody else these days....