digital-preservation / droid

DROID (Digital Record and Object Identification)
BSD 3-Clause "New" or "Revised" License

MD5 File Identification Conflict with Aldus FreeHand Drawing Signature #1004

Open JohannesKarlsen99 opened 1 year ago

JohannesKarlsen99 commented 1 year ago

We are encountering challenges when attempting to identify MD5 files. Specifically, there are instances where the content of the MD5 file aligns with the signature of an Aldus FreeHand Drawing file.

We've made efforts to rectify this by using the checkForExtensionMismatches method within the binarySignatureIdentifier class. However, we've noted that this version of Aldus doesn't have a file extension. This results in no mismatch being detected. Upon inspecting the code, it seems that this behavior is deliberate.

This raises a concern: if a file format is defined without a file extension in its specification, then any file with that signature that also contains a file extension should be deemed inaccurate. The current implementation seems counterintuitive, as it doesn't align with the specified file format criteria.

If the design decision to avoid mismatches for formats without extensions was intentional, please provide clarity on the reasoning behind it. Given the described scenario, it appears to introduce errors in file identification.

sparkhi commented 1 year ago

Could you please share an example file here so we can explore further? Also, can you please confirm whether you are using the GUI or the CLI?

JohannesKarlsen99 commented 1 year ago

To clarify, we are using an implementation that integrates DROID's capabilities into our application. Specifically, we have developed code that leverages DROID's functionalities for file identification.

I've attached an example file. example.md5.zip

Dclipsham commented 1 year ago

Interesting scenario, not covered in the original FF ID documentation (https://www.nationalarchives.gov.uk/aboutapps/fileformat/pdf/automatic_format_identification.pdf, see section 3, starting at page 13).

As the signature for Freehand 1 uses only ASCII characters from the 0-f range, there's a small chance of false positives for formats such as MD5, SHA-1 and SHA-256, which are internally just a stream of hex represented as ASCII (albeit usually with a filepath appended if generated with utilities such as md5sum or sha256sum). For a 4-byte signature that chance is 1 in 65,536 (16^4), rather than the 1 in ~4 billion (256^4) you'd normally expect for an entirely random clash.
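The figures above can be checked with a few lines of Python. This is only a sketch of the arithmetic: it assumes each of the first four bytes is drawn uniformly and independently from the alphabet, which real checksum files only approximate, and the function name `clash_probability` is made up for illustration.

```python
from fractions import Fraction


def clash_probability(signature_length: int, alphabet_size: int) -> Fraction:
    """Chance that `signature_length` leading bytes, each drawn uniformly
    from `alphabet_size` possible values, equal one fixed signature."""
    return Fraction(1, alphabet_size) ** signature_length


# Arbitrary binary content: each byte can take any of 256 values.
random_bytes = clash_probability(4, 256)

# Hex-as-ASCII content (e.g. md5sum output): each byte is one of 16 characters.
hex_text = clash_probability(4, 16)

print(f"random bytes: 1 in {random_bytes.denominator:,}")
print(f"hex text:     1 in {hex_text.denominator:,}")
```

The narrower the alphabet a file format draws from, the more likely it is to collide with a short signature built from the same alphabet.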

It may be possible for the Freehand signature to be strengthened - I'd probably be looking towards @thorsted for advice there.

Regarding the behaviour where a format has an empty external signature, I can see arguments in both directions. Particularly in the Macintosh world (as in this specific case), extensions were sometimes used by convention rather than specification, so the lack of an official extension doesn't preclude people from using one. However, when a PRONOM external signature field has been left empty, it's normally because of a lack of a clear convention. So I don't think it would be a bad thing to have a mismatch flag where a format entry lacks an associated extension but a file instance has one.
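To make the two options concrete, here is a minimal Python sketch of the rule under discussion. It is not DROID's code: the function `has_extension_mismatch` and the parameter `flag_when_format_has_none` are hypothetical, and treating a file with no extension as a mismatch against a format that does list extensions is an assumption of this sketch only.

```python
from pathlib import PurePath


def has_extension_mismatch(filename, format_extensions,
                           flag_when_format_has_none=False):
    """Return True if the file's extension disagrees with the format entry.

    flag_when_format_has_none=False models the behaviour described in this
    thread: a format with no listed extensions never produces a mismatch.
    Setting it to True models the proposed change.
    """
    ext = PurePath(filename).suffix.lstrip(".").lower()
    known = {e.lower() for e in format_extensions}
    if not known:
        # Format entry lists no extensions (e.g. an old Macintosh-only format).
        return flag_when_format_has_none and bool(ext)
    # Assumption of this sketch: a missing extension also counts as a mismatch.
    return ext not in known


# A hex-only checksum file matched against a format with no listed extension:
print(has_extension_mismatch("example.md5", []))        # described behaviour
print(has_extension_mismatch("example.md5", [], True))  # proposed behaviour
```

Under the proposed rule an extensionless original file would still pass, since only the combination "format lists nothing, file has something" is flagged.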

thorsted commented 1 year ago

Interesting issue. More of a PRONOM issue than DROID.

David is correct with regard to file extensions. There are many Macintosh-only formats which were never assigned an extension, as extensions were unnecessary in Mac OS. Freehand versions 1 & 2 are examples of this. Based on later versions, released when the software was cross-platform, one could assign .FH1 and .FH2, but I don't agree this is necessary, as the original files will never have these extensions in the real world.

Identification of MD5 files is the more difficult issue. A defined binary signature will always have priority over an extension only signature.
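The priority rule can be sketched roughly as follows. This is a simplified illustration, not DROID's implementation: real PRONOM signatures are byte-sequence patterns with offsets rather than plain prefixes, the `identify` function is hypothetical, and the signature bytes `b"1c3f"` are a made-up placeholder, not the actual FreeHand 1 signature.

```python
def identify(data, filename, binary_signatures, extension_only):
    """Return format names, preferring binary signature hits.

    binary_signatures: {format name: leading bytes} (prefix match only).
    extension_only:    {format name: set of lowercase extensions}.
    """
    binary_hits = [name for name, magic in binary_signatures.items()
                   if data.startswith(magic)]
    if binary_hits:
        # Any binary hit wins; the extension is not consulted at all.
        return binary_hits
    ext = filename.rpartition(".")[2].lower() if "." in filename else ""
    return [name for name, exts in extension_only.items() if ext in exts]


# Placeholder tables for illustration only.
binary_signatures = {"Hypothetical hex-only format": b"1c3f"}
extension_only = {"MD5 File": {"md5"}}

clashing = identify(b"1c3fa0 *file.bin\n", "example.md5",
                    binary_signatures, extension_only)
clean = identify(b"9e107d *file.bin\n", "example.md5",
                 binary_signatures, extension_only)
print(clashing)
print(clean)
```

A checksum file that happens to start with the placeholder bytes is reported as the binary format, while any other checksum file falls through to the extension-only match.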

That being said, I can look at strengthening the FreeHand 1 signature to include more bytes, but who is to say once that is released your MD5 files won't clash with another signature?

JohannesKarlsen99 commented 1 year ago

I acknowledge the potential value in strengthening the FreeHand 1 signature, as this might help reduce the number of false positives. That said, my primary concern revolves around the behavior of mismatch detection when a file format's signature lacks a defined extension. In our integration of DROID, several md5 files were initially misidentified. Although the checkForExtensionMismatches method resolved many of these, I'm inclined to think that when a file has an extension and the signature doesn't, it should trigger a mismatch.