digital-preservation / PRONOM_Research

Suggest re-think on fmt/1875 - 'Maptech BSB Documentation File' #32

Closed Dclipsham closed 12 months ago

Dclipsham commented 1 year ago

fmt/1875 was introduced in v114. Its signature is looking for three relatively simple strings: 'VER/', 'NTM/' and 'CED/'.

However, because all of these are variably positioned, this means no anchor, and seems prone to clashes elsewhere. In my own test suite I have incidental clashes with a couple of very distinct formats:

fmt/124 - Encapsulated PostScript File Format 3
fmt/468 - ISO 9660 Disk Image File

Further, because of the solely variably positioned sequences, this re-introduces some of the processing inefficiencies we tried to iron out in https://github.com/digital-preservation/droid/issues/906. It'll scan the entirety of a 100GB+ file trying in vain to find a match.
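To make the cost concrete, here is a minimal Python sketch of the scanning behaviour described above. It is not DROID's implementation: the `scan` function, its `limit` parameter, and the chunk size are hypothetical, and the three strings are treated as independent, order-agnostic markers. It only shows that an uncapped search must read a non-matching file to the end, while a capped one stops early.

```python
import io

# The three strings named in the issue, as raw bytes.
MARKERS = (b"VER/", b"NTM/", b"CED/")


def scan(stream, limit=None, chunk_size=64 * 1024):
    """Return (matched, bytes_read) for a binary stream.

    limit=None reads until EOF; otherwise at most `limit` bytes are read.
    Hypothetical helper for illustration only, not DROID's algorithm.
    """
    remaining = set(MARKERS)
    # Keep a short tail so a marker split across two chunks is still found.
    overlap = max(len(m) for m in MARKERS) - 1
    tail = b""
    bytes_read = 0
    while remaining:
        want = chunk_size if limit is None else min(chunk_size, limit - bytes_read)
        if want <= 0:
            break
        chunk = stream.read(want)
        if not chunk:
            break
        bytes_read += len(chunk)
        buf = tail + chunk
        remaining = {m for m in remaining if m not in buf}
        tail = buf[-overlap:]
    return (not remaining, bytes_read)


# A 1 MB blob with none of the markers stands in for a large unrelated file.
blob = b"\x00" * 1_000_000
uncapped = scan(io.BytesIO(blob))
capped = scan(io.BytesIO(blob), limit=704)
```

In this sketch the uncapped call reads all 1,000,000 bytes before giving up, while the capped call reads 704. The same ratio applies, far more painfully, to a 100GB+ file.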

If there's no obvious quick fix I would suggest dropping the signature for now and making it extension-only for the time being.

gleporeNARA commented 1 year ago

That's one of my signatures. All of the sample files I have show the strings within the first 700 bytes, so assigning that as the maximum offset would alleviate most of the problems.

As you know, some formats can only be identified by relatively short strings like this one. I thought VER/, NTM/, and CED/ would be unique enough to not produce false positives. Looks like I was wrong!

I am very much against extension-only signatures. Novice users, and even some more experienced ones, might see an extension-only match and assume it's a positive identification, rather than just a match on the extension.

Extension matches are (obviously) more prone to false positives than this signature.

Please try the signature with SubSeqMaxOffset="700", and let me know if you still have false positives.
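As a rough illustration of what the cap changes, the Python sketch below checks whether each string starts within a maximum offset from the beginning of the file. This is only an approximation under stated assumptions: `matches_capped` is a hypothetical helper, the three strings are treated as independent and order-agnostic, and the offset is assumed to bound where a string may begin, which may not match how DROID interprets SubSeqMaxOffset in every detail. The sample bytes are invented, not taken from a real BSB file.

```python
# The three strings named in the issue, as raw bytes.
MARKERS = (b"VER/", b"NTM/", b"CED/")


def matches_capped(data: bytes, max_offset: int = 700) -> bool:
    """True if every marker begins at or before max_offset bytes from BOF.

    Hypothetical helper for illustration; not DROID's matching logic.
    """
    for marker in MARKERS:
        # A marker starting exactly at max_offset still fits in this window.
        window = data[: max_offset + len(marker)]
        if marker not in window:
            return False
    return True


# Invented header with all three strings near the start of the file.
header = b"VER/3.0\r\nNTM/1\r\nCED/SE=1\r\n"

# The same strings buried about 5,000 bytes into an unrelated file,
# standing in for an incidental clash.
buried = b"%!PS-Adobe-3.0" + b" " * 5000 + header

near_start = matches_capped(header)
incidental = matches_capped(buried)
```

Here `near_start` is true and `incidental` is false, which is the intended effect of the cap: strings that appear deep inside an unrelated file no longer count.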

Thanks for testing out the signature.

Dclipsham commented 1 year ago

Capping it would make me very happy. Even a relatively large cap, e.g. a max offset of 100,000 bytes from BOF (although that doesn't sound needed here), would greatly reduce the potential for false positives and be much more performant. Thank you for the input, Greg!

tnafrancesca commented 1 year ago

This will be in the next release; I'm uploading the latest DevTest signature for testing. I'll close this issue once v.115 has been released.

tnafrancesca commented 12 months ago

This should be updated in v.116