digital-preservation / PRONOM_Research

Proposal to shorten AVI signature #28

Closed Dclipsham closed 1 year ago

Dclipsham commented 1 year ago

This may touch upon issues for discussion as part of #27 but I would like to suggest shortening the fmt/5 - AVI signature.

The current signature consists of three parts:

52494646{4}41564920*4C495354{4}6864726C61766968*4C495354{4}6D6F7669

'RIFF'{4}'AVI ', that we seek from offset 0
Then anywhere in the file following the RIFF chunk: 'LIST'{4}'hdrlavih'
Then anywhere in the file following the hdrl LIST chunk: 'LIST'{4}'movi'
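For readers less familiar with PRONOM syntax, the three-part signature can be approximated as a Python bytes regex. This is only a rough sketch, not how DROID or any other tool actually evaluates signatures; the names `FULL_AVI` and `matches_full_signature` are made up for illustration. It assumes `{4}` means exactly four arbitrary bytes and `*` means any number of bytes.

```python
import re

# PRONOM syntax -> regex: {4} is four arbitrary bytes, * is any run of bytes.
# re.DOTALL lets "." match every byte value, including newlines.
FULL_AVI = re.compile(
    rb"RIFF.{4}AVI "          # 52494646{4}41564920, anchored at offset 0
    rb".*?LIST.{4}hdrlavih"   # *4C495354{4}6864726C61766968
    rb".*?LIST.{4}movi",      # *4C495354{4}6D6F7669
    re.DOTALL,
)


def matches_full_signature(data: bytes) -> bool:
    """Return True if data matches all three parts, starting at offset 0."""
    return FULL_AVI.match(data) is not None


# Synthetic bytes for illustration only, not a valid AVI file.
sample = (
    b"RIFF" + b"\x00" * 4 + b"AVI "
    + b"LIST" + b"\x00" * 4 + b"hdrlavih" + b"\x00" * 56
    + b"LIST" + b"\x00" * 4 + b"movi"
)
print(matches_full_signature(sample))
print(matches_full_signature(sample[:12]))
```

The second and third parts have no fixed offset, so a matcher may have to scan a large portion of the file before it finds them or gives up.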

This isn't incorrect - it follows the expectations of the AVI file format as described here: https://learn.microsoft.com/en-us/windows/win32/directshow/avi-riff-file-reference

The problem is that it's pretty inefficient. AVI files can be huge (239 GB is the largest I've seen), so seeking the LIST elements can take a very long time (over an hour in the above case). Although they're mandatory elements, I don't think seeking the LIST chunks provides much identification strength beyond what we're already getting with the RIFF AVI chunk.

I propose shortening the signature to just the 'RIFF'{4}'AVI ' part. fmt/5 is the only file format in PRONOM that has this sequence, so we won't clash with anything already known to PRONOM, and I believe it highly unlikely that any other RIFF-based format would use 'AVI ' as its FOURCC identifier (absent any future proposals to identify media container and codec variants through PRONOM).

This therefore would make the signature just: 52494646{4}41564920
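As a rough illustration of why the shortened signature is cheap, a check for it only needs the first 12 bytes of the file, whatever the file size. This is a minimal sketch with hypothetical function names (`looks_like_avi`, `file_looks_like_avi`), not the way any PRONOM-based tool implements matching.

```python
def looks_like_avi(header: bytes) -> bool:
    """Check 52494646{4}41564920 against the first 12 bytes of a file."""
    return (
        len(header) >= 12
        and header[0:4] == b"RIFF"    # 52494646
        and header[8:12] == b"AVI "   # 41564920, after the 4-byte size field
    )


def file_looks_like_avi(path: str) -> bool:
    """Read only the first 12 bytes, so the cost does not grow with file size."""
    with open(path, "rb") as f:
        return looks_like_avi(f.read(12))
```

For example, `looks_like_avi(b"RIFF" + b"\x10\x00\x00\x00" + b"AVI ")` is true, while the same header with `b"WAVE"` in place of `b"AVI "` is not.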

Happy to hear any thoughts on this.

David

prettybits commented 1 year ago

I agree with this, independently of the outcome of the discussion on identifying specific codec variants (or extensions like OpenDML for AVI). The main RIFF Data Specification says this about Defining and registering RIFF forms:

> The form-type code for a RIFF form must be unique. To guarantee this uniqueness, you must register any new form types before release.

and

> By convention, the form-type code for registered form types contains only digits and uppercase letters. Form-type codes that are all uppercase denote a registered, unique form type. Use lowercase letters for temporary or prototype chunk types.

so it should be perfectly sufficient to rely on just the AVI RIFF form type code for basic identification of AVI files. Additional format entries for subtypes like AVI with OpenDML extensions ("AVI 2.0") or "DV AVI" files are probably a good idea to have as well, but that would likely bring up the same performance concerns again?

nkrabben commented 1 year ago

Would it be useful to have a generic container signature for RIFF? That would allow tools to catch any RIFF-based file and signal the need for a more specific format signature. It would also allow to group all of the RIFF formats more easily in the database. Right now, you have to run searches for RIFF and Resource Interchange File Format.

I'm thinking about proposing similar PRONOM signatures for ISO BMFF / MPEG-4 Part 12 files.

Dclipsham commented 1 year ago

> Would it be useful to have a generic container signature for RIFF? That would allow tools to catch any RIFF-based file and signal the need for a more specific format signature. It would also allow to group all of the RIFF formats more easily in the database. Right now, you have to run searches for RIFF and Resource Interchange File Format.
>
> I'm thinking about proposing similar PRONOM signatures for ISO BMFF / MPEG-4 Part 12 files.

Yes I think that would be useful. Signature would presumably be 'RIFF' 0x52494646, which isn't the strongest in the world, but I've come across worse.

Note the current list of formats that start 'RIFF':

fmt/1 Broadcast WAVE 0 Generic
fmt/141 Waveform Audio (PCMWAVEFORMAT)
fmt/143 Waveform Audio (WAVEFORMATEXTENSIBLE)
fmt/2 Broadcast WAVE 1 Generic
fmt/386 Microsoft Animated Cursor Format
fmt/427 CorelDraw Drawing 12.0
fmt/428 CorelDraw Drawing X3
fmt/431 Corel R.A.V.E. 2
fmt/432 Corel R.A.V.E. 3
fmt/464 CorelDraw Drawing 5.0
fmt/465 CorelDraw Drawing 4.0
fmt/5 Audio/Video Interleaved Format
fmt/527 Broadcast WAVE 2 Generic
fmt/566 WebP Lossy
fmt/567 WebP Lossless
fmt/568 WebP Extended
fmt/6 Waveform Audio
fmt/624 RIFF Palette Format
fmt/703 Broadcast WAVE 0 PCM Encoding
fmt/704 Broadcast WAVE 1 PCM Encoding
fmt/705 Broadcast WAVE 2 PCM Encoding
fmt/706 Broadcast WAVE 0 MPEG Encoding
fmt/707 Broadcast WAVE 1 MPEG Encoding
fmt/708 Broadcast WAVE 2 MPEG Encoding
fmt/709 Broadcast WAVE 0 WAVEFORMATEXTENSIBLE Encoding
fmt/710 Broadcast WAVE 1 WAVEFORMATEXTENSIBLE Encoding
fmt/711 Broadcast WAVE 2 WAVEFORMATEXTENSIBLE Encoding
x-fmt/222 CD Audio
x-fmt/29 CorelDraw Drawing 6.0
x-fmt/291 CorelDraw Drawing 7.0
x-fmt/292 CorelDraw Drawing 8.0
x-fmt/33 Corel R.A.V.E. 1
x-fmt/34 Corel Presentation Exchange File 5.0
x-fmt/35 Corel Presentation Exchange File 6 and Later
x-fmt/374 CorelDraw Drawing 9.0
x-fmt/375 CorelDraw Drawing 10.0
x-fmt/378 CorelDraw Drawing 11.0
x-fmt/379 CorelDraw Drawing 3.0
x-fmt/389 Exchangeable Image File Format (Audio) 2.1
x-fmt/396 Exchangeable Image File Format (Audio) 2.2
x-fmt/397 Exchangeable Image File Format (Audio) 2.0
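A generic 'RIFF' match on its own would only say that a file is RIFF-based; the four-byte form type at offset 8 is what separates the families in the list above. A minimal sketch of reading it, where `riff_form_type` is a hypothetical helper and not part of any existing tool:

```python
from typing import Optional


def riff_form_type(header: bytes) -> Optional[str]:
    """Return the 4-character form type of a RIFF header, or None if not RIFF."""
    if len(header) < 12 or header[:4] != b"RIFF":   # generic signature 52494646
        return None
    # Bytes 4-7 hold the chunk size; bytes 8-11 hold the form type.
    return header[8:12].decode("ascii", errors="replace")


print(riff_form_type(b"RIFF" + b"\x00" * 4 + b"AVI "))
print(riff_form_type(b"RIFF" + b"\x00" * 4 + b"WAVE"))
```

A tool could report the generic RIFF match together with the form type when no more specific signature applies.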

thorsted commented 1 year ago

I recently submitted a bunch of CorelDRAW updates. https://github.com/thorsted/PRONOM_Research/tree/main/Submissions/CorelDRAW%20Updates

They started packaging as a ZIP with the RIFF data inside.

prettybits commented 1 year ago

> Would it be useful to have a generic container signature for RIFF? That would allow tools to catch any RIFF-based file and signal the need for a more specific format signature. It would also allow to group all of the RIFF formats more easily in the database.

I think that would be valuable as well and be covered with the simple signature David mentions. The shared-mime-info database for comparison also has an explicit entry for application/x-riff, calling it "RIFF Container" - personally I would probably call it "RIFF Tagged File" to stay closer to the original spec and reduce confusion with the "container" term as used for container signatures. The spec also mentions the "RIFX" variant with Motorola integer byte-ordering, this could be added in the same go?
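To illustrate the RIFX point: the two variants differ in the leading tag and in the byte order of the size field, so a generic check could treat them together. The sketch below assumes the usual layout (4-byte tag, 4-byte size, 4-byte form type) and uses a hypothetical helper name, `parse_riff_header`.

```python
import struct
from typing import Optional, Tuple


def parse_riff_header(header: bytes) -> Optional[Tuple[str, int, str]]:
    """Return (tag, chunk size, form type) for RIFF or RIFX, else None."""
    if len(header) < 12:
        return None
    tag = header[:4]
    if tag == b"RIFF":       # 52494646, little-endian (Intel) integers
        size_format = "<I"
    elif tag == b"RIFX":     # 52494658, big-endian (Motorola) integers
        size_format = ">I"
    else:
        return None
    (size,) = struct.unpack(size_format, header[4:8])
    form_type = header[8:12].decode("ascii", errors="replace")
    return tag.decode("ascii"), size, form_type
```

Only the first four bytes change between the two, so a RIFX signature would presumably be 'RIFX' (0x52494658) by analogy with the RIFF one.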

Existing format signatures for RIFF-based formats would then need priorities added. I'm wondering whether there is (or should be) some mechanism to mark a signature as generic, since RIFF is (quoting the spec) "not actually a file format itself". That would give a more general means to filter recognized types down to those which only define a structure and need further looking into to find the concrete format, without specifying a specific format structure beforehand. Container signatures as currently defined and used (i.e. expecting "Files" inside the container) wouldn't fit (yet?), and the @Specificity attribute on InternalSignatures seems to have been deprecated at some point. Is there some history there?

nkrabben commented 1 year ago

In a similar vein, it might be good to have a generic signature for an IFF parent format. Right now, PRONOM only contains 2 IFF formats: fmt/135 and fmt/338. I'm mostly parking this as an idea as we start looking into our collections with Amiga files.

Dclipsham commented 1 year ago

IFF already has a parent ID - https://www.nationalarchives.gov.uk/PRONOM/x-fmt/157

nkrabben commented 1 year ago

Huh, can't believe I missed that. 🤦🏻‍♂️

tnafrancesca commented 1 year ago

So this proposal was a feature in v.114. If everyone is happy I will close the issue.

tnafrancesca commented 1 year ago

Closing issue as I believe it has now been resolved.