digital-preservation / droid

DROID (Digital Record and Object Identification)
BSD 3-Clause "New" or "Revised" License

Add WACZ to 'Analyse contents of archive files' #1087

Open Dclipsham opened 7 months ago

Dclipsham commented 7 months ago

Just to formalise a comment from #887,

fmt/1840 is emerging as a leading format for web archiving. Structurally it is a zip file containing a JSON manifest file and other structural elements along with payload data (see e.g. https://loc.gov/preservation/digital/formats/fdd/fdd000586.shtml and https://specs.webrecorder.net/wacz/1.1.1/)

Currently it identifies by Container Signature. It would be extremely useful to be able to recursively probe the contents of WACZ in the same manner it is already possible to probe the other web archive container formats, WARC and ARC.
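
For readers unfamiliar with the layout described above, here is a minimal Python sketch of opening a WACZ as an ordinary ZIP and reading its JSON manifest. It is not DROID code. It assumes the manifest is the root-level `datapackage.json` described in the WACZ specification linked above; the helper names `read_wacz_manifest` and `make_sample_wacz` and the sample contents are hypothetical, invented only for illustration.

```python
import io
import json
import zipfile


def read_wacz_manifest(source):
    """Return (member_names, manifest_dict) for a WACZ-like ZIP.

    `source` may be a filesystem path or a binary file-like object.
    Assumes the manifest is stored as 'datapackage.json' at the ZIP root.
    """
    with zipfile.ZipFile(source) as zf:
        names = zf.namelist()
        manifest = json.loads(zf.read("datapackage.json").decode("utf-8"))
    return names, manifest


def make_sample_wacz():
    """Build a tiny in-memory stand-in for a WACZ (hypothetical contents)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        manifest = {"resources": [{"path": "archive/data.warc.gz"}]}
        zf.writestr("datapackage.json", json.dumps(manifest))
        # Placeholder payload: these bytes are not a real WARC file.
        zf.writestr("archive/data.warc.gz", b"placeholder")
    buf.seek(0)
    return buf


if __name__ == "__main__":
    names, manifest = read_wacz_manifest(make_sample_wacz())
    print(names)
    print(manifest["resources"])
```

The point of the sketch is only that the wrapper is a plain ZIP, so any ZIP reader can enumerate the members without interpreting the manifest.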

kathaurielle commented 7 months ago

Belated thanks David! What can I advise the user about this change in DROID's behaviour after the addition of fmt/1840?

Prior to the addition of the wacz sig 1840 in v.110, wacz files were IDed as zip files, and the DROID report listed all the files inside them.

Now, with the addition of fmt/1840, however, DROID doesn't scan inside the files; it just IDs them as wacz and outputs a one-line DROID report.

Is a fix on the radar, where it scans inside wacz files? Thank you! KP.

Dclipsham commented 7 months ago

Well the reason is that a container signature identification is considered more definitive than a zip (because many container based formats use zip as their wrapper) so when it finds the WACZ elements it is looking for, DROID doesn't currently go further.

The answer is, as with the other web archive formats ARC/WARC, to program DROID to scan the contents of WACZ also. It's really more a new feature request than a 'fix', and timing is a question for @steve-daly and @sparkhi as it requires development effort.

iholliew commented 4 months ago

@Dclipsham @kathaurielle I’m now all clear on why DROID no longer reports on the contents of WACZ files. Hopefully, there will be a fix for this in time. I can only say that from my perspective and the way that we use DROID in my workplace to analyse digital collections, it is more important for us to be able to inspect the contents of a WACZ file than for the WACZ file to have its own unique identifier. This is why we’re opting to use an earlier signature file that supports this functionality (specifically, ‘V109’), rather than later releases.

steve-daly commented 4 months ago

@Dclipsham @iholliew what would you like the behaviour to be if we added WACZ to the Web Archive archive formats? The reason it's more complicated than plain ZIP etc. is that the WACZ spec gives more meaning to some of the contained files/structure in the WACZ file, so parsing a WACZ file is more than just showing the contents arbitrarily. We could just treat WACZ as Zip when decoding this way, but the need to understand the JSON manifest (for example) caused us to pause on this.

Dclipsham commented 4 months ago

I personally would just like to decode the zip so it can be explored in the CSV output. I don't have a requirement for DROID to do anything clever with the JSON manifest.
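
To make the requested behaviour concrete, here is a rough Python sketch of "just decode the zip": it lists every member of a ZIP-wrapped file as a CSV row and ignores the manifest's meaning entirely. This is an illustration, not DROID's implementation. The function name `zip_listing_csv`, the column names `PATH`, `NAME`, `SIZE`, and the `container!/member` path style are all assumptions made for the example and do not reflect DROID's actual export schema.

```python
import csv
import io
import zipfile


def zip_listing_csv(source, container_name):
    """Return CSV text with one row per file stored in a ZIP-based container.

    `source` is a path or binary file-like object; `container_name` is only
    used to build a display path for each member (hypothetical format).
    """
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["PATH", "NAME", "SIZE"])  # assumed column names
    with zipfile.ZipFile(source) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue  # report files only, skip directory entries
            name = info.filename.rsplit("/", 1)[-1]
            writer.writerow(
                [f"{container_name}!/{info.filename}", name, info.file_size]
            )
    return out.getvalue()


if __name__ == "__main__":
    # Build a small in-memory ZIP so the example runs without a real WACZ.
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("datapackage.json", "{}")
        zf.writestr("archive/data.warc.gz", b"abc")
    buf.seek(0)
    print(zip_listing_csv(buf, "crawl.wacz"))
```

A listing like this treats `datapackage.json` as just another member, which matches the request here: the contents become visible in the report without any WACZ-specific parsing.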

iholliew commented 4 months ago

Thanks @steve-daly. I second @Dclipsham's comments. I'm only interested in using DROID to report on the objects contained within the wacz file, like in 'v109'.