digital-preservation / droid

DROID (Digital Record and Object Identification)
BSD 3-Clause "New" or "Revised" License

GZIP as a container trigger #774

Open thorsted opened 2 years ago

thorsted commented 2 years ago

Our Rosetta Working Group has identified a couple of file formats that use GZIP instead of regular ZIP as a container for an existing format.

Could DROID add a "container" trigger and parser to identify files like this, similar to ZIP/OLE?

CC2022-S01.prproj.zip

ross-spencer commented 2 years ago

@thorsted there's a bit of a discussion about GZIP here, I don't know exactly how close the gzip part of the discussion there is to what you're looking for here: https://github.com/digital-preservation/droid/issues/221

thorsted commented 2 years ago

@ross-spencer ohh, thank you. I have a vague memory of seeing this but only searched for open issues. Do you feel this is the right direction for identification or is there another way?

ross-spencer commented 2 years ago

@thorsted I haven't looked at it in a while. That piece was for a client (one of the statistical outputs of Dataverse) but I didn't have a massive amount of time to dig into it. I feel like we're both describing a clear two-step process: identify the gzip, then identify its contents, which are known and which together are their own discrete "thing". That looks a lot like container identification. Issue 221 had a broad scope that a few people strongly disagreed with, but I also feel that if it technically makes sense to treat gzip like a container, or it can be treated like one, then there are benefits.
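The two-step process described above could be sketched roughly as follows. This is a minimal illustration in Python, not DROID's container identification code: the function name `identify_gzip_container` and the `INNER_SIGNATURES` table are hypothetical, and the inner signatures are placeholders rather than PRONOM data.

```python
import gzip
import io

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip member

# Hypothetical inner signatures, for illustration only (not PRONOM entries).
INNER_SIGNATURES = {
    b"<?xml": "XML-based payload",
    b"%PDF-": "PDF payload",
}


def identify_gzip_container(data: bytes, peek: int = 64) -> str:
    """Step 1: confirm the gzip wrapper. Step 2: match the decompressed start."""
    if not data.startswith(GZIP_MAGIC):
        return "not gzip"
    # Decompress only the first few bytes; enough for a start-of-stream match.
    with gzip.GzipFile(fileobj=io.BytesIO(data)) as stream:
        head = stream.read(peek)
    for magic, label in INNER_SIGNATURES.items():
        if head.startswith(magic):
            return f"gzip containing {label}"
    return "gzip (contents unidentified)"
```

Only the start of the decompressed stream is read, so the check stays cheap even for large files; a real container signature would likely need more than a prefix match.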

thorsted commented 1 year ago

Ran across another format based on gzip: the "Art of Illusion" AOI 3D format. The format is gzipped with a variable filename, so it is a double problem.
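To illustrate why a variable filename is awkward, here is a small sketch that reads the optional original-name (FNAME) field from a gzip header, following the layout in RFC 1952. It assumes that "variable filename" refers to this header field, which the comment does not confirm; the helper name `gzip_original_name` is hypothetical.

```python
import struct

FEXTRA = 0x04  # header flag: optional extra field present
FNAME = 0x08   # header flag: original file name present


def gzip_original_name(data: bytes):
    """Return the FNAME field of the first gzip member, or None if absent."""
    if len(data) < 10 or data[:2] != b"\x1f\x8b":
        raise ValueError("not a gzip stream")
    flags = data[3]
    pos = 10  # fixed header: magic, method, flags, mtime, xfl, os
    if flags & FEXTRA:
        # Skip the extra field: 2-byte little-endian length, then payload.
        (xlen,) = struct.unpack_from("<H", data, pos)
        pos += 2 + xlen
    if not flags & FNAME:
        return None
    end = data.index(b"\x00", pos)  # name is zero-terminated
    return data[pos:end].decode("latin-1")
```

Because the stored name can differ from file to file or be missing entirely, an identification rule would probably have to match on the decompressed bytes rather than on a fixed inner path, which is how the ZIP container signatures usually work.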

ross-spencer commented 1 week ago

JSONL is one I discovered today. The recommendation is that it can be compressed as gz or bz2 to save space, so I suspect it will often be seen in this form (the one I ran into was).
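A rough heuristic for spotting compressed JSONL might look like the sketch below. It is an assumption-laden illustration, not a proposed DROID signature: the function name `looks_like_compressed_jsonl` is hypothetical, and it only checks that the first few non-empty decompressed lines each parse as JSON.

```python
import bz2
import gzip
import json


def looks_like_compressed_jsonl(data: bytes, max_lines: int = 5) -> bool:
    """Heuristic: gzip or bzip2 wrapper whose first lines are each valid JSON."""
    if data[:2] == b"\x1f\x8b":      # gzip magic
        raw = gzip.decompress(data)
    elif data[:3] == b"BZh":          # bzip2 magic
        raw = bz2.decompress(data)
    else:
        return False
    # Note: decompresses everything; a real tool would stream instead.
    lines = [ln for ln in raw.splitlines() if ln.strip()][:max_lines]
    if not lines:
        return False
    try:
        for ln in lines:
            json.loads(ln)
    except ValueError:  # covers JSONDecodeError and bad encodings
        return False
    return True
```

A single-line JSON document would also pass this check, so it cannot distinguish JSONL from plain JSON on its own.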

https://jsonlines.org/

steve-daly commented 1 week ago

@MancunianSam do you think we might be able to look at this ticket at some point?