Open thorsted opened 2 years ago
@thorsted there's a bit of a discussion about GZIP here; I don't know exactly how close the gzip part of that discussion is to what you're looking for: https://github.com/digital-preservation/droid/issues/221
@ross-spencer ohh, thank you. I have a vague memory of seeing this but only searched for open issues. Do you feel this is the right direction for identification or is there another way?
@thorsted I haven't looked at it in a while. That piece was for a client (one of the statistical outputs of Dataverse) but I didn't have a massive amount of time to dig into it. I feel like we're both describing a clear two-step process: identify gzip, then identify the contents of the gzip, which are known and together are their own discrete "thing". That looks a lot like container identification. Issue 221 had a broad scope which a few people strongly disagreed with, but I also feel that if it technically makes sense to treat gzip like a container, or it can be treated like a container, then there are benefits.
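The two-step process described above can be sketched roughly as follows. This is only an illustration in Python, not how DROID implements container identification: the function name `identify_gzip_payload` and the `INNER_SIGNATURES` table are hypothetical, and the inner signatures are placeholders rather than real PRONOM entries.

```python
import gzip
import io
import zlib
from typing import Optional, Tuple

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip member

# Hypothetical inner signatures, for illustration only.
INNER_SIGNATURES = {
    b"<?xml": "XML-based payload",
    b"{": "JSON-like payload",
}


def identify_gzip_payload(data: bytes, peek: int = 64) -> Tuple[bool, Optional[str]]:
    """Return (is_gzip, inner_label) for a byte string.

    Step 1: check the gzip magic number.
    Step 2: decompress only the first `peek` bytes and match them
    against the known inner signatures.
    """
    if not data.startswith(GZIP_MAGIC):
        return (False, None)
    try:
        with gzip.GzipFile(fileobj=io.BytesIO(data)) as gz:
            head = gz.read(peek)
    except (OSError, EOFError, zlib.error):
        # Looks like gzip but the stream is truncated or corrupt.
        return (True, None)
    for signature, label in INNER_SIGNATURES.items():
        if head.startswith(signature):
            return (True, label)
    return (True, None)
```

A file that passes step 1 but matches nothing in step 2 would simply stay identified as plain gzip, which is the same fallback a ZIP or OLE2 container gets when no container signature matches.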
Ran across another format based on gzip: the "Art of Illusion" AOI 3D format. The format is gzipped with a variable filename, so it's a double problem.
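If the "variable filename" refers to the optional original-name field in the gzip header, a signature cannot rely on a fixed byte offset for whatever follows it. The sketch below, assuming the header layout from RFC 1952, shows how that field could be read and skipped; `gzip_original_name` is a hypothetical helper name, not part of any existing tool.

```python
import struct

FEXTRA = 0x04  # header flag: an extra field is present
FNAME = 0x08   # header flag: a zero-terminated original file name is present


def gzip_original_name(data):
    """Return the original file name stored in a gzip header, or None."""
    # The fixed part of the header is 10 bytes: magic, method, flags,
    # mtime (4 bytes), extra flags, OS.
    if len(data) < 10 or data[:2] != b"\x1f\x8b":
        return None
    flags = data[3]
    pos = 10
    if flags & FEXTRA:
        if len(data) < pos + 2:
            return None
        (xlen,) = struct.unpack_from("<H", data, pos)  # little-endian length
        pos += 2 + xlen
    if not flags & FNAME:
        return None
    end = data.find(b"\x00", pos)
    if end == -1:
        return None
    # RFC 1952 specifies ISO 8859-1 (Latin-1) for the name.
    return data[pos:end].decode("latin-1")
```

Because the name length varies from file to file, matching on the decompressed content rather than on raw header offsets avoids the problem entirely.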
JSONL is one I discovered today. The recommendation is that it can be compressed as gz or bz2 to save space, so I suspect it will often be seen in this form (the one I ran into was).
@MancunianSam do you think we might be able to look at this ticket at some point?
Our Rosetta Working Group has identified a couple of file formats which use GZIP instead of regular ZIP as a container for an existing format.
Could DROID add a "container" trigger and parser to identify files like this, similar to ZIP/OLE?
CC2022-S01.prproj.zip