LibraryOfCongress / bagit-spec


Convention for ZIP archiving a bag? #39

Open stain opened 5 years ago

stain commented 5 years ago

Earlier drafts of bagit included a section on Serialization that recommended archiving a bag as a single folder in the zip/tar, with bagit.txt etc. inside that folder (rather than bagit.txt being in the root).

This was removed in 94bcdaaa310f41f53844ae2d25afcba2306b58cb by @acdha but I can't see the discussion that led to this.

Is this still the convention for ZIP or .tar archiving a bag? I see that https://github.com/fair-research/bdbag by @mikedarcy et al assumes the opposite: it will only validate a zip correctly if bagit.txt is in the root.

I am asking because we are making such archives for downloading a bag, and obviously would want to be on the right side of the convention :)
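To make the two layouts concrete, here is a minimal Python sketch that writes a bag directory to a ZIP either way. The function name `zip_bag` and its `nested` flag are hypothetical, chosen only for illustration; this is not taken from the bagit spec or from bdbag, and it assumes `zip_path` lies outside `bag_dir`.

```python
import os
import zipfile


def zip_bag(bag_dir, zip_path, nested=True):
    """Write bag_dir to zip_path (hypothetical helper, for illustration).

    nested=True  -> entries look like "mybag/bagit.txt" (old draft convention)
    nested=False -> entries look like "bagit.txt" (flat layout)
    """
    bag_dir = os.path.abspath(bag_dir)
    bag_name = os.path.basename(bag_dir)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for dirpath, _dirnames, filenames in os.walk(bag_dir):
            for filename in filenames:
                full_path = os.path.join(dirpath, filename)
                rel_path = os.path.relpath(full_path, bag_dir)
                # ZIP entry names use forward slashes on every platform.
                arcname = rel_path.replace(os.sep, "/")
                if nested:
                    arcname = bag_name + "/" + arcname
                zf.write(full_path, arcname)
```

With `nested=True` the bag's directory name is preserved inside the archive; with `nested=False` the tag files sit at the archive root.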

stain commented 5 years ago

I can see arguments for both approaches:

root: predictable location, e.g. you don't need to do an effective zip/tar file listing to recursively find bagit.txt (the manifest files should give the file listing anyway). Easier to "mount" the archive for in-memory access.

directory: less surprise for naive unzipping, don't pollute your download/tmp directory with lots of files, same convention as software distribution archives. Preserves bag directory "name" as if it had been on the file system.
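A reader that wants to accept both layouts could inspect the archive listing before extracting anything. The sketch below is only an illustration of that idea: `find_bag_root` is a hypothetical name, and the rule it applies (accept bagit.txt at the root, or inside exactly one top-level directory) is an assumption, not something the spec or bdbag prescribes.

```python
import zipfile


def find_bag_root(zip_path):
    """Return the prefix inside the ZIP where bagit.txt lives.

    "" means the flat layout (bagit.txt in the archive root),
    "name/" means the single-directory layout, None means neither.
    """
    with zipfile.ZipFile(zip_path) as zf:
        names = zf.namelist()
    if "bagit.txt" in names:
        return ""
    # Collect the first path component of every entry.
    top_level = {name.split("/", 1)[0] for name in names if name}
    if len(top_level) == 1:
        prefix = top_level.pop() + "/"
        if prefix + "bagit.txt" in names:
            return prefix
    return None
```

Only the listing is read here, so the check costs one pass over the ZIP entry names and no extraction.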

jscancella commented 5 years ago

I don't have access to the emails where we had the discussion anymore, but if I remember correctly we removed it because it is really outside the scope of the bagit spec. Meaning, we shouldn't really care how you zip it because you will have to unzip it before validation.

stain commented 5 years ago

That's all well and good, but it is hard for a computer to consistently "just unzip it", so we wondered whether it is still best practice to do as in the old spec, or whether a "flat" ZIP as in bdbag is preferable?

jscancella commented 5 years ago

Sounds like it needs its own specification then. How you compress or otherwise change folders and files on an operating system is beyond the scope of bagit specification. Its main purpose is to ensure all files have been transferred correctly and completely.

jkunze commented 4 years ago

Sorry to enter this discussion so late. The reason I originally wrote up (I think it was me) the serialization section was to avoid the nasty mess that can be created when unzipping/untarring. Opening a serialization can be a much more civilized experience with suitable constraints. While a serialization is not part of a bag, it is only one layer removed and so is an important bag interface. I don't recall exactly why the section was removed.