LibraryOfCongress / bagit-python

Work with BagIt packages from Python.
http://libraryofcongress.github.io/bagit-python

Support bagging to a destination other than the source #35

Open finoradin opened 9 years ago

finoradin commented 9 years ago

Currently the module only allows one to do what the LOC Java library calls "bag in place". It would be very useful to have a built-in way to specify one or more payloads as the "source" and then a "destination" where the bag containing those payloads will be created.

Minor but important note – the hashes in the manifest should be generated from the source payloads, not the copied files in the bag.

finoradin commented 8 years ago

bump

acdha commented 8 years ago

@finoradin Thanks – I'll work on a pull request for this. Strong +1 on checksumming the source files. The way I typically do things like that is with streaming reads at some large block size, since that's also good for performance on network filesystems, and it's a natural extension to feed each block into the hashers as well.
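
The streaming approach described above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not code from bagit-python: the function name `copy_and_hash`, its parameters, and the 1 MiB default block size are all assumptions made for the example.

```python
import hashlib


def copy_and_hash(src_path, dst_path, algorithms=("sha256", "sha512"),
                  block_size=1024 * 1024):
    """Copy src_path to dst_path, hashing each block as it is read.

    Hypothetical helper: the digests are computed from the bytes read
    from the source, so they describe the source file rather than the copy.
    """
    hashers = {name: hashlib.new(name) for name in algorithms}
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            block = src.read(block_size)
            if not block:
                break
            # Feed the same block to every hasher, then write it out,
            # so the source is only read once.
            for hasher in hashers.values():
                hasher.update(block)
            dst.write(block)
    return {name: hasher.hexdigest() for name, hasher in hashers.items()}
```

Because each file is read only once, the cost of checksumming the source is mostly the hashing itself, which matters on slow or network-mounted storage.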

runderwood commented 7 years ago

I'm willing to take this on. Would it be agreeable to add a --destination option, which, when present, tells bagit.py to draw on the bag_dir args as sources and create the new bag in the destination path?

I would, of course, checksum the files as described above.
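
As an illustration of the proposal above, a `--destination` option could be parsed along these lines. This is a standalone `argparse` sketch, not bagit.py's actual command-line parser: `build_parser` and `plan_actions` are hypothetical names, and only the `--destination` flag and the `bag_dir` arguments come from the comment itself.

```python
import argparse


def build_parser():
    """Build a toy parser with the proposed --destination option."""
    parser = argparse.ArgumentParser(
        description="Sketch of bagging to a destination (illustrative only)")
    parser.add_argument(
        "--destination", metavar="PATH", default=None,
        help="create the bag here instead of bagging in place")
    parser.add_argument(
        "bag_dir", nargs="+",
        help="directory to bag in place, or a source if --destination is set")
    return parser


def plan_actions(args):
    """Return (mode, source, target) tuples describing what would be done."""
    if args.destination is None:
        # No destination: each bag_dir is bagged in place, as today.
        return [("in-place", d, d) for d in args.bag_dir]
    # Destination given: every bag_dir is treated as a source.
    return [("copy", d, args.destination) for d in args.bag_dir]
```

For example, `plan_actions(build_parser().parse_args(["--destination", "out", "foo", "bar"]))` treats `foo` and `bar` as sources for a single bag at `out`.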

acdha commented 7 years ago

@runderwood +1 – that should cover the most common use-cases

runderwood commented 7 years ago

@acdha Clarification: With multiple sources, should:

  1. ...multiple bags be generated?
  2. ...one bag be generated with the sources merged?
  3. ...one bag be generated with directories for each source?

johnscancella commented 7 years ago

I would vote for choice 3: one bag with a directory for each source.

runderwood commented 7 years ago

@johnscancella I can see that being useful, but I also find it counter-intuitive, since directories in bagit world are usually transmuted into payloads at the root of data/.

EDIT: I also should mention that merging directories seems consistent w/ some of the use cases that have come up for me. But that's also a bit weird, I suppose -- though maybe no weirder than in-place bagging.

runderwood commented 7 years ago

FYI, I have a branch where bagging to a destination (with one source only) seems to work alright.

johnscancella commented 7 years ago

I would propose that, given the source directories foo, bar, and ham, you would end up with something that looks like this:

├── bag-info.txt
├── bagit.txt
├── data
│   ├── bar
│   ├── foo
│   └── ham
├── manifest-md5.txt
└── tagmanifest-md5.txt

runderwood commented 7 years ago

OK. Sounds good to me. I'll give it a shot.

runderwood commented 7 years ago

@johnscancella Sorry, one more question: does it make more sense to you for the behavior to stay the same with just one source, so that even with a single bag_dir arg ham/, the data directory contains one directory with that name? Something like this:

├── bagit.txt
├── data
│   └── ham
├── manifest-md5.txt
└── tagmanifest-md5.txt

...or should, in that case, the contents of ham be dropped into the root of data?

johnscancella commented 7 years ago

I would go with ham/ being dropped into the data directory, since that is how it currently behaves, i.e.:

.
└── ham
    ├── bar
    └── foo

becomes

.
├── bag-info.txt
├── bagit.txt
├── data
│   └── ham
│       ├── bar
│       └── foo
├── manifest-sha256.txt
├── manifest-sha512.txt
├── tagmanifest-sha256.txt
└── tagmanifest-sha512.txt
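
One reading of the layouts discussed in this thread is that each source directory ends up under `data/` under its own name, whether there is one source or several. The sketch below computes that mapping without copying anything. It is an assumption-laden illustration, not bagit-python code: `plan_payload` is a hypothetical name, and rejecting two sources with the same base name is a design choice made here, not something the thread decided.

```python
import os


def plan_payload(sources, destination):
    """Map each file under the source directories to a path in destination/data/.

    Hypothetical helper: every source directory becomes data/<basename>/.
    """
    plan = {}
    seen_names = set()
    for source in sources:
        source = os.path.normpath(source)   # "ham/" -> "ham"
        name = os.path.basename(source)
        if name in seen_names:
            # Two sources named e.g. "ham" would collide inside data/.
            raise ValueError("duplicate source directory name: %s" % name)
        seen_names.add(name)
        for dirpath, _dirnames, filenames in os.walk(source):
            rel_dir = os.path.relpath(dirpath, source)
            for filename in filenames:
                src_file = os.path.join(dirpath, filename)
                # normpath collapses the "." produced for the top level.
                dst_file = os.path.normpath(
                    os.path.join(destination, "data", name, rel_dir, filename))
                plan[src_file] = dst_file
    return plan
```

The resulting mapping could then drive a copy step that checksums each source file as it is read, so the manifests reflect the source payloads.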