Open finoradin opened 9 years ago
bump
@finoradin Thanks – I'll work on a pull request for this. Strong +1 on checksumming the source files. The way I typically do things like that is with streaming reads at some large block size, since that's also good for performance on network filesystems, and it's a natural extension to feed each block into the hashers as well.
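For illustration, the streaming approach described above might look roughly like the following. This is a minimal sketch, not code from bagit.py: the function name `copy_and_hash`, its parameters, and the 1 MiB default block size are all assumptions made for the example.

```python
import hashlib


def copy_and_hash(src_path, dst_path, algorithms=("sha256",), block_size=1024 * 1024):
    """Copy src_path to dst_path in blocks, hashing each block as it is read.

    Hypothetical helper: the digests describe the bytes read from the
    source, not a re-read of the copied file.
    """
    hashers = {name: hashlib.new(name) for name in algorithms}
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            block = src.read(block_size)
            if not block:
                break
            # Feed the same block to every hasher, then write it out.
            for hasher in hashers.values():
                hasher.update(block)
            dst.write(block)
    return {name: hasher.hexdigest() for name, hasher in hashers.items()}
```

Each source file is read only once, which is the point on network filesystems: the copy and every requested digest come out of the same pass.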
I'm willing to take this on. Would it be agreeable to add a --destination option, which, when present, tells bagit.py to draw on the bag_dir args as sources and create the new bag in the destination path?
I would, of course, checksum the files as described above.
@runderwood +1 – that should cover the most common use-cases
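As a rough sketch of how the proposed option could be parsed: the parser below is a standalone illustration, not bagit.py's actual argument handling. The `--destination` flag is the proposal from this thread, and `build_parser` and the help strings are assumptions.

```python
import argparse


def build_parser():
    """Build a hypothetical parser mirroring the proposal in this thread."""
    parser = argparse.ArgumentParser(prog="bagit.py")
    parser.add_argument(
        "--destination",
        metavar="PATH",
        default=None,
        help="create the new bag here instead of bagging in place (proposed)",
    )
    parser.add_argument(
        "bag_dir",
        nargs="+",
        help="directories to bag; treated as sources when --destination is given",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    if args.destination is None:
        print("bag in place:", args.bag_dir)
    else:
        print("copy", args.bag_dir, "into new bag at", args.destination)
```

Leaving the default as `None` keeps the existing in-place behavior unchanged when the flag is absent.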
@acdha Clarification: With multiple sources, should:
I would vote for choice 3, one bag with each directory for each source
@johnscancella I can see that being useful, but I also find it counter-intuitive, since directories in bagit world are usually transmuted into payloads at the root of data/.
EDIT: I also should mention that merging directories seems consistent w/ some of the use cases that have come up for me. But that's also a bit weird, I suppose -- though maybe no weirder than in-place bagging.
FYI, I have a branch where bagging to a destination (with one source only) seems to work alright.
I would be proposing that given the source directories foo, bar, and ham you would end up with something that looks like this:
├── bag-info.txt
├── bagit.txt
├── data
│ ├── bar
│ ├── foo
│ └── ham
├── manifest-md5.txt
└── tagmanifest-md5.txt
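The layout above amounts to mapping each source directory to data/ plus its own name. A small sketch of that mapping follows; `payload_targets` is a hypothetical helper, and refusing duplicate directory names is an assumption on my part, not something decided in this thread.

```python
import os


def payload_targets(sources, destination):
    """Map each source directory to its place under <destination>/data/.

    Hypothetical helper: each source keeps its own directory name, and
    two sources with the same name are rejected rather than merged.
    """
    targets = {}
    seen = set()
    for source in sources:
        # normpath drops a trailing slash so "ham/" and "ham" behave the same.
        name = os.path.basename(os.path.normpath(source))
        if name in seen:
            raise ValueError("duplicate payload directory name: %s" % name)
        seen.add(name)
        targets[source] = os.path.join(destination, "data", name)
    return targets
```

For example, `payload_targets(["foo", "bar", "ham"], "newbag")` maps `foo` to `newbag/data/foo` (on a POSIX system), and likewise for the other two.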
OK. Sounds good to me. I'll give it a shot.
@johnscancella Sorry, but does it make more sense to you to have the behavior remain the same with just one source, such that even when providing only one bag_dir arg, ham/, the data directory likewise contains one directory with the same name? Something like this:
├── bagit.txt
├── data
│ └── ham
├── manifest-md5.txt
└── tagmanifest-md5.txt
...or should, in that case, the contents of ham be dropped into the root of data?
I would go with ham/ being dropped into the data directory since that is how it currently behaves.
i.e.
.
└── ham
    ├── bar
    └── foo
becomes
.
├── bag-info.txt
├── bagit.txt
├── data
│ └── ham
│     ├── bar
│     └── foo
├── manifest-sha256.txt
├── manifest-sha512.txt
├── tagmanifest-sha256.txt
└── tagmanifest-sha512.txt
Currently the module only allows one to do what the LOC Java library calls "bag in place". It would be very useful to have built-in the ability to specify one or more payloads as the "source" and to then specify a "destination" where the bag containing the payloads will be created.
Minor but important note – the hashes in the manifest should be generated from the source payloads, not the copied files in the bag.
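To make that note concrete, here is a sketch that builds manifest entries by hashing files where they sit in the source directory, while recording the path they would have inside the bag. `manifest_lines` is a hypothetical helper, and the data/<source name>/ prefix assumes the layout discussed in the comments on this issue.

```python
import hashlib
import os


def manifest_lines(source_dir, algorithm="sha256", block_size=1024 * 1024):
    """Return manifest lines whose digests come from the source files.

    Hypothetical helper: paths are written as they would appear in the
    bag (data/<source name>/...), but only the source is ever read.
    """
    source_dir = os.path.normpath(source_dir)
    top = os.path.basename(source_dir)
    entries = []
    for root, _dirs, files in os.walk(source_dir):
        for name in files:
            path = os.path.join(root, name)
            hasher = hashlib.new(algorithm)
            with open(path, "rb") as handle:
                # Read in blocks so large payload files are never held in memory.
                for block in iter(lambda: handle.read(block_size), b""):
                    hasher.update(block)
            relative = os.path.relpath(path, source_dir).replace(os.sep, "/")
            entries.append(("data/%s/%s" % (top, relative), hasher.hexdigest()))
    # Sort by payload path so the output is stable across runs.
    return ["%s  %s" % (digest, payload) for payload, digest in sorted(entries)]
```

A bag written this way can then be validated after the copy: any mismatch between the manifest and the files under data/ points to corruption during the transfer rather than in the source.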