fair-research / bdbag

Big Data Bag Utilities
https://fair-research.org
Apache License 2.0
49 stars 23 forks source link

creating bag from remote-file-manifest fails - quick fix proposed #26

Closed mjkrause closed 5 years ago

mjkrause commented 5 years ago

Hi, I created a JSON in Python from the remote file manifest in the README and wrote it to disk while developing. Then I tried to create a BDBag using bdbag_api.make_bag from it.

Problem: RuntimeError: Expected the same number of files for each checksum Solution: The reason (I believe) is that one file has both, an md5 sum and a sha256, while the other has only the sha256. it worked when after removed the md5 sum key and value from the first file. Note sure if this confuses others, but I would fix it so that a sample manifest can be created and used to verify bag creation during dev.

mikedarcy commented 5 years ago

This is a somewhat complex issue that is rooted in upstream code and changes to the bagit spec, where both the bagit spec writers and the bagit-python developers have now (for bagit spec 1.0) declared as incomplete bags that contain this kind of heterogeneous mixing of checksum types. I've detailed this a bit more in the release notes for 1.3.0 here. It is stated in the most recent draft of the spec here.

For what its worth, this is a change to the spec that I am very much in disagreement with. It seems like an entirely artificial limitation to me. I think I have a reasonable solution for this in bdbag, which involves allowing the user to specify the spec version compliance via the bdbag.json config file and API parameter. Specifying 0.97 would allow the less-strict heterogeneous mixing of files in multiple manifests, and specifying 1.0 would make bag creation strictly adhere to the requirements of the current spec. I would also make the 0.97 compatibility mode the default.

It was probably a mistake to include this rather aggressive upstream change into bdbag when we already had support for this kind of payload manifest mixing from the beginning, and it is actually quite a useful feature. I will rectify this in the next release of bdbag. Going forward, if there are those who need strict compliance with bagit version 1.0, they need only toggle it via configuration parameter.

mjkrause commented 5 years ago

Hi Mike, I've seen you presenting on BDBags in at least of the Data Commons / KC7 calls. Thanks for the detailed response, awesome stuff. Your comment is highly relevant to my work here, but some of it is over my head, but I'll read up on it. I was mostly concerned that people like me would go in and use the posted remote-file-manifest and create a bag from it, and then fail. At any rate, thanks for taking action. Best wishes

mikedarcy commented 5 years ago

I've updated the sample remote file manifest block in the config doc so that it won't trigger the error if used verbatim. I'm going to close this out and track the broader issue in #27. Thanks for reporting!