LibraryOfCongress / bagit-java

Java library to support the BagIt specification.
Other
73 stars 49 forks source link

SHA-1 manifest file format does not work with shasum #69

Closed whikloj closed 7 years ago

whikloj commented 7 years ago

When submitting an issue please include:

Running 5.0.0-BETA OS X 10.11.5

Please format it in the given when then style

For example (from link above):

Given

When

Then

We are using this library to develop an import/export feature using BagIt bags. However in the test implementation the manifest-SHA1.txt that is generated cannot be parsed by OSX's shasum command.

For example the file appears as follows.

199DD5F71C1610D99D0E163C9E665FC99ECED911 data/rest/Department3.University0.edu/AssociateProfessor6/Publication10.jsonld
53EBFA7C03515FA0ABC674F8C6CEDE9619CE2E26 data/rest/Department3.University0.edu/GraduateCourse13.jsonld
E99A10E72C10839C5B4CB9D7772575450481E8AB data/rest/Department3.University0.edu/AssociateProfessor7/Publication14.jsonld
62CCBE962497D11A2A333793AB0EADDB918A6EF1 data/rest/Department3.University0.edu/FullProfessor3/Publication16/Proxy.jsonld

However

> shasum -a1 -c manifest-SHA1.txt
shasum: manifest-SHA1.txt: no properly formatted SHA1 checksum lines found

I can't find a standard on what the format of this a SHA-1 file is supposed to be, but it seems like there is supposed to be an asterisk before the filename. In the BagIt spec that has been specified as optional, but if I add either an additional space or an asterisk before the file path then the lines are read perfectly.

ie. This fails to parse

199DD5F71C1610D99D0E163C9E665FC99ECED911 data/rest/Department3.University0.edu/AssociateProfessor6/Publication10.jsonld

But this

199DD5F71C1610D99D0E163C9E665FC99ECED911 *data/rest/Department3.University0.edu/AssociateProfessor6/Publication10.jsonld

or this

199DD5F71C1610D99D0E163C9E665FC99ECED911  data/rest/Department3.University0.edu/AssociateProfessor6/Publication10.jsonld

is readable as valid formatted SHA-1 checksum lines

johnscancella commented 7 years ago

https://tools.ietf.org/html/draft-kunze-bagit-14#section-2.1.3 specifies that the form is CHECKSUM FILENAME. We are currently working to make the specification more clear, but any amount of spaces are allowed. However, for the purpose of using shasum, you should always being using the binary mode (i.e. putting the * in front of the path). Bagit is designed to be used by Bagit, not the other way around. We mention using shasum as a way to manually create bagit bags (not not encouraged!).

Feel free to submit a pull request for this feature.

whikloj commented 7 years ago

Sorry I was referring to the statement An asterisk ('*') MAY preceed FILENAME for interoperability on some platforms (see Section 7.2.1).

But I get your point that BagIt is only for BagIt.

johnscancella commented 7 years ago

Right, but if you read that section it says

Implementors creating and processing bags with other tools might wish to be tolerant of asterisks found in the manifests.

Notice it doesn't have the MUST or SHOULD keywords, and thus is not really part of the specification. I was not there when the specification was originally drafted, but I believe the intent was to give some options when a bagit compliant tool was not available on a server but the user still wanted to use the specification to ensure the content was unchanged.

acdha commented 7 years ago

As a similar newcomer to the project, that was my impression as well: people might have had existing use of md5sum or shasum and wanted to be able to reuse existing manifests rather than re-reading a large amount of data which might even have been aged off to tape by that point.

If you have a use-case where it makes sense to use one of these tools rather than a command-line bagit utility, it might be good to expound upon it here so it could be reflected in the spec. It seems like it could be confusing if it was only implemented in bagit-java and someone's workflow breaks if a different tool was used at some point.

whikloj commented 7 years ago

@acdha I'm not really sure I have a good use case. We were using this library to format our exports as "bags" and looking at the manifest it appeared that we could have people use normal tools like md5sum and shasum to verify the checksums. But what I am hearing is that to verify a bag you need to use a BagIt tool. That's fine as if someone is going to export using the BagIt profile then they probably will have the necessary tools installed to verify the bag.

whikloj commented 7 years ago

So I'll close this. Thanks for the quick response all!

escowles commented 7 years ago

@acdha @johnscancella: A little context might help here: we're adding support for Bags to Fedora's new import/export tool. So mostly it's developers trying to quickly verify the Bags we're exporting with whatever tools we have on-hand. I think this is a use case worth considering, though — I suspect many people who generate Bags have Bag-specific validation tools, but some may not. And even if you do have Bag-specific tools, it may be more convenient to do a quick check with other tools like md5sum/shasum. So far, all of the non-Bag tools we've tried have failed (hence this ticket, and also https://jira.duraspace.org/browse/FCREPO-2354?focusedCommentId=53229&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-53229). Because the manifest files look very similar to the md5sum/shasum output files, this is pretty confusing.

My reading of the spec makes me think that using lowercase letters in the checksums, and adding an asterisk before the path name, would still be valid. They would also then be compatible with a wider variety of validation tools.

johnscancella commented 7 years ago

@escowles I looked at your ticket. While adding a space will work some of the time, it would be better to use the , because bagit expects to calculate the checksum over the bytes of the file (which is what the in the md5sum tools means). By not doing it in binary mode we have encountered some strange results, so better to just always use *. We talk more about this in the bagit specification here

What this means is that we would have to change the bagit specification to always put * in front of the path. @acdha what are your feelings about this?

As for the case of the checksum lettering, I think that is an issue with the Ruby implementation. The library of Congress does not maintain the ruby implementation, so you will have to file a bug with them.

escowles commented 7 years ago

@johnscancella I think it's worth considering updating the spec to be clearer in this area. Ideally, the format would be compatible with md5sum/shasum, using " " or " " as the separator between the checksum and filename. In the meantime, I think it would be nice to use " " as the separator, to be more compatible with common tools, and that would still be valid.

Also, we've filed a bug against the Ruby BagIt library we were using for not handling checksums with uppercase letters: https://github.com/tipr/bagit/issues/27 — and the method we were using to convert a binary checksum into a hex string was what was creating uppercase letters, so we can easily change that locally.

acdha commented 7 years ago

I'm 👍 on always using lowercase checksums since that should work with everything and I believe we already have tests for both in https://github.com/loc-rdc/bagit-conformance-suite/ as well as bagit-python. (Good thing to check, however)

I'm a somewhat reluctant 👎 on a leading asterisk by default since while that's clearly the best way to avoid problems with text mode on Windows (I hope nobody is bagging stuff on VMS but it's a big world…) it hasn't been required in the past and I would be surprised if didn't break some existing tools. Playing conservatively, I'd be tempted to suggest that we ensure that the separator by default is two spaces since that will work with both any complaint BagIt library and with the GNU coreutils *sum tools on any platform other than Windows. It looks like even Windows users would be covered if they used --binary to validate files, but I haven't tested that.

acdha commented 7 years ago

@escowles Speaking of Ruby BagIt, have you / @tipr seen https://github.com/loc-rdc/bagit-conformance-suite/ before? We haven't announced it but @johnscancella @dbrunton and I are now officially spending time supporting the spec and implementations which we use, and one of my goals is to have that suite available when https://github.com/loc-rdc/bagitspec/pull/1 lands. I've been planning to email the larger community as soon as we reach internal consensus but mostly it's been adding more edge-case handling & pedantic warnings rather than major changes.