Add normative references that define the digest algorithms and the specific encodings

zimeon commented 6 years ago

PR #36 adds a table of digest algorithms "known" to OCFL. We need to link these to definitions of the algorithms and the encodings that are assumed in OCFL

zimeon commented 6 years ago

I think a suitable set of references would be:

md5 - MD5 algorithm defined, including encoding, in RFC1321
sha1 - SHA-1 defined in RFC3174/FIPS 180-3 and MUST be encoded using base64 encoding RFC4648
sha256 - SHA-256 defined in FIPS 180-3 and MUST be encoded using base64 encoding RFC4648
sha512 - SHA-512 defined in FIPS 180-3 and MUST be encoded using base64 encoding RFC4648

md5 and sha1 are as defined in bagit

there is a question of whether we should reference the updated FIPS 180-4 instead of FIPS 180-3 which actually has a DOI: https://doi.org/10.6028/NIST.FIPS.180-4

awoods commented 6 years ago

:+1: to normative references of digest algorithms. Is there any reason not to use FIPS 180-4?

zimeon commented 6 years ago

Probably no reason to avoid FIPS 180-4 but I note that the IANA registry and RFC5843 cite FIPS 180-3

zimeon commented 6 years ago

@ahankinson noted on comments for #34: This section in the output of the shasum utility explains what the * is for:

The sums are computed as described in FIPS-180-4. When checking, the input should be a former output of this program. The default mode is to print a line with checksum, a character indicating type ('*' for binary, ' ' for text, '?' for portable, '^' for BITS), and name for each FILE.

zimeon commented 6 years ago

OK, it is scary that the standard shasum defaults to text mode on unix, we need somewhere an implementation note that says one should use -b / --binary. See confusion about this in relation to the bagit spec in https://github.com/LibraryOfCongress/bagit-java/issues/69 and the notes about possible leading asterisk in the current bagit draft: https://tools.ietf.org/html/draft-kunze-bagit-16#section-6.1.3

IMO, we should not perpetuate any fuzziness here. We should define calculation of checksums in terms of the appropriate specs and in terms of the require encoding of the output (base64). We should not allow different preprocessing modes for text and line endings. I think perhaps the most helpful way to offer a check here would be include in the spec example checksum outputs for each digest type for a particular file content that might be changed if anything other than straight/binary processing were applied.

Having said that, I'm not sure how to generate something that gives different outputs for binary/text/portable modes of shasum. I tried the following which writes all bytes 0 to 255, and various line ending combinations:

simeon@RottenApple ~> perl -e 'open(my $fh, "> :raw :bytes", "a"); foreach $j (0...255) { print {$fh} "$j ".chr($j)."\n"; } print {$fh} "line endings: \r\n \n\r \r \n"; close($fh)'
simeon@RottenApple ~> shasum -t a; shasum -b a; shasum -p a
f7867717259f8026e014e4c56e1b4683c049e80c  a
f7867717259f8026e014e4c56e1b4683c049e80c *a
f7867717259f8026e014e4c56e1b4683c049e80c ?a

zimeon commented 6 years ago

I now see the following in the bagit spec:

md5sum can be run in "text mode" which causes it to normalize line- endings on some operating systems. On Unix-like systems both modes will usually produce the same results but on systems like Windows they can produce different results based on the file contents.

so maybe the test above is expected to have consistent outputs on Unix but might produce different results on a Windows box.

OCFL / spec

Add normative references that define the digest algorithms and the specific encodings #39