Problem: md5deep checksum validation for multiple files with same checksum

mjaddis commented 5 years ago

Expected behaviour: If I have a transfer with multiple files that happen to have the same checksum and these are listed correctly in checksum.md5 in the metadata directory, then I expect checksum validation to pass at the Transfer stage of the workflow.

Current behaviour: Checksum validation fails with error code 110. However, logs from the script show that the md5deep tests actually passed, for example:

PASSED
/var/archivematica/sharedDirectory/currentlyProcessing/md5deep1-93325aa1-1a22-46f6-a815-1adec5b47fbe/objects/earth1.jpg
/var/archivematica/sharedDirectory/currentlyProcessing/md5deep1-93325aa1-1a22-46f6-a815-1adec5b47fbe/objects/earth2.jpg

2 items passed integrity checking

FAILED

0 items failed integrity checking

Steps to reproduce: Create a transfer containing two files that have different filenames but the same checksum. Use md5deep to generate a manifest and put this in checksum.md5. For example:

6b72dc8ff4cd45df12e971a466fabc09  objects/earth1.jpg
6b72dc8ff4cd45df12e971a466fabc09  objects/earth2.jpg

It looks like md5deep is exiting with a 1 when there is more than one entry in the manifest that matches a given file in the objects directory. If there aren't duplicates then it exits 0. However, ignoring the exit code, the actual test still passes. The Archivematica script concatenates the return codes of md5deep -r -m, md5deep -r -x and the count of failed files. Hence the 110 code.
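To make the 110 concrete, here is a minimal sketch of the concatenation. The three status values are the ones reported in this issue; the variable names are illustrative and this is not the Archivematica script itself. The script uses bash's `ret+=$x`, which appends text when `ret` is an ordinary variable; the portable form below does the same thing.

```shell
# Status values taken from the failing run described in this issue.
match_status=1      # matching mode exited 1
negative_status=1   # negative matching mode exited 1
failed_count=0      # no files were listed as failed

ret=0
# Each step appends a digit to the string rather than adding numbers.
ret="${ret}${match_status}"      # "01"
ret="${ret}${negative_status}"   # "011"
ret="${ret}${failed_count}"      # "0110"

echo "$ret"
```

The resulting string is what gets handed to `exit`, which is consistent with the error code 110 seen in the workflow even though zero files failed.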

Perhaps it would be better to use md5sum -c instead of md5deep, as this would check the correct entry in the manifest against the correct payload file in the objects dir?
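As a rough illustration of that suggestion, the sketch below builds two files with identical content, lists both in a manifest, and verifies it with GNU `md5sum -c`. The file names echo the ones above; the file content and the temporary directory are placeholders, and the manifest sits next to objects/ purely for convenience.

```shell
# Two payload files with the same content, so they share one checksum.
workdir=$(mktemp -d)
mkdir "$workdir/objects"
echo "same content" > "$workdir/objects/earth1.jpg"
cp "$workdir/objects/earth1.jpg" "$workdir/objects/earth2.jpg"

# Write a manifest with relative paths, then verify it line by line.
(cd "$workdir" && md5sum objects/* > checksum.md5)
(cd "$workdir" && md5sum -c checksum.md5)
verify_status=$?

echo "md5sum -c exit status: $verify_status"
```

Because `md5sum -c` checks each listed path against the hash on the same line, duplicate checksums in the manifest should not affect the result.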

Run this as a standard Transfer.

Your environment (version of Archivematica, OS version, etc): AM1.7.2 on 16.04 LTS

For Artefactual use: Please make sure these steps are taken before moving this issue from Review to Verified in Waffle:

All PRs related to this issue are properly linked 👍
All PRs related to this issue have been merged 👍
Test plan for this issue has been implemented and passed 👍
Documentation regarding this issue has been written and it has been added to the release notes, if needed 👍

ross-spencer commented 5 years ago

Thanks @mjaddis.

Seeing this, I was concerned about what was happening, and whether spurious results could be generated or some errors could go unreported when using the md5deep hash-check, e.g. could files/hashes be missed if not listed correctly?

I modified the script archivematicaCheckMD5NoGUI.sh to generate a lot of additional output, here:

#!/bin/bash
#
# This file is part of Archivematica.
#
# Copyright 2010-2013 Artefactual Systems Inc. <http://artefactual.com>
#
# Archivematica is free software: you can redistribute it and/or modify
# it under the terms of the GNU Affero General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# Archivematica is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with Archivematica.  If not, see <http://www.gnu.org/licenses/>.

# @package Archivematica
# @subpackage archivematicaClientScript
# @author Joseph Perry <joseph@artefactual.com>
# @version svn: $Id: 09716783f2866a77f1240067a610b19103f69c1b $

#create temp report files
UUID=`uuid`
failTmp=/tmp/fail-$UUID
passTmp=/tmp/pass-$UUID
reportTmp=/tmp/report-$UUID

checkFolder="$1"
md5Digest="$2"
integrityReport="$3"
checksumTool="$4"

tmpDir=`pwd`
ret=0

cd "$checkFolder"

echo "^^^"
echo "-- Current directory listing:"
echo
echo "$(ls)"
echo
echo "^^^"
echo "-- Provided MD5 digest:"
echo
echo "$(< $md5Digest)"
echo
echo "^^^^^"

#check for passing checksums
"${checksumTool}" -w -l -r -M "$md5Digest" . > $passTmp
rx="$?"
echo
echo "-- relative exit code for -m flag"
echo $rx
echo
echo "-- passTmp output"
echo
echo "$(< $passTmp)"
echo

ret+=$rx

#check for failing checksums
"${checksumTool}" -w -l -r -X "$md5Digest" . > $failTmp
ry="$?"
echo
echo "-- relative exit code for -x flag"
echo $ry
echo
echo "-- failTmp output"
echo
echo "$(< $failTmp)"
echo

ret+=$ry

# ------------

#check for passing checksums
"${checksumTool}" -w -r -M "$md5Digest" . > $passTmp
rx="$?"
echo
echo "-- non-relative exit code for -m flag"
echo $rx
echo
echo "-- passTmp output"
echo
echo "$(< $passTmp)"
echo

ret+=$rx

echo "---"

#check for failing checksums
"${checksumTool}" -w -r -X "$md5Digest" . > $failTmp
ry="$?"
echo
echo "-- non-relative exit code for -x flag"
echo $ry
echo
echo "-- failTmp output"
echo
echo "$(< $failTmp)"
echo

ret+=$ry

echo "---"

# Change directory to temporary directory to read results...
cd $tmpDir

#Count number of Passed/Failed
numberPass=`wc -l $passTmp| cut -d" " -f1`
numberFail=`wc -l $failTmp| cut -d" " -f1`

#Create report
echo "PASSED" >> $reportTmp
cat $passTmp >> $reportTmp
echo " " >> $reportTmp
echo $numberPass "items passed integrity checking" >> $reportTmp
echo " " >> $reportTmp
echo " " >> $reportTmp
echo "FAILED" >> $reportTmp
cat $failTmp >> $reportTmp
echo " " >> $reportTmp
echo $numberFail "items failed integrity checking" >> $reportTmp

#copy pasta
cp $reportTmp "$integrityReport"
cat $failTmp 1>&2

#cleanup
rm $failTmp $passTmp $reportTmp

ret+="$numberFail"

exit ${ret}

It would seem that part of the problem lies in the difference between relative and absolute paths. Looking at the man pages, it's not entirely clear why, but it doesn't seem that hashdeep can do anything very clever using either method.

In the above script, I do manage to engineer a pass using relative paths (output and comparison using -l, with ./ prepended to each line of the manifest). I match using:

"${checksumTool}" -w -l -r -M "$md5Digest" . > $passTmp

And I also see something more like the failure that you're seeing using:

"${checksumTool}" -w -r -M "$md5Digest" . > $passTmp

You can see the results in a simple transfer here:

^^^
-- Current directory listing:

file-one.a
file-two.b

^^^
-- Provided MD5 digest:

ebe44f549cb0406915a9239986f2c72f  ./file-one.a
ebe44f549cb0406915a9239986f2c72f  ./file-two.b

^^^^^

-- relative exit code for -m flag
0

-- passTmp output

ebe44f549cb0406915a9239986f2c72f  ./file-one.a matched ./file-one.a
ebe44f549cb0406915a9239986f2c72f  ./file-two.b matched ./file-two.b

-- relative exit code for -x flag
0

-- failTmp output

-- non-relative exit code for -m flag
1

-- passTmp output

ebe44f549cb0406915a9239986f2c72f  /var/archivematica/sharedDirectory/currentlyProcessing/md5-example-b7900671-fbae-4287-a2db-4ceed3afe57b/objects/file-two.b matched ./file-one.a
ebe44f549cb0406915a9239986f2c72f  /var/archivematica/sharedDirectory/currentlyProcessing/md5-example-b7900671-fbae-4287-a2db-4ceed3afe57b/objects/file-one.a matched ./file-one.a

---

-- non-relative exit code for -x flag
1

-- failTmp output

---
File Does not exist: /var/archivematica/sharedDirectory/currentlyProcessing/md5-example-b7900671-fbae-4287-a2db-4ceed3afe57b/metadata/checksum.sha1
File Does not exist: /var/archivematica/sharedDirectory/currentlyProcessing/md5-example-b7900671-fbae-4287-a2db-4ceed3afe57b/metadata/checksum.sha256

And a more complex transfer here:

^^^
-- Current directory listing:

beihai.tif
bird.mp3
ocr-image.png
piiTestDataCreditCardNumbers.txt
piiTestDataSocialSecurityNumbers.txt
View_from_lookout_over_Queenstown_towards_the_Remarkables_in_spring.jpg

^^^
-- Provided MD5 digest:

2121dca88ad7f701d3f3e2d041004a56  ./beihai.tif
2121dca88ad7f701d3f3e2d041004a56  ./bird.mp3
6dc1519418859ea5c20fd708e89d7254  ./ocr-image.png
75388a532283b988f79206d63f65e9a2  ./piiTestDataCreditCardNumbers.txt
1d7193ea3b2193c79f55ea7e645503a9  ./piiTestDataSocialSecurityNumbers.txt
4737e4dacfc9510915ea58cf12e51712  ./View_from_lookout_over_Queenstown_towards_the_Remarkables_in_spring.jpg

^^^^^

-- relative exit code for -m flag
0

-- passTmp output

1d7193ea3b2193c79f55ea7e645503a9  ./piiTestDataSocialSecurityNumbers.txt matched ./piiTestDataSocialSecurityNumbers.txt
75388a532283b988f79206d63f65e9a2  ./piiTestDataCreditCardNumbers.txt matched ./piiTestDataCreditCardNumbers.txt
6dc1519418859ea5c20fd708e89d7254  ./ocr-image.png matched ./ocr-image.png
4737e4dacfc9510915ea58cf12e51712  ./View_from_lookout_over_Queenstown_towards_the_Remarkables_in_spring.jpg matched ./View_from_lookout_over_Queenstown_towards_the_Remarkables_in_spring.jpg
2121dca88ad7f701d3f3e2d041004a56  ./bird.mp3 matched ./bird.mp3
2121dca88ad7f701d3f3e2d041004a56  ./beihai.tif matched ./beihai.tif

-- relative exit code for -x flag
0

-- failTmp output

-- non-relative exit code for -m flag
1

-- passTmp output

75388a532283b988f79206d63f65e9a2  /var/archivematica/sharedDirectory/currentlyProcessing/dupe-example-5f8a8094-9134-4333-a625-519bacf594d1/objects/piiTestDataCreditCardNumbers.txt matched ./piiTestDataCreditCardNumbers.txt
1d7193ea3b2193c79f55ea7e645503a9  /var/archivematica/sharedDirectory/currentlyProcessing/dupe-example-5f8a8094-9134-4333-a625-519bacf594d1/objects/piiTestDataSocialSecurityNumbers.txt matched ./piiTestDataSocialSecurityNumbers.txt
6dc1519418859ea5c20fd708e89d7254  /var/archivematica/sharedDirectory/currentlyProcessing/dupe-example-5f8a8094-9134-4333-a625-519bacf594d1/objects/ocr-image.png matched ./ocr-image.png
4737e4dacfc9510915ea58cf12e51712  /var/archivematica/sharedDirectory/currentlyProcessing/dupe-example-5f8a8094-9134-4333-a625-519bacf594d1/objects/View_from_lookout_over_Queenstown_towards_the_Remarkables_in_spring.jpg matched ./View_from_lookout_over_Queenstown_towards_the_Remarkables_in_spring.jpg
2121dca88ad7f701d3f3e2d041004a56  /var/archivematica/sharedDirectory/currentlyProcessing/dupe-example-5f8a8094-9134-4333-a625-519bacf594d1/objects/beihai.tif matched ./beihai.tif
2121dca88ad7f701d3f3e2d041004a56  /var/archivematica/sharedDirectory/currentlyProcessing/dupe-example-5f8a8094-9134-4333-a625-519bacf594d1/objects/bird.mp3 matched ./beihai.tif

---

-- non-relative exit code for -x flag
1

-- failTmp output

---
File Does not exist: /var/archivematica/sharedDirectory/currentlyProcessing/dupe-example-5f8a8094-9134-4333-a625-519bacf594d1/metadata/checksum.sha1
File Does not exist: /var/archivematica/sharedDirectory/currentlyProcessing/dupe-example-5f8a8094-9134-4333-a625-519bacf594d1/metadata/checksum.sha256

The -w flag is nice as it shows what matched with what. E.g.

ebe44f549cb0406915a9239986f2c72f  ./file-one.a matched ./file-one.a
ebe44f549cb0406915a9239986f2c72f  ./file-two.b matched ./file-two.b

There doesn't seem to be a sensible workaround for users here. It looks like you are already using -l to create a manifest. The issue itself is confirmed with the three algorithms (md5, sha1, sha256). I can't see other issues with the mechanism.

NB. With the two calls to "${checksumTool} ..." I wonder if this is redundancy we don't really need in a script performing this function?

andrewjbtw commented 5 years ago

The behavior of md5deep when checking hashes has always seemed a bit puzzling to me, but I've never investigated it in depth until now. It looks to me that the big issue here is that md5deep isn't verifying hashes from the checksum manifest, it's just looking for matches between the files it's being run on and the lists of hashes it's being fed as input.

That's a bit of a fine distinction but what I mean by that is the following:

Verifying hashes from a manifest: given a manifest containing a list of hashes and paths to files, check that the file at each path has the same hash as the one given on the same line in the manifest.

Looking for matches: given lists of hashes, check a set of files to see which files match which hashes.
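The distinction can be sketched with two small functions built on GNU `md5sum`. Both function names are hypothetical, and `match_hashes` is only an approximation of the "looking for matches" idea described here, not md5deep's actual implementation (it does not track unused hashes, for instance). The demo swaps the contents of two files so that every hash is still present, but at the wrong path.

```shell
# Verify: each manifest line names a path, and that path must have that hash.
# $1 = directory the manifest paths are relative to, $2 = absolute manifest path
verify_manifest() {
    (cd "$1" && md5sum -c --quiet "$2")
}

# Match: ignore manifest paths; pass if every file's hash appears anywhere
# in the manifest's hash column.
# $1 = directory to scan, $2 = manifest path
match_hashes() {
    find "$1" -type f -exec md5sum {} + | cut -c 1-32 | while read -r hash; do
        cut -c 1-32 "$2" | grep -qxF "$hash" || exit 1
    done
}

demo=$(mktemp -d)
mkdir "$demo/objects"
echo a > "$demo/objects/a.txt"
echo b > "$demo/objects/b.txt"
(cd "$demo" && md5sum objects/a.txt objects/b.txt > checksum.md5)

# Swap the two files' contents.
mv "$demo/objects/a.txt" "$demo/objects/swap.tmp"
mv "$demo/objects/b.txt" "$demo/objects/a.txt"
mv "$demo/objects/swap.tmp" "$demo/objects/b.txt"

verify_manifest "$demo" "$demo/checksum.md5" > /dev/null 2>&1
verify_status=$?
match_hashes "$demo/objects" "$demo/checksum.md5"
match_status=$?
echo "verify: $verify_status, match: $match_status"
```

Verification fails because neither path holds the content its manifest line promises, while hash matching still succeeds because both hashes exist somewhere in the folder.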

What's confusing is that md5deep's matching mode often looks like it's verifying hashes from a manifest, when what I think it's really doing is simply identifying matches for given input hashes. Sometimes this results in what appear to be successful tests, but other times it leads to outputs like the ones identified in this issue.

Consider the following examples:

Example 1: duplicate file exists in folder but is not listed on checksum.md5 manifest.

Create a folder and populate it with some files. Then create a manifest.


$ mkdir objects
$ for i in {a..d} ; do echo "$i" > objects/"$i".txt ; done
$ md5deep -rl objects/ > checksum.md5

Your checksum manifest should look like this now:


$ cat checksum.md5
60b725f10c9c85c70d97880dfe8191b3  objects/a.txt
3b5d5c3712955042212316173ccf37be  objects/b.txt
2cd6ee2c70b0bde53fbe6cac3c8b8bb1  objects/c.txt
e29311f6f1bf1af907f9ef9f44b8328b  objects/d.txt

Now create a duplicate file within the objects directory, but don't put it on the manifest.


$ cp objects/a.txt objects/not-on-manifest-duplicate.txt

Now run md5deep in matching mode and check the exit code.


$ md5deep -w -l -r -M checksum.md5 objects
3b5d5c3712955042212316173ccf37be  objects/b.txt matched objects/b.txt
60b725f10c9c85c70d97880dfe8191b3  objects/a.txt matched objects/a.txt
2cd6ee2c70b0bde53fbe6cac3c8b8bb1  objects/c.txt matched objects/c.txt
e29311f6f1bf1af907f9ef9f44b8328b  objects/d.txt matched objects/d.txt
60b725f10c9c85c70d97880dfe8191b3  objects/not-on-manifest-duplicate.txt matched objects/a.txt
$ echo $?
0

Despite one file not being on the manifest, every file has a match in the list of hashes and the exit code is 0. The "not-on-manifest-duplicate.txt" file matches the hash for "a.txt", so md5deep considers the result ok. There's no indication that the objects folder contains a file not on the manifest.

What about negative matching mode?


$ md5deep -w -l -r -X checksum.md5 objects
$ echo $?
0

No files or hashes are lacking matches and the exit code is again 0.

This result suggests that there are situations where a transfer with duplicate files not on the checksum manifest would still pass the verify checksums microservice. I've run packages resembling this structure through Archivematica 1.8 and they've passed.
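One way to surface such files is to compare the paths on disk with the paths in the manifest. The sketch below is an assumption-laden helper, not something Archivematica does: `list_unlisted` is a hypothetical name, and it assumes manifest lines have the form of a 32-character md5, two spaces, then a relative path written the same way `find` prints it (no leading ./).

```shell
# Print files under the payload directory that the manifest does not mention.
# $1 = transfer directory, $2 = absolute manifest path, $3 = payload dir name
list_unlisted() {
    on_disk=$(mktemp)
    on_manifest=$(mktemp)
    (cd "$1" && find "$3" -type f | sort) > "$on_disk"
    # Column 35 onward is the path in "<md5>  <path>" lines.
    cut -c 35- "$2" | sort > "$on_manifest"
    # comm -23 keeps lines that appear only in the first file.
    comm -23 "$on_disk" "$on_manifest"
    rm -f "$on_disk" "$on_manifest"
}

# Rebuild Example 1: four listed files plus one unlisted duplicate.
demo=$(mktemp -d)
mkdir "$demo/objects"
for name in a b c d; do echo "$name" > "$demo/objects/$name.txt"; done
(cd "$demo" && md5sum objects/*.txt > checksum.md5)
cp "$demo/objects/a.txt" "$demo/objects/not-on-manifest-duplicate.txt"

unlisted=$(list_unlisted "$demo" "$demo/checksum.md5" objects)
echo "$unlisted"
```

For this layout the helper prints objects/not-on-manifest-duplicate.txt, which is exactly the file that matching mode let through.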

Example 2: duplicate file exists in folder and is listed on checksum.md5 manifest using relative paths (the actual scenario in this issue)

Take the same folder and rename the "not-on-manifest-duplicate.txt" file to "on-manifest-duplicate.txt", then re-create the manifest:


$ mv objects/not-on-manifest-duplicate.txt objects/on-manifest-duplicate.txt
$ md5deep -rl objects/ > checksum.md5
$ cat checksum.md5
60b725f10c9c85c70d97880dfe8191b3  objects/a.txt
3b5d5c3712955042212316173ccf37be  objects/b.txt
2cd6ee2c70b0bde53fbe6cac3c8b8bb1  objects/c.txt
e29311f6f1bf1af907f9ef9f44b8328b  objects/d.txt
60b725f10c9c85c70d97880dfe8191b3  objects/on-manifest-duplicate.txt

Now run md5deep in matching mode using relative paths and check the exit code:


$ md5deep -w -l -r -M checksum.md5 objects
3b5d5c3712955042212316173ccf37be  objects/b.txt matched objects/b.txt
60b725f10c9c85c70d97880dfe8191b3  objects/a.txt matched objects/a.txt
e29311f6f1bf1af907f9ef9f44b8328b  objects/d.txt matched objects/d.txt
60b725f10c9c85c70d97880dfe8191b3  objects/on-manifest-duplicate.txt matched objects/on-manifest-duplicate.txt
2cd6ee2c70b0bde53fbe6cac3c8b8bb1  objects/c.txt matched objects/c.txt
$ echo $?
0

Check negative matching mode.


$ md5deep -w -l -r -X checksum.md5 objects
$ echo $?
0

In this case everything looks good. Each line in checksum.md5 is matched to its own file in objects. This is back to the puzzle we started with: why does md5deep match every file (including duplicates) and exit with 0 sometimes, but match every file and exit with 1 at other times? I think examples 3 and 4 will help explain it.

Example 3: duplicate file exists in folder and is listed on checksum.md5 manifest using absolute paths

Regenerate checksum.md5, but use absolute paths.


$ md5deep -r objects/ > checksum.md5
$ cat checksum.md5
60b725f10c9c85c70d97880dfe8191b3  /temp/hashtest/objects/a.txt
3b5d5c3712955042212316173ccf37be  /temp/hashtest/objects/b.txt
2cd6ee2c70b0bde53fbe6cac3c8b8bb1  /temp/hashtest/objects/c.txt
e29311f6f1bf1af907f9ef9f44b8328b  /temp/hashtest/objects/d.txt
60b725f10c9c85c70d97880dfe8191b3  /temp/hashtest/objects/on-manifest-duplicate.txt

Now run matching mode again and check the exit code.


$ md5deep -w -l -r -M checksum.md5 objects
60b725f10c9c85c70d97880dfe8191b3  objects/a.txt matched /temp/hashtest/objects/a.txt
3b5d5c3712955042212316173ccf37be  objects/b.txt matched /temp/hashtest/objects/b.txt
e29311f6f1bf1af907f9ef9f44b8328b  objects/d.txt matched /temp/hashtest/objects/d.txt
60b725f10c9c85c70d97880dfe8191b3  objects/on-manifest-duplicate.txt matched /temp/hashtest/objects/a.txt
2cd6ee2c70b0bde53fbe6cac3c8b8bb1  objects/c.txt matched /temp/hashtest/objects/c.txt
$ echo $?
1

md5deep found a match for every file but the exit code is 1. Why? In the md5deep man page, an exit code of 1 is defined as:

Unused hashes. Under any of the matching modes, returns this value if one or more of the known hashes was not matched by any of the input files.

Looking at the output above, both "a.txt" and "on-manifest-duplicate.txt" are listed as having matched the hash for "a.txt". This leaves the duplicate hash for "on-manifest-duplicate.txt" unmatched. I believe that's why the exit code is 1. I think this also explains what's happening in the other examples above where every file has a match but the exit code is 1. Note that in Ross's example for absolute paths, there are two matches for "beihai.tif" and none for "bird.mp3".
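Since the exit code of 1 seems tied to repeated hashes, a quick diagnostic is to list any hash that occurs more than once in the manifest. This is a minimal sketch with a hypothetical helper name; it assumes the hash is the first 32 characters of each line, as in the md5 manifests shown in this thread.

```shell
# Print each hash that appears more than once in the manifest.
# $1 = manifest path
duplicate_hashes() {
    cut -c 1-32 "$1" | sort | uniq -d
}

# Manifest copied from Example 2 above.
manifest=$(mktemp)
cat > "$manifest" <<'EOF'
60b725f10c9c85c70d97880dfe8191b3  objects/a.txt
3b5d5c3712955042212316173ccf37be  objects/b.txt
2cd6ee2c70b0bde53fbe6cac3c8b8bb1  objects/c.txt
e29311f6f1bf1af907f9ef9f44b8328b  objects/d.txt
60b725f10c9c85c70d97880dfe8191b3  objects/on-manifest-duplicate.txt
EOF

dupes=$(duplicate_hashes "$manifest")
echo "$dupes"
```

Here it prints 60b725f10c9c85c70d97880dfe8191b3, the hash shared by a.txt and on-manifest-duplicate.txt. An empty result would suggest that duplicate hashes are not the cause of a non-zero exit.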

But why doesn't md5deep match the files one-by-one down the list instead of matching more than one file to the same given hash? I haven't examined the underlying code, but it looks like md5deep's order of processing is highly sensitive to both the format of the input list and whether it's set to use relative or absolute paths when looking for matches.

If you re-run matching mode with absolute paths (i.e. without the '-l' option), you get a different result:


$ md5deep -w -r -M checksum.md5 objects
60b725f10c9c85c70d97880dfe8191b3  /temp/hashtest/objects/a.txt matched /temp/hashtest/objects/a.txt
3b5d5c3712955042212316173ccf37be  /temp/hashtest/objects/b.txt matched /temp/hashtest/objects/b.txt
2cd6ee2c70b0bde53fbe6cac3c8b8bb1  /temp/hashtest/objects/c.txt matched /temp/hashtest/objects/c.txt
e29311f6f1bf1af907f9ef9f44b8328b  /temp/hashtest/objects/d.txt matched /temp/hashtest/objects/d.txt
60b725f10c9c85c70d97880dfe8191b3  /temp/hashtest/objects/on-manifest-duplicate.txt matched /temp/hashtest/objects/on-manifest-duplicate.txt
$ echo $?
0

In this case, there's a one-to-one correspondence between hashes and the files they match. So all hashes in the list are "consumed" in the matching process and the exit code is 0.

Negative matching mode with absolute paths:

$ md5deep -w -r -X checksum.md5 objects
$ echo $?
0

This test passes too.

So it looks like if you use absolute paths in matching mode with an absolute path manifest, you can engineer a pass. Same with relative paths in matching mode and a relative path manifest, as seen above in Example 2 and in Ross's relative path example above. In Archivematica, you pretty much have to use relative paths and relative matching to get a pass, as it's very unlikely that you'll be ingesting files that start with an absolute path of "/var/archivematica/".

But if the manifest paths don't match up with the paths on Archivematica, whether it's a mismatch between relative and absolute paths, or perhaps because the manifest paths don't actually exist (more on this below), then you could end up with an outcome where md5deep matches the same hash to multiple files and then leaves a duplicate of that hash unmatched, leading to the exit code of 1.

Again, this suggests that md5deep's primary function in matching modes is to match hashes, not to match specific file paths to specific hashes. I've found that by varying just the order of the lines in checksum.md5, I can get md5deep to report a match on one or the other of duplicate files, in cases where one duplicate is matched and the other is not. Putting the "on-manifest-duplicate.txt" line first in checksum.md5 results in it being the file matched twice, not "a.txt", if I re-run the test where I mix relative and absolute path settings.

I don't think this would happen if md5deep were actually parsing the paths in checksum.md5 before checking file hashes.

Example 4: checksum.md5 lists only hashes, not hashes and file paths

An extreme test case is a checksum.md5 that lists only hashes without file paths.

Take a checksum.md5 from an earlier example with duplicate hashes and remove the filepaths, leaving each line with just an md5 hash plus a trailing space (md5deep complains if you don't leave a space).


$ cut -c -33 checksum.md5 > tmp # cut the first 32 characters for the checksum plus 1 more for the space
$ mv tmp checksum.md5
$ cat checksum.md5
60b725f10c9c85c70d97880dfe8191b3 
3b5d5c3712955042212316173ccf37be 
2cd6ee2c70b0bde53fbe6cac3c8b8bb1 
e29311f6f1bf1af907f9ef9f44b8328b 
60b725f10c9c85c70d97880dfe8191b3

Run md5deep in matching mode and check the exit code.


$ md5deep -w -r -l -M checksum.md5 objects
60b725f10c9c85c70d97880dfe8191b3  objects/a.txt matched
3b5d5c3712955042212316173ccf37be  objects/b.txt matched
2cd6ee2c70b0bde53fbe6cac3c8b8bb1  objects/c.txt matched
e29311f6f1bf1af907f9ef9f44b8328b  objects/d.txt matched
60b725f10c9c85c70d97880dfe8191b3  objects/on-manifest-duplicate.txt matched
$ echo $?
1

Each file has a match, but the exit code is 1. I think this is because one of the duplicate hashes is matched twice, while the other is left unmatched. It's difficult to tell when checksum.md5 has only hashes.

Negative matching mode shows there aren't any unmatched files, but the exit code is still 1.


$ md5deep -w -r -X checksum.md5 objects
$ echo $?
1

Again, I think this is the result of the duplicate hash not being matched.

Finally, take out the duplicate hash from checksum.md5 and run the tests again.


$ cat checksum.md5
60b725f10c9c85c70d97880dfe8191b3 
3b5d5c3712955042212316173ccf37be 
2cd6ee2c70b0bde53fbe6cac3c8b8bb1 
e29311f6f1bf1af907f9ef9f44b8328b 
$ md5deep -w -r -l -M checksum.md5 objects
60b725f10c9c85c70d97880dfe8191b3  objects/a.txt matched 
3b5d5c3712955042212316173ccf37be  objects/b.txt matched 
2cd6ee2c70b0bde53fbe6cac3c8b8bb1  objects/c.txt matched 
e29311f6f1bf1af907f9ef9f44b8328b  objects/d.txt matched 
60b725f10c9c85c70d97880dfe8191b3  objects/on-manifest-duplicate.txt matched 
$ echo $?
0
$ md5deep -w -r -l -X checksum.md5 objects
$ echo $?
0

In this last case, with only unique hashes and no file paths, all of the tests pass.

In the end, it looks like file paths are not required by md5deep for it to result in an exit code of 0 in either of the matching modes. This suggests that md5deep is not really verifying checksum.md5 as a manifest at all, and that it may not be the right tool for this job.

It's true that if you make a checksum.md5 that uses relative paths and run md5deep with relative paths in matching mode, the tests will probably pass. But once you introduce duplicate hashes the behavior of md5deep becomes less predictable, and there are scenarios where it won't catch differences between what's listed in the manifest and the arrangement of the files that exist on disk. In example 1, it didn't catch a duplicate file not listed on the checksum.md5 manifest, for instance. There are other more complex arrangements of duplicates that won't be caught, but this comment is long enough that I won't get into them.

My suggestion is that if Archivematica is going to verify transfer checksums but not require use of the BagIt standard, then it is safer to use 'md5sum -c checksum.md5', as suggested by Matthew Addis above. That command will check each line in the checksum.md5 file and make sure each hash matches at each path. However, it will not detect files that aren't on the manifest, so you'd need a second check to make sure that (number of files) = (number of lines in checksum.md5).

Something along the lines of:

Step 1: count the files with 'find'


$ countfiles=$(find objects/ -type f | wc -l)

Step 2: count the number of lines in checksum.md5


$ checksum_lines=$(wc -l < checksum.md5)

Step 3: check whether "$countfiles" equals "$checksum_lines"


$ if [ "$countfiles" -eq "$checksum_lines" ] ; then ... ; else ... ; fi

Step 4: if the counts match, check the manifest


$ md5sum -c checksum.md5
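The four steps above can be sketched as one function. This is an illustration only: `check_transfer` and its messages are hypothetical, it assumes checksum.md5 sits next to objects/ as in the examples in this comment (Archivematica keeps it in the metadata directory), and it relies on GNU `md5sum`.

```shell
# Compare file and manifest line counts, then verify the manifest.
# The body runs in a subshell so the caller's working directory is unchanged.
# $1 = directory containing objects/ and checksum.md5
check_transfer() (
    cd "$1" || exit 2
    countfiles=$(find objects/ -type f | wc -l)
    checksum_lines=$(wc -l < checksum.md5)
    if [ "$countfiles" -ne "$checksum_lines" ]; then
        echo "count mismatch: $countfiles files on disk, $checksum_lines manifest lines" >&2
        exit 1
    fi
    md5sum -c --quiet checksum.md5
)

demo=$(mktemp -d)
mkdir "$demo/objects"
for name in a b c d; do echo "$name" > "$demo/objects/$name.txt"; done
(cd "$demo" && md5sum objects/*.txt > checksum.md5)

check_transfer "$demo"
clean_status=$?

# Add a duplicate that is not on the manifest, as in Example 1.
cp "$demo/objects/a.txt" "$demo/objects/not-on-manifest-duplicate.txt"
check_transfer "$demo" 2> /dev/null
extra_status=$?

echo "clean: $clean_status, with extra file: $extra_status"
```

The clean transfer passes and the one with the unlisted duplicate is rejected by the count check. One limitation of counting with `wc -l` is that a manifest without a trailing newline is undercounted by one.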

However, there may be other failure cases to check for that I haven't thought of that can still get by these tests.

Side note: I think the behavior of md5deep is one of the cases where the needs of digital forensic investigators do not match the needs of digital archivists using digital forensics tools.

For digital archivists, the main use case for lists of files and their hashes is likely to be: "I want to know whether the files on this manifest have changed since someone first computed their hashes."

For a forensic investigator using md5deep, the use case seems to be: "I want to know which files in this new piece of evidence match files I've seen before ("known hashes") and which do not, so I know where to focus my investigative efforts."

Under the latter scenario, the investigator is likely to be using lists of unique hashes, as it's only necessary to list a hash once if you're just looking for files that match it. This ultimately may be why duplicate hashes in checksum.md5 can cause problems. That usage may not have been considered in scope when designing the tool.

sallain commented 5 years ago

Just a note to say that a user came across this issue independent of this conversation (funny how that always happens!) and one concerning factor is that Archivematica provides absolutely no indication in the UI that this is what's causing the issue. Not sure if there's something in the client scripts that could be passed to the stderr/stdout to improve this so that users have a hope of diagnosing the issue.
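As one possible shape for such a message, a client script could translate the md5deep exit status into text on stderr. The sketch below is hypothetical: the function name and wording are invented, the meaning of bit 1 is the man page text quoted earlier in this thread, and the meanings of bits 2, 64 and 128 are an assumption based on the same man page that should be checked against the installed version.

```shell
# Write a human-readable reason for a matching-mode exit status to stderr.
# $1 = exit status returned by md5deep
explain_md5deep_status() {
    code=$1
    if [ "$code" -eq 0 ]; then
        return 0
    fi
    # The status is treated as a set of bit flags.
    if [ $((code & 1)) -ne 0 ]; then
        echo "md5deep: unused hashes (a manifest hash matched no file; check for duplicate checksums)" >&2
    fi
    if [ $((code & 2)) -ne 0 ]; then
        echo "md5deep: unmatched inputs (a file matched no manifest hash)" >&2
    fi
    if [ $((code & 64)) -ne 0 ]; then
        echo "md5deep: user error" >&2
    fi
    if [ $((code & 128)) -ne 0 ]; then
        echo "md5deep: internal error" >&2
    fi
    return 1
}

# Example: the status seen in this issue.
explain_md5deep_status 1
```

Whether and how this text would reach the Archivematica UI depends on how the workflow surfaces client-script stderr, which this sketch does not address.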

mjaddis commented 5 years ago

Great discussion of the issues. In terms of checking a manifest against the payload of a transfer, 'md5sum -c' does the job very well and could replace md5deep in my view. I think the question is what to do about files that aren't in the manifest. Should the transfer fail if there are files not in the manifest, should the extra files be ignored, or should the extra files be physically deleted?

I would be happy with the following:

If I provide a checksum.md5 in a Transfer then I want each and every entry in the manifest to be checked against the payload of the Transfer so I know that my specified list of files are all present and correct (name, path, checksum). If any entries in the manifest fail to be verified then I want the whole Transfer to fail.
If there are files in the Transfer that are not listed in the checksum.md5 then I either want them to be ignored or to be deleted. They are not relevant to my Transfer. However, I don't want the Transfer to fail.

The behaviour above is slightly different from the current design, in which a Transfer fails if it contains files that don't match the manifest. Personally, I'd be happy if this was relaxed a little. It's all too easy for users to pollute a Transfer with .DS_Store, Thumbs.db and other junk files, e.g. simply by using Finder or Explorer to 'look inside' a Transfer before copying it to Archivematica. I can't remember what Archivematica's policy is for handling junk files, e.g. removing them, but if there are files like this in a Transfer and they are not listed in a checksum.md5 (or maybe they are listed but have changed), then I'd prefer them to be ignored rather than cause a Transfer to fail.
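If extra files are to be tolerated, any file count used in a check would need to skip such debris. A minimal sketch, assuming the two file names mentioned above are the only ones to exclude; the function name and the exclusion list are illustrative and do not reflect an Archivematica policy.

```shell
# Count payload files, skipping common file-browser debris.
# $1 = payload directory
count_payload_files() {
    find "$1" -type f ! -name '.DS_Store' ! -name 'Thumbs.db' | wc -l
}

demo=$(mktemp -d)
mkdir "$demo/objects"
echo "payload" > "$demo/objects/a.txt"
: > "$demo/objects/.DS_Store"
: > "$demo/objects/Thumbs.db"

payload_count=$(count_payload_files "$demo/objects")
echo "payload files: $payload_count"
```

Only a.txt is counted here. Entries in the manifest would still be verified in full, so a listed file that is missing or altered would continue to fail the Transfer.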

ross-spencer commented 5 years ago

Such a great analysis @andrewjbtw, thank you!

Speaking as independently as possible, and more in response to @mjaddis, I'd like to understand how best to handle the 'files not on the manifest we're checking against' use case. I'd love to know both what was and wasn't there, and have control over deletion or not, and possibly failure or not too, because at some point in the workflow everything gets taken over by BagIt's conventions. Of course, it starts to make the solution more complex, i.e. from a tool/script change to database-workflow migrations... so there's that!

ross-spencer commented 5 years ago

Marking as Severity: High for discussion about inclusion in 1.9.

ross-spencer commented 5 years ago

I need to incorporate this with the AM codebase, but here is a gist beginning to approximate the way this might work as a shell script: https://gist.github.com/ross-spencer/4bd776a221a26c71ed0d9ee96bc12a34

Example output:

Comparing transfer checksums with the sha256 file
Comparison failed with 3 checksum lines and 4 transfer files
Comparing transfer checksums with the md5 file
transfer/objects/data_four.txt: FAILED
transfer/objects/data_two.txt: FAILED
Nothing to do for sha512: File 'checksum.sha512' not provided 
Comparing transfer checksums with the sha1 file
sha1sum: WARNING: 1 line is improperly formatted
Exiting with code: 3

ross-spencer commented 5 years ago

archivematica / Issues

Problem: md5deep checksum validation for multiple files with same checksum #346