APTrust / dart

Create bags based on BagIt profiles and send them off into the ether (EasyStore is now DART)
BSD 2-Clause "Simplified" License

False validation error: invalid checksum #468

Open diamondap opened 3 years ago

diamondap commented 3 years ago

From PTSEM:

I recently uploaded a few dozen objects to our production repo using DART at the command line. Most completed successfully, but for one of them, DART returned an error message:

error: validate/completed - Operation completed with errors. Bad md5 digest for 'data/0056.mets.xml': manifest says 'c6daf78cdbb8129c59b0c672', file digest is 'f89314b6c6daf78cdbb8129c59b0c672'.

The odd thing about this is that the manifest doesn't actually say "c6da..." at all. It has "f893...":

f89314b6c6daf78cdbb8129c59b0c672 data/0056.mets.xml

I expanded the tar file and ran md5 on that file and got this:

MD5 (.dart/bags/ptsem.edu.theocom.0056/data/0056.mets.xml) = f89314b6c6daf78cdbb8129c59b0c672

I can't tell where "c6da..." is coming from. Any ideas?
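The manual check described above (expand the bag, hash the file, compare with the manifest) can be scripted so that every payload file is covered at once. The following is a minimal sketch that is independent of DART; the function names are hypothetical, and it assumes the bag has already been untarred into a directory containing manifest-md5.txt.

```python
import hashlib
import os


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_manifest(text):
    """Map each payload path to its digest (one 'digest path' entry per line)."""
    entries = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2:
            entries[parts[1].strip()] = parts[0]
    return entries


def find_mismatches(bag_dir, manifest_name="manifest-md5.txt"):
    """Return (path, manifest_digest, file_digest) for each file that disagrees."""
    with open(os.path.join(bag_dir, manifest_name), encoding="utf-8") as f:
        entries = parse_manifest(f.read())
    bad = []
    for rel_path, expected in entries.items():
        actual = md5_of_file(os.path.join(bag_dir, *rel_path.split("/")))
        if actual != expected.lower():
            bad.append((rel_path, expected, actual))
    return bad
```

An empty list from find_mismatches on the expanded bag would confirm that the manifest and the payload agree on disk, which points at the reader rather than the bag.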

diamondap commented 3 years ago

Looks like it's reading the manifest incorrectly, skipping over the first few bytes. I wonder if this is a bug in the tar stream library.

The actual file md5 and the manifest md5 match. Both are "f89314b6c6daf78cdbb8129c59b0c672".

However, the error message reports the manifest md5 as "c6daf78cdbb8129c59b0c672", which omits the first 8 characters (the first 8 bytes of the manifest file).

f89314b6c6daf78cdbb8129c59b0c672
--------c6daf78cdbb8129c59b0c672
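The alignment above can be confirmed mechanically. This small sketch uses the two digests from the error message; the helper name is hypothetical. It reports how many leading characters are missing when the reported digest is a suffix of the real one.

```python
MANIFEST_DIGEST = "f89314b6c6daf78cdbb8129c59b0c672"
REPORTED_DIGEST = "c6daf78cdbb8129c59b0c672"


def dropped_prefix_length(full, reported):
    """Return the number of leading characters dropped, or None if not a suffix."""
    if full.endswith(reported):
        return len(full) - len(reported)
    return None


print(dropped_prefix_length(MANIFEST_DIGEST, REPORTED_DIGEST))
```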

The bag in question has the following payload, amounting to 946 MB:

f89314b6c6daf78cdbb8129c59b0c672 data/0056.mets.xml
655211972a23f666b8533bbadab3c311 data/0056.mods.xml
3612e74091b8566cde10793744577a24 data/0056_archival_master_a.wav
e23d1aa6b4a1a1ede702fb7d6af1daec data/0056_access.mp3
393c16e6ed6bad4b92ed90ef8eb8bf3c data/0056.xml
dfcd367d4c3ab5c77937322fc9be7d0b data/0056_archival_master_AssetFront.JPG

The md5 manifest is added at the end of the bagging process, which means it's preceded in the tar archive by the files in this list. Is there something off in the tar headers that makes the reader start the md5 manifest at byte 8 instead of byte zero?
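One way to tell a malformed tarball from a reader bug is to pull the manifest out of the same archive with a different tar implementation and inspect its first bytes. The sketch below uses Python's standard tarfile module rather than the library DART uses; the function name is hypothetical. If the bytes it returns begin with the full digest, the archive itself is likely fine.

```python
import posixpath
import tarfile


def read_manifest_from_tar(fileobj, manifest_name="manifest-md5.txt"):
    """Return the raw bytes of the named manifest inside a tar, or None if absent."""
    with tarfile.open(fileobj=fileobj, mode="r:*") as tar:
        for member in tar:
            # Compare the base name so tagmanifest-md5.txt is not matched.
            if member.isfile() and posixpath.basename(member.name) == manifest_name:
                return tar.extractfile(member).read()
    return None
```

For a bag on disk, it could be called as read_manifest_from_tar(open(path, "rb")) and the first 32 bytes compared against the expected digest.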

diamondap commented 3 years ago

From Greg at PTSEM:

I tried using DART on a different computer to package and upload the same files. That worked. My laptop has version 2.0.11.1795 whereas my desktop, where the failure occurred, has 2.0.11.1925. That difference may be irrelevant -- just letting you know.

diamondap commented 3 years ago

From Greg:

I've uploaded dozens of objects successfully using DART at the command line, but I encountered a second error message -- which is similar but not identical to the one we corresponded about yesterday:

error: validate/completed - Operation completed with errors. Payload file data/01103.mets.xml not found in manifest-md5.txt

Actually that file is listed on the first line of manifest-md5.txt. So whereas yesterday the problem seemed to be skipping the first several bytes of manifest-md5.txt, today it seems to be skipping the first line. I've attached the manifest-md5.txt file and the log lines pertaining to this object.

If you'd like me to try anything in particular, let me know. Otherwise I'll try it with DART on my laptop computer, as I did yesterday successfully.

manifest-md5.txt

ff9df139371d90c0c28b73a6eda6f78d data/01103.mets.xml
7815e889302b28a354bde2b28b3f4be3 data/01103.mods.xml
6d89ec8b70cec1ed3598de8129d07f5d data/01103_archival_master.wav
5292e30e9aab1064e9eb2d951ab66cc3 data/01103_access.mp3
0ff840182c7cc384fe69e141db788d8d data/01103.xml
e2108bc4f8d5860807eec7d20bd25f15 data/01103_BoxFront.JPG
786433938aa46d775b47177ad8b9d21d data/01103_ReelFront.JPG
dd7dbb7763339f36e6fa3ab03df8bbd6 data/01103.full-text.xml
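Both symptoms are consistent with a reader that begins partway into the manifest. The following sketch only simulates that hypothesis; it is not DART's parsing code, and the function names are hypothetical. It parses the first two lines of the 0056 manifest from a nonzero offset: a small offset yields a truncated digest, while an offset past the first newline makes the first file disappear from the manifest entirely.

```python
MANIFEST = (
    "f89314b6c6daf78cdbb8129c59b0c672 data/0056.mets.xml\n"
    "655211972a23f666b8533bbadab3c311 data/0056.mods.xml\n"
)


def parse_manifest(text):
    """Map each payload path to its digest (one 'digest path' entry per line)."""
    entries = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2:
            entries[parts[1].strip()] = parts[0]
    return entries


def parse_from_offset(text, offset):
    """Simulate a reader that skips the first `offset` characters of the manifest."""
    return parse_manifest(text[offset:])


# Skipping 8 characters truncates the first digest ("Bad md5 digest").
truncated = parse_from_offset(MANIFEST, 8)

# Skipping the whole first line drops the entry ("not found in manifest-md5.txt").
first_line_end = MANIFEST.index("\n") + 1
missing = parse_from_offset(MANIFEST, first_line_end)
```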

diamondap commented 3 years ago

Specs on the two DART versions. Note that the validation fails on the desktop machine with the newer version of Node, but it works on the laptop with the older version.

Desktop - (fails) macOS 10.15.7 DART 2.0.11 with Node.js v12.18.3 for darwin-x64-19.6.0.

Laptop - (succeeds) macOS 10.15.7 DART 2.0.11 with Node.js v12.13.0 for darwin-x64-19.6.0.

diamondap commented 3 years ago

This error occurs in two versions of DART on two different Macs. DART v2.0.11.1795 with Node.js 12.13.0 and DART v2.0.11.1925 with Node.js 12.18.3.

diamondap commented 3 years ago

The bags in which these errors occur do not contain files over 8GB, so this issue is not related to the tar-stream library's occasional corruption of tarballs containing files >8GB.
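The size check can be repeated on any suspect bag. This sketch assumes the 8GB threshold corresponds to the classic ustar size field, which holds 11 octal digits and so tops out just under 8 GiB; the constant and function names are hypothetical, and Python's tarfile module stands in for the library DART uses.

```python
import tarfile

# Largest size an 11-digit octal ustar size field can hold (8 GiB minus 1 byte).
USTAR_MAX_SIZE = 8 ** 11 - 1


def oversized_members(fileobj, limit=USTAR_MAX_SIZE):
    """Return (name, size) for each regular file in the tar larger than `limit` bytes."""
    with tarfile.open(fileobj=fileobj, mode="r:*") as tar:
        return [(m.name, m.size) for m in tar if m.isfile() and m.size > limit]
```

An empty result for the failing bags would match the observation above that no file exceeds the limit.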