eprintsug / EPrintsArchivematica

Digital Preservation through EPrints-Archivematica Integration - An EPrints export plugin to Archivematica

missing checksums in EPrints? #5

Closed photomedia closed 5 years ago

photomedia commented 5 years ago

Q2.1: What would we do for any files that may not have had a checksum generated and stored in the EPrints database? The spec has EPrints generating a checksum for all files in the export.

Q2.2: Which of the files in the export would be in that situation (not having a checksum stored in EPrints)?

goetzk commented 5 years ago

This is a reasonably complex question, as a given EPrints file object could have any combination of no checksum, MD5, or some SHA variant (it's configurable in EPrints).

So EPrints can hold a checksum in an algorithm that Archivematica doesn't support (e.g. sha512).

To an extent it depends on what Archivematica will do if a checksum line is missing, especially if it is missing from only one checksum validation file rather than from several. (https://www.archivematica.org/en/docs/archivematica-1.8/user-manual/transfer/transfer/#create-a-transfer-with-existing-checksums notes that multiple checksum files are supported, but not whether they are supported in the same AIP.) I assume it will fail; that seems the safest assumption.

Assuming it will fail on a missing checksum, the only options we really have are skipping validation when checksums are missing, refusing to export an AIP, or generating checksums for anything missing one.

I've just opened #7 for the related issue of checksums existing in EPrints but not matching the files on disk.

goetzk commented 5 years ago

Because I didn't actually put my thought in: I believe we should regenerate all checksums when exporting, which would catch both this and #7. It may not be necessary for recently uploaded files, but for anything older a last check before archiving seems sensible to me.

tw4l commented 5 years ago

We should verify this, but my understanding is that Archivematica expects a single manifest file - either checksum.md5, checksum.sha1, or checksum.sha256. Either way, I think we'd be best off avoiding splitting the manifest across several different files and algorithms.

This seems like a good candidate for applying Postel's law ("Be conservative in what you send; be liberal in what you accept."). Meaning, existing checksums of whichever algorithm should be used for verification, but the script should regenerate checksums when they are not available in the preferred algorithm for the manifest file. However, when a hash value using the right algorithm already exists, we should reuse that value in the manifest that gets passed to Archivematica rather than discarding and regenerating the checksum, to reduce the chance that something happens to the file before the new checksum values are generated.
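To make the rule above concrete, here is a rough sketch in Python (the plugin itself would be written in Perl inside EPrints, so this is illustration only). The function names, the `stored` dictionary shape, and the choice of `md5` as the preferred algorithm are all assumptions made for the example, not part of EPrints or Archivematica.

```python
import hashlib
from pathlib import Path
from typing import Dict

# Assumption: the manifest algorithm is chosen once for the whole export.
PREFERRED = "md5"


def compute_hash(path: Path, algorithm: str) -> str:
    """Hash a file in chunks so large files are not loaded into memory."""
    digest = hashlib.new(algorithm)
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def manifest_checksum(path: Path, stored: Dict[str, str],
                      preferred: str = PREFERRED) -> str:
    """Return the value to write to the manifest for one file.

    `stored` is a hypothetical mapping of algorithm name to the hex digest
    recorded in the repository, e.g. {"sha512": "..."}; it may be empty.
    """
    # Liberal in what we accept: verify against whatever algorithm is stored.
    for algorithm, expected in stored.items():
        if compute_hash(path, algorithm) != expected.lower():
            raise ValueError(f"{algorithm} mismatch for {path}")

    # Conservative in what we send: reuse a stored value in the preferred
    # algorithm, and only generate a new one when none exists.
    if preferred in stored:
        return stored[preferred].lower()
    return compute_hash(path, preferred)
```

A mismatch raises rather than silently regenerating, which leaves the decision about corrupted files (see #7) to the caller.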

goetzk commented 5 years ago

I've just done some testing. Key takeaways:

photomedia commented 5 years ago

Since tests reveal that it is reasonable to expect EPrints to have MD5 checksums, we should use MD5 as the preferred algorithm for the manifest.

Q1: When a file is missing a checksum in EPrints, we regenerate an MD5 checksum on export to AIP.

Q2: Which files may not have a checksum? Testing reveals that some files fail to have a checksum in EPrints, and the pattern is not consistent. Some derivative files, such as thumbnails, do not have them. The export plugin should generate MD5 checksums for all files that do not have them.
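As an illustration of that conclusion, a minimal Python sketch of building a single MD5 manifest could look like the following. The names `write_md5_manifest` and `stored_md5` are hypothetical, and the `<hash>  <path>` line layout is an assumption borrowed from `md5sum` output; the exact format Archivematica expects should be checked against its documentation.

```python
import hashlib
from pathlib import Path
from typing import Dict, Optional


def md5_of(path: Path) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_md5_manifest(root: Path, stored_md5: Dict[str, Optional[str]],
                       manifest: Path) -> int:
    """Write one manifest line per file; return how many MD5s were generated.

    `stored_md5` is a hypothetical mapping of relative file path to the MD5
    recorded in the repository, or None when no checksum was stored
    (as seen for some thumbnails).
    """
    generated = 0
    lines = []
    for relative in sorted(stored_md5):
        value = stored_md5[relative]
        if not value:
            # No stored checksum: generate one at export time.
            value = md5_of(root / relative)
            generated += 1
        # Assumed layout: md5sum-style "<hash>  <path>".
        lines.append(f"{value.lower()}  {relative}")
    manifest.write_text("".join(line + "\n" for line in lines), encoding="utf-8")
    return generated
```

Returning the count of generated checksums makes it easy to log how many files in an export had no stored value.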