eprintsug / EPrintsArchivematica

Digital Preservation through EPrints-Archivematica Integration - An EPrints export plugin to Archivematica

MD5 hashes #14

Closed: photomedia closed this issue 3 years ago

photomedia commented 4 years ago

MD5 hashes: So far it looks like this module is generating a new MD5 hash for each file exported. Where possible, we'd prefer to use the existing stored MD5 from the database, and only generate new hashes when there is no recorded value. More information on the logic that should be used is in the README. The idea is to verify that a file's fixity on export is the same as it was when the file was uploaded to the EPrints repository.

photomedia commented 4 years ago

@jb4, thank you for this latest commit - the code in that file is really starting to take shape. However, I cannot close the issue because the specifications do not say to calculate MD5 only if not already in the database. The specs say this:

"For these values already recorded in EPrints database, they should be checked (ie., recalculated for the file and compared to what is stored in EPrints) signalling an error if there is a mismatch. These errors indicate that file corruption may have already taken place. There should be a configuration option to control what happens in case of a checksum mismatch" https://github.com/eprintsug/EPrintsArchivematica/blob/master/README.md

So either way, the MD5 is calculated, it's just that if there is already a checksum in the database, it needs to be checked to confirm if it is still valid. This is an important requirement.
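To make the requirement concrete, here is a minimal sketch of the check in Python. It is not the plugin's actual code; the names `md5_of_file`, `verify_fixity`, `ChecksumMismatch`, and the `raise_on_mismatch` option are hypothetical stand-ins for whatever configuration option the module ends up with. The MD5 is always recalculated; a stored value, when present, is only used for comparison.

```python
import hashlib
from pathlib import Path
from typing import Optional


class ChecksumMismatch(Exception):
    """Hypothetical error: the stored MD5 differs from the one recalculated from disk."""


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Recalculate the MD5 of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_fixity(path: Path, stored_md5: Optional[str],
                  raise_on_mismatch: bool = True) -> str:
    """Always recalculate the MD5; compare it to the stored value if there is one.

    `raise_on_mismatch` is an assumed configuration switch, not a real
    plugin setting.
    """
    calculated = md5_of_file(path)
    if stored_md5 and stored_md5.lower() != calculated:
        if raise_on_mismatch:
            raise ChecksumMismatch(
                f"{path.name}: stored {stored_md5}, calculated {calculated}"
            )
    # No stored value, a match, or a tolerated mismatch: return the fresh hash.
    return calculated
```

With no stored value the function simply returns the new hash, so both cases described above go through the same code path.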

jb4 commented 4 years ago

I noticed that too, shortly after I committed it, d'oh. I'll make a change to compare the two checksums, and add config options a little later.

tw4l commented 4 years ago

Hi @jb4, thanks, this is coming along nicely! Tomasz and I just had a chat about what we'd want to happen when the mismatch is noted and here's what we're thinking:

By default:

If "force" flag is enabled:

"preservation_note": "Checksum mismatch in EPrints on filename [filename], detected on [date]. Original md5 value ([md5 value stored in database]) overwritten with new value calculated from file on disk ([new md5 value calculated from file on disk])"

Does that seem feasible to you?
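For illustration, the note proposed above could be assembled roughly as follows. This is only a sketch: the function name `mismatch_note` is hypothetical, and the ISO date format is an assumption, since the comment does not say how `[date]` should be written.

```python
from datetime import date
from typing import Dict


def mismatch_note(filename: str, stored_md5: str, new_md5: str,
                  detected_on: date) -> Dict[str, str]:
    """Build the proposed "preservation_note" entry for a checksum mismatch.

    The date is written as YYYY-MM-DD, which is an assumption.
    """
    text = (
        f"Checksum mismatch in EPrints on filename {filename}, "
        f"detected on {detected_on.isoformat()}. "
        f"Original md5 value ({stored_md5}) overwritten with new value "
        f"calculated from file on disk ({new_md5})"
    )
    return {"preservation_note": text}
```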

photomedia commented 3 years ago

Because the metadata folder exported by the plugin had a typo in it (see: https://github.com/eprintsug/EPrintsArchivematica/commit/2d11e65775db177d8e39b2258890786e59841dc0), errors were not caught on transfer. Now that I fixed this, I am getting an error on transfer for the MD5 hashes. I always get the following error:

Microservice: Verify transfer checksums Job Verify metadata directory checksums Failed

Comparing transfer checksums with the supplied md5 file md5: comparison exited with status: 1. Please check the formatting of the checksums or integrity of the files.

I changed the separator between each hash and its file name to two spaces (it was originally only one space), but that doesn't seem to solve the issue. Any idea what could be going wrong with the formatting of the MD5 file? Why does Archivematica reject it, or fail to read from it? I have another example of an MD5 file that it accepts, and it seems to be in the same format.

photomedia commented 3 years ago

OK, I solved the MD5 checksum file incompatibility described above, which was giving "FAILED open or read" errors for the MD5 paths during transfer. The issue is solved with this commit: https://github.com/eprintsug/EPrintsArchivematica/commit/75bb6bf973076c336e87d166034cd953240ad9f6 For the MD5 file to pass the checksum check on transfer to Archivematica, it needs relative paths and two spaces between the checksum and the file path.
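As a rough illustration of that format (not the code from the commit), the sketch below builds manifest lines with two spaces between the checksum and a relative path. The function name `manifest_lines` and the `base_dir` parameter are hypothetical, and which directory the paths should be relative to is an assumption; check the commit for the layout actually used.

```python
from pathlib import PurePosixPath
from typing import Dict, List


def manifest_lines(checksums: Dict[str, str], base_dir: str) -> List[str]:
    """Format "<md5>  <relative path>" lines, as the md5sum tool writes them.

    `checksums` maps absolute file paths to MD5 hex digests. `base_dir` is
    the directory the paths are made relative to (an assumption here).
    """
    lines = []
    for file_path, md5 in checksums.items():
        relative = PurePosixPath(file_path).relative_to(base_dir)
        # Two spaces between the checksum and the path.
        lines.append(f"{md5}  {relative}")
    return lines
```

For example, `manifest_lines({"/export/1/objects/a.pdf": "5d41402abc4b2a76b9719d911017c592"}, "/export/1")` yields a single line with the hash, two spaces, and `objects/a.pdf`.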