eprintsug / EPrintsArchivematica

Digital Preservation through EPrints-Archivematica Integration - An EPrints export plugin to Archivematica

checksum mismatch vs missing #32

Open photomedia opened 3 years ago

photomedia commented 3 years ago

A checksum MISMATCH should only occur when there is an existing checksum in the EPrints database for a file, and it doesn't match what is being checked. In the case of a MISMATCH, what the system does should be controlled by this option. From documentation:

$c->{DPExport}={on-checksum-mismatch}=skip-proceed|halt

skip-proceed should be the default, meaning that the problematic eprint is flagged with an error in the eprint's digital preservation errors field, but the batch job continues. If 'halt' is chosen, the entire batch job that the problematic eprint is a part of halts.

This option still needs to be implemented; it is not yet in the code.
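To make the intended difference between the two settings concrete, here is a minimal sketch of the batch behaviour. It is written in Python purely for illustration (the plugin itself is Perl), and `EPrint`, `export_one`, `run_batch` and `ChecksumMismatch` are hypothetical names, not part of the plugin or of EPrints.

```python
from dataclasses import dataclass, field


class ChecksumMismatch(Exception):
    """Raised when a stored checksum differs from the one computed on disk."""


@dataclass
class EPrint:
    # Hypothetical stand-in for an EPrints record; not the plugin's data model.
    eprint_id: int
    stored_md5: str
    actual_md5: str
    dp_errors: list = field(default_factory=list)


def export_one(eprint):
    """Pretend export step that only verifies the checksum."""
    if eprint.stored_md5 != eprint.actual_md5:
        raise ChecksumMismatch(f"eprint {eprint.eprint_id}: checksum mismatch")


def run_batch(eprints, on_checksum_mismatch="skip-proceed"):
    """Export each eprint, applying the mismatch policy; return exported IDs."""
    if on_checksum_mismatch not in ("skip-proceed", "halt"):
        raise ValueError(f"unknown policy: {on_checksum_mismatch}")
    exported = []
    for eprint in eprints:
        try:
            export_one(eprint)
        except ChecksumMismatch as err:
            # Flag the problem on the eprint itself in both modes.
            eprint.dp_errors.append(str(err))
            if on_checksum_mismatch == "halt":
                raise  # stop the whole batch
            continue  # skip-proceed: move on to the next eprint
        exported.append(eprint.eprint_id)
    return exported
```

With the default policy, a batch of three eprints where the second has a mismatch would export the first and third and record the error on the second; with `"halt"`, the exception propagates and the third is never attempted.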

However, MISMATCH is not the same as a MISSING checksum in the EPrints database for a file/document. In this case, the system should do the following (from documentation):

For files with no MD5 value in the EPrints database:

- **Ensure that the file is actually part of this eprint**
- Generate a new MD5 from the file on disk
- Write the MD5 to the EPrints database
- Write the MD5 to the checksum.md5 manifest
- Note in the eprint's digital preservation warnings field that the MD5 was generated for the given file
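The steps above could be sketched roughly as follows. This is a Python illustration under stated assumptions, not the plugin's Perl code: the dictionary `db_md5` stands in for the EPrints database, `eprint_files` is an assumed set of file names known to belong to the eprint, and the md5sum-style manifest line (hash, two spaces, file name) is an assumption about the `checksum.md5` layout.

```python
import hashlib
from pathlib import Path


def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fill_missing_md5(path, eprint_files, db_md5, manifest_path, warnings):
    """Apply the missing-checksum steps to one exported file.

    eprint_files: set of file names belonging to the eprint (assumed input)
    db_md5: dict of file name -> MD5, standing in for the EPrints database
    """
    name = Path(path).name
    # Make sure the file really belongs to this eprint.
    if name not in eprint_files:
        raise ValueError(f"{name} is not part of this eprint")
    # Only act when the database has no MD5 for the file.
    if db_md5.get(name):
        return db_md5[name]
    # Generate a new MD5 from the file on disk.
    md5 = md5_of(path)
    # Write it to the (stand-in) database.
    db_md5[name] = md5
    # Append it to the checksum.md5 manifest.
    with open(manifest_path, "a", encoding="utf-8") as manifest:
        manifest.write(f"{md5}  {name}\n")
    # Note the generation in the warnings field.
    warnings.append(f"MD5 generated for {name}")
    return md5
```

Raising an error for a file that does not belong to the eprint, rather than silently hashing it, keeps a leftover file from acquiring a checksum in the manifest.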

The relevant code is here; it needs to distinguish the two cases, MISSING vs MISMATCH: https://github.com/eprintsug/EPrintsArchivematica/blob/9d5c1cc7c1010e44477d9f8ff2accb53192ca8a9/lib/plugins/EPrints/Plugin/Export/Archivematica/EPrint.pm#L341

UPDATE: for files with no MD5 in the EPrints database, there is also a THIRD possible error, which I did encounter: a pre-existing file in the "objects" directory of the export folder that doesn't belong to the eprint currently being exported. This happens because the current eprint export algorithm doesn't delete the objects folder before writing to it, so a file from a previous export can end up there. Such a file would not have a corresponding hash in the database either. I have therefore added "ensure that the file is actually part of this eprint" to the "no checksum in the database" steps above.
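One way to detect such leftovers is to compare what is on disk with what the eprint is expected to contain. The following is only a sketch in Python (the plugin is Perl); `find_stray_files` and its `expected_paths` argument are hypothetical, and how the plugin would obtain the expected list from the eprint's documents is left open.

```python
from pathlib import Path


def find_stray_files(objects_dir, expected_paths):
    """Return files under objects_dir that the current eprint does not own.

    expected_paths: relative POSIX-style paths (e.g. "doc1/paper.pdf") that
    the export is expected to write for this eprint (assumed input).
    """
    root = Path(objects_dir)
    on_disk = {
        p.relative_to(root).as_posix()
        for p in root.rglob("*")
        if p.is_file()
    }
    # Anything on disk that is not expected is a leftover from another export.
    return sorted(on_disk - set(expected_paths))
```

An export could run this check before computing checksums and either report the stray files as errors or remove them; which of the two is appropriate is a design decision the thread does not settle.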

photomedia commented 3 years ago

To summarize:

We need to review the logic around the errors thrown on checksum mismatch vs checksum missing. A missing checksum should be logged as such and, by default, generated and added to the EPrints database, provided the file is actually part of the eprint and not a leftover from a previous export. A checksum mismatch should still skip-proceed by default, whereas a missing checksum should only produce a log message saying that a checksum was generated and added for the file. This behaviour (on-checksum-missing) could also be controlled by a config flag with the values skip-proceed|halt|generate, with generate as the default.

photomedia commented 3 years ago

I added some code to differentiate between the two issues: checksum MISMATCH vs MISSING. In the MISSING case, a checksum is added for the file in the EPrints database and processing continues.

https://github.com/eprintsug/EPrintsArchivematica/commit/281ef81724ad455a6710c0abebe0290640d38994

photomedia commented 3 years ago

The missing-checksum case is now resolved with the following commit: https://github.com/eprintsug/EPrintsArchivematica/commit/c754a3e061f532578aa704ce8a96c9f9bcc8a15a. I updated the README as well, including the new "add_missing_checksums" configuration option.

The leftover to-do item is just to control what happens in case of checksum-mismatch: $c->{DPExport}={on-checksum-mismatch}=skip-proceed|halt