eprintsug / EPrintsArchivematica

Digital Preservation through EPrints-Archivematica Integration - An EPrints export plugin to Archivematica

Should all checksums be checked before generating an AIP #7

Closed: goetzk closed this issue 5 years ago

goetzk commented 5 years ago

In #5 there is discussion around what happens if a checksum is missing. Here I'd like to cover what happens if it is wrong, and whether we will be checking for that.

I think it's worth mentioning in this standard (even if it's optional) that checksums from the DB should be checked against the files on disk before they are packaged for export. After all, export seems like the right time to validate that the files we're exporting still match what the database thinks they are.

If the checksums don't match we may have a file corruption problem which will need to be investigated.
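As a rough illustration of the check described above, here is a minimal Python sketch that recomputes a file's MD5 and compares it with a stored value. The function names and the idea that the stored checksum arrives as a hex string are assumptions made for the example; they are not taken from the plugin or from EPrints.

```python
import hashlib


def md5_of_file(path, chunk_size=65536):
    """Return the hex MD5 digest of the file at `path`, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # Read fixed-size chunks so large files are not loaded into memory at once.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_matches(path, stored_md5):
    """Compare the on-disk digest with the value recorded in the database.

    `stored_md5` is assumed to be a hex string; whitespace and case are
    normalised before comparing.
    """
    return md5_of_file(path) == stored_md5.strip().lower()
```

A caller would run `checksum_matches` for each file just before packaging and treat a `False` result as a reason to stop and investigate.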

tw4l commented 5 years ago

I agree with @goetzk that existing checksums should be validated before export. We will need a check post-export as well, as the file transfer is probably the most likely culprit for a bit flip, but this can likely be handled by Archivematica using the supplied checksum manifest.
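To make the manifest idea concrete, here is a small Python sketch that writes an md5sum-style manifest for a directory and re-verifies it after a copy. The two-space `digest  relative/path` layout is an assumption chosen for the example; the file name and path convention Archivematica actually expects should be taken from its documentation.

```python
import hashlib
from pathlib import Path


def build_manifest(root):
    """Return md5sum-style text: one 'digest  relative/path' line per file."""
    root = Path(root)
    lines = []
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        digest = hashlib.md5(path.read_bytes()).hexdigest()
        lines.append(f"{digest}  {path.relative_to(root).as_posix()}")
    return "\n".join(lines) + "\n"


def verify_manifest(root, manifest_text):
    """Return the relative paths that are missing or whose digest differs."""
    root = Path(root)
    bad = []
    for line in manifest_text.splitlines():
        if not line.strip():
            continue
        expected, rel = line.split("  ", 1)
        target = root / rel
        if not target.is_file():
            bad.append(rel)
        elif hashlib.md5(target.read_bytes()).hexdigest() != expected:
            bad.append(rel)
    return bad
```

An empty list from `verify_manifest` means every listed file arrived intact; anything else names the files to investigate.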

photomedia commented 5 years ago

I have added a note to the README, which says that existing MD5 checksums should be rechecked, signalling an error if they fail to match. However, it is not clear what happens after that. Aside from signalling an error (how exactly should the error appear?), should the eprint be exported anyway with the regenerated MD5?

goetzk commented 5 years ago

I believe the export should stop and the admin should intervene. The main reason is that a checksum mismatch could indicate a problem, and hiding that problem by generating a new checksum will cause further problems later.

photomedia commented 5 years ago

I agree that a checksum mismatch on recheck should prevent that eprint from being exported, with an error recorded for this eprint. If this error comes up for a single eprint during a batch preservation export of all eprints, should we just exclude the problematic one with an error, but continue with the rest of the batch operation? I think that should be an option (either skip-proceed on error, or halt-on-error) in the configuration. What do you think?

mpbraendle commented 5 years ago

I think archival practice is that the whole batch should be rejected - there may be dependencies between the individual parts and if one part fails the SIP or AIP may be incomplete. Anyway, this is something that should already be handled on SIP generation.

tw4l commented 5 years ago

Given how we're planning to use this integration at Concordia (where there is a direct and 1:1 relation between eprint->SIP->AIP, with no dependencies between SIPs), I would prefer to default to failing only the eprint with a checksum mismatch, not cancelling the entire batch job. However, as there are clearly different use cases/thinking about this, I would be in favor of including a configuration option to instead cancel the whole batch.

It might also be worthwhile to add in an email notification about the failure, to quickly flag the issue for administrators.

photomedia commented 5 years ago

I agree @timothyryanwalsh, let's have this as a config option; otherwise, a single eprint failure would prevent batch jobs from completing until all errors are resolved. We will have the config option default to skip-proceed on error. The email-on-error notification is a good feature; I will add that to the spec as well.
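A rough Python sketch of how the two modes could behave in a batch loop follows. Everything named here (`run_batch`, the `on_error` values `"skip"` and `"halt"`, the `notify` callback standing in for the email hook) is hypothetical and only illustrates the behaviour discussed in this thread, not the plugin's actual configuration.

```python
from dataclasses import dataclass, field


class ChecksumMismatch(Exception):
    """Raised by the per-eprint export when a stored checksum fails recheck."""


@dataclass
class BatchResult:
    exported: list = field(default_factory=list)
    failed: dict = field(default_factory=dict)  # eprint id -> error message
    halted: bool = False


def run_batch(eprint_ids, export_one, on_error="skip", notify=None):
    """Export each eprint, applying the configured mismatch policy.

    on_error="skip": record the failure and continue (the proposed default).
    on_error="halt": record the failure and stop processing the batch.
    """
    if on_error not in ("skip", "halt"):
        raise ValueError(f"unknown on_error mode: {on_error!r}")
    result = BatchResult()
    for eprint_id in eprint_ids:
        try:
            export_one(eprint_id)
        except ChecksumMismatch as exc:
            result.failed[eprint_id] = str(exc)
            if notify is not None:
                notify(eprint_id, str(exc))  # e.g. send an email to admins
            if on_error == "halt":
                result.halted = True
                break
        else:
            result.exported.append(eprint_id)
    return result
```

Note that `"halt"` in this sketch only stops further processing; it does not undo eprints already exported earlier in the batch, so rejecting a whole batch would need extra cleanup.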

goetzk commented 5 years ago

Sounds good!