artefactual-sdps / enduro

A tool to support ingest and automation in digital preservation workflows
https://enduro.readthedocs.io/
Apache License 2.0

Problem: transfer names are not checked for duplicates before ingest #851

Open sallain opened 5 months ago

sallain commented 5 months ago

Is your feature request related to a problem? Please describe.

One of Artefactual's clients creates transfer packages with a consistent naming structure. The client creates many packages at a time and uploads them to Enduro en masse. Because their package creation and management method is relatively manual, there is a chance that human error will result in the same package being uploaded more than once, either within the same upload or over the course of any number of uploads, resulting in two AIPs with the same name. There are two possible reasons for a package to have the same name as a previous package: either it contains the same material and therefore has the same identifiers (which are used for the package name), or it is an error by the person creating the package. In either case, the user should be notified that an AIP with that name already exists in storage.

The need for unique package names also follows from best practice: two packages with the same name hinder searchability and could be considered a preservation risk, regardless of whether the contents are identical. Even though Archivematica/a3m's use of UUIDs prevents file naming collisions, users should be able to ensure that the human-readable or contextually significant part of the package name is also unique.

Describe the solution you'd like

Implement a check that compares the name of the new package to all packages that have been previously processed by the Enduro instance. If a duplicate name is detected, the user should be notified and the package should not be sent for ingest.

The check should be able to work across multiple transfer source locations.

Describe alternatives you've considered

None

Additional context

Legacy Enduro has implemented such a check, but I think there's a chance that it ONLY looks at the contents of a given batch, rather than across the full history of transfers. This implementation, I believe, doesn't look at the AIP store for transfer names. How this feature will work for an Enduro instance that is already in use is an open question - would love to hear opinions about how far back, if at all, the check should be looking.

Note that the desired solution doesn't suggest that we look for duplicate materials; that is, it doesn't need to see if the same image or video has already been preserved. In my opinion, that's a separate (and potentially more complicated) feature. This feature is just for transfer names.

djjuhasz commented 4 months ago

Here's where duplicate transfer name check functionality was added to artefactual-labs/enduro: https://github.com/artefactual-labs/enduro/pull/548/files

The user manual description of the rejectDuplicates option provides a good summary of how the check works:

rejectDuplicates (Boolean)

When enabled, the workflow will execute a check on the internal database for successfully completed transfers with the same transfer name as the currently processing package. If it finds a duplicate the transfer will fail.

Note that the "internal database" is the Enduro database - so it's only checking the name against other transfers successfully processed by Enduro.

aseles13 commented 3 months ago

Is there anything else we need to do with this issue @djjuhasz and @sallain? Or does !548 address this?

djjuhasz commented 3 months ago

@aseles13 we still need to implement a solution for this issue in SDPS Enduro - https://github.com/artefactual-labs/enduro/pull/548 only applies to "Legacy" Enduro.

Diogenesoftoronto commented 3 months ago

I looked at this, and it seems that fully solving the issue would require the check to work for arbitrary transfer source locations. That would mean looking at external databases in other preservation systems, for example an Archivematica Storage Service instance that has transfers. I am curious whether that is still the intended solution or whether we have decided that it would be scope creep for the Enduro project.

sallain commented 3 months ago

Per offline discussion, we're going to keep the first iteration simple - an internal database will record new transfers that are completed and check against that. The feature will not try to look back in time at packages ingested before the feature was implemented.

In the future I could see someone perhaps wanting to connect another database source, or maybe wanting to populate the internal database with historical transfers, but we won't worry about that right now.
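As a rough illustration of the first iteration described above, the following Go sketch records each transfer name once its ingest completes and checks new transfers against that record. Everything here is hypothetical: the `Registry` type, its methods, the `Ingest` function, and the location names are invented for the example, and an in-memory map stands in for the internal database. The registry starts empty, so nothing ingested before the feature existed is considered.

```go
package main

import (
	"fmt"
	"sync"
)

// Registry is a hypothetical stand-in for the internal database. It is keyed
// by transfer name only, so the check spans all transfer source locations.
type Registry struct {
	mu    sync.Mutex
	names map[string]string // transfer name -> location that first used it
}

// NewRegistry returns an empty registry: no historical transfers are loaded.
func NewRegistry() *Registry {
	return &Registry{names: make(map[string]string)}
}

// Seen reports whether a name has been recorded and, if so, from where.
func (r *Registry) Seen(name string) (string, bool) {
	r.mu.Lock()
	defer r.mu.Unlock()
	loc, ok := r.names[name]
	return loc, ok
}

// RecordCompleted stores the name of a transfer after a successful ingest.
func (r *Registry) RecordCompleted(name, location string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if _, ok := r.names[name]; !ok {
		r.names[name] = location
	}
}

// Ingest rejects a transfer whose name is already recorded; otherwise it
// records the name once the (omitted) preservation workflow has finished.
func Ingest(r *Registry, name, location string) error {
	if prev, ok := r.Seen(name); ok {
		return fmt.Errorf("transfer %q was already ingested from %q", name, prev)
	}
	// The real preservation workflow would run here.
	r.RecordCompleted(name, location)
	return nil
}

func main() {
	r := NewRegistry()
	fmt.Println(Ingest(r, "ABC-001", "location-a")) // first use of the name
	fmt.Println(Ingest(r, "ABC-001", "location-b")) // same name, other location
}
```

Because the check and the record are separate steps, two transfers with the same name that run concurrently could both pass the check in this sketch. A database-backed version would probably need a unique constraint or an atomic reservation of the name to close that gap.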