artefactual-sdps / enduro

A tool to support ingest and automation in digital preservation workflows
https://enduro.readthedocs.io/
Apache License 2.0

Problem: hidden files interfere with bag validation #850

Closed sallain closed 3 months ago

sallain commented 5 months ago

Is your feature request related to a problem? Please describe.

When submitting a transfer in the bag format, the bag validation check in Archivematica/a3m will fail if there are hidden files in the bag, because the hidden files are not included in the bag manifest. This is a particular issue for Mac users, since macOS often adds dotfiles (e.g. .DS_Store). Users can work around this by manually removing hidden files from the bags before they are transferred; however, this is both cumbersome and limiting, since the dotfiles can be recreated every time the user interacts with a file.

A bag validation failure in Archivematica stops the transfer process altogether, so the user has to identify the bag that errored out, remove the hidden files, and restart the ingest.

Describe the solution you'd like

I'd like to prevent any hidden files from being transferred with the bag. The solution should check for and remove hidden files before the bag transfer is ingested into Archivematica/a3m.

In Legacy Enduro, this is done at the point when the bag is copied from the transfer source location to the processing location. Any file beginning with a . is not copied.

This feature should be configurable, so that users can keep hidden files if they choose. It would also be preferable to allow users to edit the list of files that should be removed/ignored, as is done in Archivematica/a3m's "Remove hidden files and directories" and "Remove unneeded files" jobs.
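As a rough illustration of the copy-time filtering described above (not Legacy Enduro's actual code), here is a minimal Python sketch. The function name `copy_bag`, the `skip_hidden` flag, and the `DEFAULT_IGNORE_PATTERNS` list are all hypothetical names chosen for the example; only `shutil.copytree` and `shutil.ignore_patterns` are real standard-library calls.

```python
import shutil
from pathlib import Path

# Hypothetical default: skip any file or directory whose name starts with ".".
# Users could extend this list, e.g. with "Thumbs.db".
DEFAULT_IGNORE_PATTERNS = (".*",)


def copy_bag(src, dst, skip_hidden=True, ignore_patterns=DEFAULT_IGNORE_PATTERNS):
    """Copy an unzipped bag from src to dst, optionally skipping ignored names.

    dst must not already exist (shutil.copytree's default behaviour).
    """
    # When skip_hidden is False, everything is copied unchanged.
    ignore = shutil.ignore_patterns(*ignore_patterns) if skip_hidden else None
    shutil.copytree(src, dst, ignore=ignore)
    return Path(dst)
```

With the defaults, `copy_bag("source/bag", "processing/bag")` would leave `.DS_Store` behind, while `skip_hidden=False` keeps the bag exactly as deposited.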

Describe alternatives you've considered

The manual method mentioned above does work, but it is susceptible to human error and might need to be repeated whenever the user needs to look at the files in the bag.

Additional context

Note that this is only a requirement for bagged transfers. For standard and other non-bagged transfer types, Archivematica/a3m removes hidden files as a matter of course during the early stages of processing.

The client for whom this has been an issue uses unzipped bags, which both lends itself to the problem manifesting AND provides the easy solution of simply not copying dotfiles, as is done in Legacy Enduro. I'm not sure how the issue would be dealt with in a zipped bag, where the whole bag is copied as a single entity. Perhaps focusing on the unzipped bag example is the easiest starting point.
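For the zipped case, one possible approach (purely a sketch, not something Enduro is known to do) would be to filter entries while extracting the archive rather than after copying it. The function names `is_hidden` and `extract_without_hidden` are hypothetical; the `zipfile` calls are from the Python standard library.

```python
import zipfile
from pathlib import Path, PurePosixPath


def is_hidden(member_name):
    """True if any path component of a zip entry name starts with a dot."""
    return any(part.startswith(".") for part in PurePosixPath(member_name).parts)


def extract_without_hidden(zip_path, dest_dir):
    """Extract a zipped bag into dest_dir, skipping hidden entries.

    Returns the list of entry names that were skipped.
    """
    dest_dir = Path(dest_dir)
    skipped = []
    with zipfile.ZipFile(zip_path) as zf:
        for info in zf.infolist():
            if is_hidden(info.filename):
                skipped.append(info.filename)
                continue
            zf.extract(info, dest_dir)
    return skipped
```

Note that this skips every dotfile in the archive, including any that happen to be listed in the bag manifest, so the returned list would need to be checked before relying on it.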

jhsimpson commented 5 months ago

@sallain should there be a premis event recorded for the file removal?

djjuhasz commented 5 months ago

Note that due to issue #845, Enduro cannot currently process unzipped Bags that are uploaded via MinIO. Enduro should be able to process unzipped Bags using the filesystem watcher, but I've never tested this option to confirm it works.

djjuhasz commented 5 months ago

> @sallain should there be a premis event recorded for the file removal?

I think this raises an interesting point. If a hidden file (e.g. .DS_Store) is present when the Bag is created, then I think BagIt will add the file to the Bag manifest and checksum files, and in this case we should not remove the hidden file because it will cause validation to fail. If a hidden file is added after the Bag is created, then it will have no record in the Bag manifest or checksum files, so it must be removed for the Bag to validate.

In the second case, I don't think there is any need to add a PREMIS event about the removal of the hidden file - the file clearly was not meant to be part of the transfer payload.
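To make the two cases concrete, here is a minimal Python sketch that reports only the hidden payload files that have no manifest entry, i.e. the ones that would have to be removed for the Bag to validate. It is an illustration under simplifying assumptions, not the implementation used anywhere in Enduro: the function names are hypothetical, it only looks at the payload under data/, it does not verify checksums, and it does not decode percent-encoded paths in the manifest.

```python
from pathlib import Path


def listed_payload_paths(bag_dir):
    """Return the set of payload paths named in any manifest-<alg>.txt file."""
    listed = set()
    for manifest in Path(bag_dir).glob("manifest-*.txt"):
        for line in manifest.read_text(encoding="utf-8").splitlines():
            if not line.strip():
                continue
            # Each manifest line is "<checksum> <path>"; split once on whitespace.
            _checksum, path = line.split(None, 1)
            listed.add(path.strip())
    return listed


def unlisted_hidden_files(bag_dir):
    """Return hidden files under data/ that no payload manifest mentions."""
    bag_dir = Path(bag_dir)
    listed = listed_payload_paths(bag_dir)
    unlisted = []
    for path in sorted((bag_dir / "data").rglob("*")):
        if not path.is_file():
            continue
        rel = path.relative_to(bag_dir)
        hidden = any(part.startswith(".") for part in rel.parts)
        if hidden and rel.as_posix() not in listed:
            unlisted.append(rel.as_posix())
    return unlisted
```

A hidden file that was present when the Bag was created appears in the manifest and is left alone; one added afterwards is returned as a candidate for removal.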

sallain commented 5 months ago

@djjuhasz @jhsimpson The use case as described certainly falls into the latter category, which you could expand on to say that the hidden files are both unexpected and unwanted. However, in my opinion it's still a material change to the bag as deposited, regardless of whether or not the user intended for the hidden file to be there, so the question is - do we need to record that the system made this change in order for the system to be a responsible steward of this data?

Along with removing files, though, in order to record a PREMIS event the system would also need to ADD a file to the transfer. The current mechanism for recording an external PREMIS event is to create a premis.xml file, which contains the event. The premis.xml file is stored in the transfer's metadata directory, and the contents are parsed into the METS file. This prompts more questions:

So, is there enough value added by recording the file removal to say yes to question 1 and figure out a solution to question 2? Would love to hear your thoughts.

We don't need to copy the Archivematica way of doing it, but it's worth noting that Archivematica does not create a PREMIS event for the removal of hidden files.

djjuhasz commented 4 months ago

I've done a bunch of work on this issue on branch dev/issue-850-remove-hidden-files, but it's turned into a giant PR. I'm going to start again from main, and break up the changes into a number of smaller issues and PRs:

djjuhasz commented 3 months ago

@Diogenesoftoronto developed https://github.com/artefactual-sdps/remove-files-activity to remove hidden files from a transfer. Because we are planning to use that script in preprocessing as a child workflow there's no need to re-implement it here.

aseles13 commented 3 months ago

Does this issue need to be closed? Or are there things that still need to be done? @sallain

djjuhasz commented 2 months ago

I've created https://github.com/artefactual-sdps/temporal-activities/issues/2 for a new implementation of the remove hidden files activity.