archivematica / Issues

Issues repository for the Archivematica project
GNU Affero General Public License v3.0

Problem: Packages are being deleted following extract #116

Closed ross-spencer closed 5 years ago

ross-spencer commented 6 years ago

Expected behaviour

When packages are expected to be kept post extract, they must also appear in the AIP when downloaded. Further, the METS and BagIt manifests need to reflect that the package is still part of the archival package.

Current behaviour

The packages are not kept at present. It seems that the AIP is not validated against a manifest which describes the existence of the file before extract. The package is still in the SIP when in the Backlog, but it is not in the AIP when downloaded. I cannot quite tell yet where in the ingest workflow the file is being removed.

Steps to reproduce

  1. Find a group of files with a zip file and select it for transfer
  2. When the extract packages decision point comes up, select: Extract Packages
  3. When the delete packages after extraction decision point comes up, select: No
  4. From here, there are two things you can observe. The package is still visible in the SIP when downloaded from Backlog. The package will not be visible when the AIP is downloaded.

NB. This seems like it might be connected to some of the changes made in https://github.com/artefactual/archivematica/pull/1178 so it is likely some of the permutations there haven't been fully considered. There are likely complications due to the renaming of the package, but I haven't been able to narrow this down yet.

ross-spencer commented 6 years ago

For reference, the current workaround for this issue in the qa/1.x branch is either to not extract packages or to delete packages post extract.

jrwdunham commented 5 years ago

The source of this issue is the fact that PR 1178 modified the package extraction behaviour so that the original package file is now being renamed on disk: it is now suffixed with a date string, whereas before it was the extraction output directory that was suffixed with that date. Because of this renaming, the name on disk no longer matches the name of the package file in the Files table in the database.

However, simply modifying the extracted package's File record in the database so that its currentlocation attribute matches the correct path on disk is insufficient. This is because the has_packages.py::already_extracted function crucially assumes that the extracted files' currentlocation values have the package's currentlocation as a prefix. However, this assumption no longer holds because, as described above, the package file now has a date suffix while the extracted files do not.
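To illustrate the broken assumption, here is a minimal sketch. It is not the actual has_packages.py code: the helper name `has_package_prefix`, the example paths, and the date string format are all hypothetical and only stand in for the prefix check described above.

```python
def has_package_prefix(package_location, file_locations):
    """Simplified stand-in for the prefix assumption: some other file's
    location starts with the package's location."""
    return any(
        loc != package_location and loc.startswith(package_location)
        for loc in file_locations
    )


# Hypothetical paths; the date string format is an assumption.
# Before PR 1178: the extraction output directory carries the date suffix.
old_package = "%transferDirectory%objects/data.zip"
old_files = ["%transferDirectory%objects/data.zip-2018-08-01T10.00.00/a.txt"]

# After PR 1178: the package file itself carries the date suffix.
new_package = "%transferDirectory%objects/data.zip-2018-08-01T10.00.00"
new_files = ["%transferDirectory%objects/data.zip/a.txt"]

print(has_package_prefix(old_package, old_files))  # True
print(has_package_prefix(new_package, new_files))  # False
```

In the second case the package looks as if it was never extracted, even when its currentlocation is correct.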

The proposed solution is to introduce a new delimiter when suffixing the date string to the package file's currentlocation path. This AM_DATE_DELIMITER has the value '-AM-DATE-SFX-'. It should be highly unlikely that incoming data will contain files whose names contain this substring. Now when checking if a package has already been extracted we must remove the date and the delimiter and check for files and extraction events that match the package file's currentlocation minus the delimiter and date.
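A rough sketch of the idea, for illustration only. Only the `AM_DATE_DELIMITER` value comes from the description above; the function names, the example paths, the date string format, and the choice to match on the stripped path followed by `/` are assumptions, and the real change in the PR may differ (for example, it also checks extraction events, which this sketch omits).

```python
AM_DATE_DELIMITER = "-AM-DATE-SFX-"


def add_date_suffix(package_location, date_string):
    """Append the delimiter and date string to the package's location."""
    return package_location + AM_DATE_DELIMITER + date_string


def strip_date_suffix(package_location):
    """Drop the delimiter and everything after it; paths without the
    delimiter are returned unchanged."""
    return package_location.partition(AM_DATE_DELIMITER)[0]


def already_extracted_sketch(package_location, file_locations):
    """Hypothetical check: does any file live under the package's location
    once the delimiter and date have been removed?"""
    prefix = strip_date_suffix(package_location) + "/"
    return any(loc.startswith(prefix) for loc in file_locations)


# Hypothetical paths and date string.
package = add_date_suffix("%transferDirectory%objects/data.zip", "2018-08-01T10.00.00")
files = [package, "%transferDirectory%objects/data.zip/a.txt"]

print(package)
print(already_extracted_sketch(package, files))  # True
```

Matching on the stripped path plus `/` keeps the renamed package file itself, and siblings that merely share the name as a prefix, from counting as extracted content.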

The proposed solution has been implemented in PR https://github.com/artefactual/archivematica/pull/1245

sromkey commented 5 years ago

Unfortunately, suffixing the package, rather than the extraction output directory, is not an appropriate solution from an archival point of view because it changes the name of an original object. My understanding is that this necessitates a change in #1178, which is already merged. I'll file a blocking issue.

ross-spencer commented 5 years ago

Hi @sromkey, I don't disagree about partially reverting the work in artefactual/archivematica#1178 - the circumstances behind the change were less than ideal. But out of curiosity, is that rationale accurate given how the PREMIS records this change? One strategy I've seen in other systems' AIPs is to have all objects renamed to UUIDs, which might, for example, mitigate character encoding issues further up the chain. The METS/PREMIS becomes the canonical source for the original name. We also have the sanitize name microservice, so don't we already change the name of the original object?

sromkey commented 5 years ago

@ross-spencer I totally understand your points. I conferred with other analysts and our conclusion is it's easier to support renaming in the case of sanitization because we're consistent about it, while renaming packages to include a suffix is more arbitrary, especially because it's really a side-effect of trying to fix a different problem. It would be difficult to explain to a future user or different system why this name change took place. I hope that makes sense but am happy to keep chatting about it!

jrwdunham commented 5 years ago

I believe that this issue should be closed in favour of the following one, which more clearly describes our revised understanding of the core issue, given the discussion above: https://github.com/archivematica/Issues/issues/201