archivematica / Issues

Issues repository for the Archivematica project
GNU Affero General Public License v3.0

Problem: Archivematica should not require 1-to-1 relationship between originals and access files in bags #642

Open ablwr opened 5 years ago

ablwr commented 5 years ago

Expected behaviour

Unzipped bags with access folders should process through Archivematica without failing.

Current behaviour

I've been working on a client's files, and they are experiencing failures with bags structured in a specific way.

The (unzipped) test bag contains, for example, a directory structure approximately like this:

├── data
│   └── objects
│       ├── access
│       │   └── file_a.mp3
│       ├── file_a.flac
│       └── file_a.wav

The structure has two master files (FLAC and WAV) and one access file (MP3), all sharing the same base name. This is fine for a Standard transfer, but not for an Unzipped Bag transfer.

If one of the master files is removed (leaving a 1-to-1 relationship between original and access file), the Unzipped Bag transfer will succeed.

If this is meant to occur, the transfer should fail in a more specific way and tell the user why, perhaps at the Normalization stage, if the issue is an imbalance between the preservation and access copies of a file.

But I think this is a bug, because it fails 1) at different parts of the workflow and 2) during the Move microservices, rather than in an understood way, as defined by Archivematica's documentation and its way of handling objects, bagged or not.

What are the differences in workflow between an Unzipped Bag transfer and a Standard transfer, other than the initial bag-checking/verification stage? For some reason, the unzipped bag sent as a Standard transfer type will process without problems, and fails only when sent as an Unzipped Bag transfer type.

I suppose this also reveals that the expected structure of a transfer is not very clear. Are there rules that prevent anything other than a 1-to-1 relationship between access and preservation objects? If so, should the Standard transfer be failing too? Either way, the documentation should be clearer about what is and is not acceptable for these "special" files/folders in Archivematica.

But for the scope of this issue: users are not able to process unzipped bags with a structure like the one above (and possibly other structures as well).

Steps to reproduce

Create an unzipped bag that contains two preservation objects with the same name, and one access object with the same name.
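
For anyone reproducing this, here is a minimal sketch that builds the layout described above using only the Python standard library. The function name `make_test_bag` and the placeholder file contents are assumptions made for illustration; they are not part of Archivematica. The payload bytes are not real audio, so swap in real files if format identification matters for your test.

```python
import hashlib
from pathlib import Path


def make_test_bag(root):
    """Create an unzipped bag with two originals and one access file
    that all share the base name "file_a" (hypothetical test fixture)."""
    root = Path(root)
    objects = root / "data" / "objects"
    (objects / "access").mkdir(parents=True, exist_ok=True)

    # Placeholder payloads: only the names and layout matter for this sketch.
    payload = {
        objects / "file_a.flac": b"placeholder flac",
        objects / "file_a.wav": b"placeholder wav",
        objects / "access" / "file_a.mp3": b"placeholder mp3",
    }
    for path, content in payload.items():
        path.write_bytes(content)

    # Bag declaration required by the BagIt format.
    (root / "bagit.txt").write_text(
        "BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )

    # Payload manifest: one "<md5>  <path relative to bag root>" line per file.
    lines = [
        f"{hashlib.md5(p.read_bytes()).hexdigest()}  {p.relative_to(root).as_posix()}"
        for p in sorted(payload)
    ]
    (root / "manifest-md5.txt").write_text("\n".join(lines) + "\n", encoding="utf-8")
    return root
```

Calling `make_test_bag("double-master-bag")` writes the bag into that directory, which can then be placed in a transfer source location.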

Your environment (version of Archivematica, OS version, etc)

CentOS AM 1.9, Ubuntu AM 1.9x



ross-spencer commented 5 years ago

Hi @ablwr, if it helps, I often describe transfer as a funnel into ingest. By the time we reach normalization, a lot of the differences between transfer types have been ironed out. As such, you can recreate your issue with multiple transfer types:

image

But in the standard transfer it's important to select the objects directory, or the data directory specifically. Otherwise the Check for access directory job will find no access directory in the SIP, as it is looking too far up the directory tree for that information.

image

That should help narrow down the questions around what an outcome for this issue looks like.

sallain commented 5 years ago

I tested this without a bag, just to see what Archivematica's default behaviour would be with a standard transfer. My transfer looked like this:

double-master/
├── access
│   └── sky.jpg
├── sky.png
└── sky.tif

Transferring this as a Standard transfer resulted in failure at Normalization, specifically at Job Check for Access directory. This is more or less what I expected - Archivematica wants there to be a one-to-one relationship between the originals and the access copies. However, there's no stdout/stderr output at all for that job.

I agree with @ablwr's suggestion in the initial comment that, at the very least, the error reporting around this issue should be improved. Possibly the documentation should also state that there must be a 1-to-1 relationship between originals and derivatives for manually normalized materials.
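
Until the error reporting improves, a pre-flight check along these lines could flag the imbalance before starting a transfer. This is only a sketch: `find_unbalanced_access_files` is a hypothetical helper, not Archivematica code, and it assumes that originals and access copies are paired by file name without the extension, which is what the behaviour in this thread suggests but is not confirmed here. It also only looks at files directly inside the objects directory and its `access` subdirectory.

```python
from collections import defaultdict
from pathlib import Path


def find_unbalanced_access_files(objects_dir, access_dirname="access"):
    """Return {access file name: [matching original names]} for every
    access file that does not match exactly one original by base name."""
    objects_dir = Path(objects_dir)
    access_dir = objects_dir / access_dirname

    # Group top-level originals by base name (file name without extension).
    originals_by_stem = defaultdict(list)
    for path in objects_dir.iterdir():
        if path.is_file():
            originals_by_stem[path.stem].append(path.name)

    problems = {}
    if access_dir.is_dir():
        for access_file in sorted(access_dir.iterdir()):
            if not access_file.is_file():
                continue
            matches = sorted(originals_by_stem.get(access_file.stem, []))
            # Zero matches or more than one match breaks the 1-to-1 pairing.
            if len(matches) != 1:
                problems[access_file.name] = matches
    return problems
```

For the `double-master/` layout above, this returns `{"sky.jpg": ["sky.png", "sky.tif"]}`; once one of the two originals is removed it returns an empty dict.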

Version: 1.9.1

sallain commented 5 years ago

I also tested this with an unzipped bag, with the same basic structure:

double-master-bag/
├── bag-info.txt
├── bagit.txt
├── data
│   ├── access
│   │   └── sky.jpg
│   ├── sky.png
│   └── sky.tif
├── manifest-md5.txt
└── tagmanifest-md5.txt

This also failed at Job: Check for Access directory, which is good because it means that the failure is happening at the same point and that the output is the same (that is, nothing).

@ablwr I think it's possible that other issues in the bag you were working with are causing the random failures that you're experiencing. This particular failure seems to be at least consistent across transfer types, if not helpful.

sallain commented 5 years ago

Note: the unzipped bag transfer in my previous comment also fails in 1.8

sallain commented 5 years ago

@ablwr since it seems like the specific problem here is having two master files and one access file, it might make sense to change the issue name to reflect that. When that is the only problem with the bag, the failures do happen consistently.

ablwr commented 5 years ago

Thanks so much for investigating! I suppose this is more of a feature request than a bug. I will change the name.