archivematica / Issues

Issues repository for the Archivematica project
GNU Affero General Public License v3.0

Respect bag-it fetch.txt file #583

Open bruth opened 5 years ago

bruth commented 5 years ago

Please describe the problem you'd like to be solved.

The BagIt spec defines a fetch.txt file for referring to remote files that should be considered as part of the bag. Per this linked section:

Every file listed in the fetch file MUST be listed in every payload manifest. A fetch file MUST NOT list any tag files.
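For illustration, each line of a fetch.txt file has three whitespace-separated fields: the URL, the length in octets (or `-` when unknown), and the payload path inside the bag. The following is a minimal sketch of reading such a file; the names `FetchEntry` and `parse_fetch` are hypothetical and are not part of Archivematica or any BagIt library.

```python
from typing import List, NamedTuple, Optional


class FetchEntry(NamedTuple):
    """One line of fetch.txt (hypothetical helper type)."""
    url: str
    length: Optional[int]  # None when the length field is "-"
    path: str


def parse_fetch(text: str) -> List[FetchEntry]:
    """Parse the contents of a fetch.txt file into a list of entries."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        # Split on the first two whitespace runs only, so the path may contain spaces.
        url, length, path = line.split(None, 2)
        size = None if length == "-" else int(length)
        entries.append(FetchEntry(url, size, path.strip()))
    return entries
```

The example URL and paths used to exercise this sketch are placeholders, not files from this issue.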

It does not appear that this file is respected. I only tested this on AM 1.7, but the 1.9 demo does not include a bag example utilizing the fetch.txt file either.

Describe the solution you'd like to see implemented.

A basic solution would be to respect this file as a fallback if a file in the manifest is not physically present in the data/ directory. If a fetch.txt file exists and the path from the manifest exists in fetch.txt then validation should pass.
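A rough sketch of that fallback, assuming the manifest paths and the fetch.txt paths have already been read into plain lists of strings. The function name `missing_payload_files` is hypothetical; this is not how Archivematica or bagit-python currently behave.

```python
from pathlib import Path
from typing import Iterable, List


def missing_payload_files(
    bag_dir: Path,
    manifest_paths: Iterable[str],
    fetch_paths: Iterable[str],
) -> List[str]:
    """Return manifest paths that are neither on disk nor listed in fetch.txt."""
    fetchable = set(fetch_paths)
    missing = []
    for rel_path in manifest_paths:
        on_disk = (bag_dir / rel_path).is_file()
        # A file absent from data/ is tolerated only if fetch.txt lists it.
        if not on_disk and rel_path not in fetchable:
            missing.append(rel_path)
    return missing
```

With this check, completeness validation would pass when the returned list is empty, and the error message would name only the files that cannot be found anywhere.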

An additional check (which should be optional) would be to validate the checksums of the remote files. This may not be desirable or feasible (authentication required, very large files, etc.). However, it could be a nice additional option.
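One way the optional check could work is to hash the remote content as it streams, so that a large file never has to be held in memory. This is only a sketch: the function name `stream_matches_checksum` and its parameters are hypothetical, and it leaves authentication and retries out entirely.

```python
import hashlib
from typing import BinaryIO


def stream_matches_checksum(
    stream: BinaryIO,
    expected_hex: str,
    algorithm: str = "sha256",
    chunk_size: int = 1024 * 1024,
) -> bool:
    """Hash a binary stream in chunks and compare with the manifest checksum."""
    digest = hashlib.new(algorithm)
    # Read fixed-size chunks until the stream returns an empty bytes object.
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest() == expected_hex.lower()
```

Any file-like object opened in binary mode can be passed in, for example the response returned by `urllib.request.urlopen(url)` for a URL taken from fetch.txt.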

Describe alternatives you've considered.

n/a

Additional context

n/a



sromkey commented 5 years ago

Thanks for this idea @bruth . You are correct that Archivematica has never implemented a fetch file in its bags.

bruth commented 5 years ago

Thanks @sromkey. I believe this check is performed here? Are there other client scripts that would need to be aware of the fetch.txt file natively? In our use case, we will be validating the bag prior to uploading it to the AM transfer space, but we just don't want validation to fail due to the scenario I stated above.

One workaround would be to simply not choose "zipped/unzipped bag" as the transfer type. However, I was not sure how the contents would be processed or restructured if AM doesn't know it's a bag.

sromkey commented 5 years ago

Ohhh I'm sorry- I misunderstood completely. I thought your request was for the bag that Archivematica makes as the AIP support fetch! Do you have a sample bag I could test with by chance? I'm curious to see the behaviour. If you can, contact me off github- sromkey [at] artefactual.com

bruth commented 5 years ago

I thought your request was for the bag that Archivematica makes as the AIP support fetch!

Ah! Right, that would be more difficult to manage for sure. Here is a minimal bag example with a single file listed in the manifest at the path data/TechCrunchcontinentalUSA.csv (with the correct checksum) and an entry in fetch.txt that maps the remote URL to that same path.

The output error from AM (v1.7) was:

Result is false.
(error) Payload manifest manifest-sha256.txt contains missing file(s): [data/TechCrunchcontinentalUSA.csv]

So it did not acknowledge the entry in the fetch.txt even though that data path is listed.

sevein commented 5 years ago

Hi @bruth. We've recently pushed changes (not released yet) to use bagit-python where bagit-java v4 (via CLI) was used before. I thought that could bring different results but I did a quick test using your minimal bag example and the library raises a bagit.BagValidationError:

>>> import os
>>> import bagit
>>> bagit.VERSION
'1.7.0'
>>> bagit.Bag(os.getcwd()).validate()
data/TechCrunchcontinentalUSA.csv exists in manifest but was not found on filesystem
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/jesus/.local/lib/python3.6/site-packages/bagit.py", line 603, in validate
    processes=processes, fast=fast, completeness_only=completeness_only
  File "/home/jesus/.local/lib/python3.6/site-packages/bagit.py", line 785, in _validate_contents
    self._validate_completeness()
  File "/home/jesus/.local/lib/python3.6/site-packages/bagit.py", line 853, in _validate_completeness
    raise BagValidationError(_("Bag validation failed"), errors)
bagit.BagValidationError: Bag validation failed: data/TechCrunchcontinentalUSA.csv exists in manifest but was not found on filesystem

validate() did not raise when the file had been previously downloaded.

A basic solution would be to respect this file as a fallback if a file in the manifest is not physically present in the data/ directory. If a fetch.txt file exists and the path from the manifest exists in fetch.txt then validation should pass.

So maybe this could be done, but if I understood correctly that'd be a change upstream in bagit-python. This issue seems related: https://github.com/LibraryOfCongress/bagit-python/issues/118.

We could also have Archivematica download the files before validation but I understand that wouldn't be always desirable or something that would work consistently.

bruth commented 5 years ago

Thanks @sevein. I will look into that bagit-python issue and see if I can help move that issue along.

We could also have Archivematica download the files before validation but I understand that wouldn't be always desirable or something that would work consistently.

We are working with bags containing many TBs of data (genomic data in this case), which is why we are using the fetch.txt file to begin with. We have an external process for managing that data in content-addressable storage, which makes it much easier to produce a fetch.txt file and manifest entries (containing the hash) for those large files.

ross-spencer commented 5 years ago

nb. now supported in Steffen's Golang bag tool: https://github.com/steffenfritz/bagit/pull/6