Download files in fetch.txt

kba commented 5 years ago

How would I go about completing an incomplete bag, which has files referenced in fetch.txt not present in /data?

Is this outside the domain of the tool or just not implemented? Or have I missed something?

If the latter, would this be an interesting feature for bagit-python or should we implement it on our side?

kba commented 5 years ago

BTW, an ad-hoc solution without any checks etc is this bash one-liner:

while read -r url size fpath; do mkdir -p "$(dirname "$fpath")"; wget -O "$fpath" "$url"; done < fetch.txt
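For comparison, the same idea can be sketched in Python using only the standard library. This is a minimal illustration and not part of bagit-python: the helper names `parse_fetch_txt` and `fetch_missing` are hypothetical, and the sketch assumes each fetch.txt line has the form `URL LENGTH FILEPATH` separated by whitespace. It adds one check the one-liner lacks (it refuses paths that would land outside the bag directory) but still has no retries, checksum verification, or size checks.

```python
import shutil
import urllib.request
from pathlib import Path


def parse_fetch_txt(bag_dir):
    """Yield (url, length, relative_path) for each non-empty line of fetch.txt."""
    with open(Path(bag_dir) / "fetch.txt", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            # Assumed layout: "URL LENGTH FILEPATH". The path comes last,
            # so split at most twice and keep the remainder as the path.
            url, length, rel_path = line.split(None, 2)
            yield url, length, rel_path


def fetch_missing(bag_dir):
    """Download every fetch.txt entry that is not already on disk.

    Returns the list of relative paths that were downloaded.
    """
    bag_dir = Path(bag_dir).resolve()
    fetched = []
    for url, _length, rel_path in parse_fetch_txt(bag_dir):
        target = (bag_dir / rel_path).resolve()
        # Refuse paths that would escape the bag directory.
        if bag_dir not in target.parents:
            raise ValueError(f"unsafe path in fetch.txt: {rel_path}")
        if target.exists():
            continue
        target.parent.mkdir(parents=True, exist_ok=True)
        with urllib.request.urlopen(url) as resp, open(target, "wb") as out:
            shutil.copyfileobj(resp, out)
        fetched.append(rel_path)
    return fetched
```

Calling `fetch_missing("path/to/bag")` would fill in the missing payload files; verifying checksums is deliberately left to a separate validation pass over the completed bag.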

bruth commented 5 years ago

Hi @kba and @acdha, I am cross-posting here from the issue @sevein referenced (visible above): https://github.com/archivematica/Issues/issues/583. I also want support for the fetch.txt file; however, I only need validation, not automatic downloading of the files. For my use case, I have bags that reference TBs of data that are already in archive-quality, content-addressable storage.

My team and I are happy to contribute reviews or code to get, at a minimum, the validation functionality in, even without fetching of the files. My feeling is that the default should be to validate without fetching, and that a fetch should only occur when a parameter requests it.

acdha commented 5 years ago

Once the files have been downloaded, the regular bag validation process will handle it. We've been hesitant to put download support into bagit-python because it generally tends to get into a fair amount of code — people tend to ask for things like queuing, retries, concurrency controls, credentials & session management, storage management & cross-bag caching for identical files, etc. and have different opinions about what the answers to those look like.

I think there's a fairly reasonable argument to finish #119 and basically tell people that if they need anything more advanced it's probably best to use whatever system they prefer and simply use bagit-python to validate the final results.

bruth commented 5 years ago

> Once the files have been downloaded, the regular bag validation process will handle it. We've been hesitant to put download support into bagit-python because it generally tends to get into a fair amount of code

Yes, I agree with that. I support doing only the validation (looking up a data file's entry in fetch.txt when it is listed in the manifest file) and not downloading anything.

> I think there's a fairly reasonable argument to finish #119

But this does involve downloading the files. Doesn't this contradict what you said above?

acdha commented 5 years ago

I was just explaining why it hasn't happened before now. I do think there is a valid convenience argument for having a basic downloader for people who don't want anything fancy, however, so I'm open to accepting that pull-request as long as it doesn't get too complicated.

bruth commented 5 years ago

OK, understood. The #119 PR doesn't seem to validate the contents of fetch.txt against the manifest, so that could be a separate PR, correct? If so, my team would be happy to contribute it.

acdha commented 5 years ago

I think the idea is that we'd have a simple fetch function and then immediately call validate() afterwards. It looks like https://github.com/LibraryOfCongress/bagit-python/pull/119#issuecomment-444962002 also has some additional validation checks for things listed in fetch.txt which aren't in the manifests, which we should probably handle now as well, though likely as a separate PR.

bruth commented 5 years ago

Right, and the follow-up comment from @kba asserts the need to validate the fetch.txt file regardless of whether the files are downloaded. Again, for my use case, we don't want to download them simply for validation. So, just to reiterate the scope of this feature, there are two goals:

1. Validate the fetch.txt file if present
   - Check that URLs are valid (well-formed)
   - Assert the path is listed in the manifest
   - This would be baked into the existing validation step for the bag
2. Support for downloading the files in fetch.txt
   - This would be separate from validation
   - Fetched files would be materialized into the data/ directory

Is this accurate?
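To make the first goal concrete, here is a rough sketch of what a download-free check of fetch.txt could look like. It is not bagit-python's implementation: the function names `read_manifest_paths` and `check_fetch_txt` are hypothetical, "well-formed" is interpreted loosely as "has a URL scheme plus a host or path", and the sketch assumes manifest lines of the form `CHECKSUM FILEPATH` and fetch.txt lines of the form `URL LENGTH FILEPATH`.

```python
from pathlib import Path
from urllib.parse import urlparse


def read_manifest_paths(bag_dir):
    """Collect payload paths listed in every manifest-<algorithm>.txt file."""
    paths = set()
    for manifest in Path(bag_dir).glob("manifest-*.txt"):
        for line in manifest.read_text(encoding="utf-8").splitlines():
            line = line.strip()
            if line:
                # Assumed layout: "CHECKSUM FILEPATH".
                _checksum, rel_path = line.split(None, 1)
                paths.add(rel_path)
    return paths


def check_fetch_txt(bag_dir):
    """Return a list of problems in fetch.txt; an empty list means none found.

    Nothing is downloaded: only the URL syntax and manifest membership
    of each entry are checked.
    """
    fetch_file = Path(bag_dir) / "fetch.txt"
    if not fetch_file.exists():
        return []
    manifest_paths = read_manifest_paths(bag_dir)
    problems = []
    lines = fetch_file.read_text(encoding="utf-8").splitlines()
    for lineno, line in enumerate(lines, 1):
        line = line.strip()
        if not line:
            continue
        parts = line.split(None, 2)
        if len(parts) != 3:
            problems.append(f"line {lineno}: expected 'URL LENGTH FILEPATH'")
            continue
        url, _length, rel_path = parts
        parsed = urlparse(url)
        # Loose well-formedness test: a scheme plus a host or a path.
        if not parsed.scheme or not (parsed.netloc or parsed.path):
            problems.append(f"line {lineno}: malformed URL {url!r}")
        if rel_path not in manifest_paths:
            problems.append(f"line {lineno}: {rel_path!r} is not in any manifest")
    return problems
```

A check like this could run as part of the existing validation step, while the second goal (downloading into data/) would stay a separate, opt-in operation.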

LibraryOfCongress / bagit-python

Download files in fetch.txt #118