Open kba opened 5 years ago
Thanks for diving in to add the fetch functionality @kba. I wonder if it might be a bit more readable to rename fetch_files_to_be_fetched() to fetch() and have it take an optional force parameter that would re-download things that are already present?
Also, I haven't worked with fetch.txt files much before. But I'm kind of surprised that the test suite considers a bag valid if it has a fetch.txt containing an item that is not present in the payload directory.
See: https://github.com/LibraryOfCongress/bagit-python/blob/master/test.py#L1023
I wonder if it might be a bit more readable to rename fetch_files_to_be_fetched() to fetch()
Fine with me, I wanted to avoid confusion with fetch_entries.
and have it take an optional force parameter that would re-download things that are already present?
Sure.
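A minimal sketch of what the renamed method could look like, written as a standalone function rather than as the actual Bag method. The entries tuple layout and the download callable are assumptions made for illustration; they are not part of the current bagit-python API.

```python
import os


def fetch(bag_dir, entries, download, force=False):
    """Download fetch.txt entries into bag_dir.

    entries:  iterable of (url, length, path) tuples (hypothetical layout).
    download: callable(url, target_path) that writes the file (hypothetical hook).
    force:    when True, re-download files that are already present.
    Returns the list of paths that were actually downloaded.
    """
    fetched = []
    for url, _length, path in entries:
        target = os.path.join(bag_dir, path)
        # Skip files that already exist unless the caller forces a re-download.
        if os.path.exists(target) and not force:
            continue
        parent = os.path.dirname(target)
        if parent:
            os.makedirs(parent, exist_ok=True)
        download(url, target)
        fetched.append(path)
    return fetched
```

Passing the downloader in as a callable keeps the sketch testable without network access and leaves room for an overrideable fetch implementation.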
I'm kind of surprised that the test suite considers a bag valid if it has a fetch.txt containing an item that is not present in the payload directory.
These tests break the "Every file listed in the fetch file MUST be listed in every payload manifest" rule and it isn't validated. fetch_entries should not just check for unsafe filenames but also ensure that each file is listed in payload_entries. The validation only checks data on disk and manifest entries. That is a bug.
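The missing check could be sketched roughly as below. This is a standalone illustration, not the existing validation code: the function name and the manifests mapping (algorithm name to the set of paths that manifest lists) are assumptions.

```python
def unlisted_fetch_paths(fetch_paths, manifests):
    """Map each fetch.txt path to the payload manifests that do not list it.

    fetch_paths: iterable of payload paths taken from fetch.txt.
    manifests:   dict of algorithm name -> set of paths in that payload
                 manifest (hypothetical structure).
    An empty result means the rule is satisfied.
    """
    missing = {}
    for path in fetch_paths:
        absent = sorted(alg for alg, listed in manifests.items()
                        if path not in listed)
        if absent:
            missing[path] = absent
    return missing
```

A validator could raise an error whenever the returned dict is non-empty.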
Since the manifests determine the number and size of files, it could make sense to allow "bag with holes" validation against only the files not mentioned in fetch.txt with a special parameter though, if you don't want to fetch the whole thing. By default, a bag with a fetch.txt containing an item that is not present in the payload directory should not be valid, you're right.
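As a rough sketch of that idea, with allow_holes standing in for the special parameter (the name and the function are hypothetical, not existing bagit-python options):

```python
def missing_payload_files(manifest_paths, on_disk, fetch_paths=(), allow_holes=False):
    """Return the manifest paths that are required but absent from disk.

    By default every path in the manifest must be on disk. With
    allow_holes=True, paths listed in fetch.txt are not required,
    so a partially fetched bag can still pass this check.
    """
    expected = set(manifest_paths)
    if allow_holes:
        expected -= set(fetch_paths)
    return sorted(expected - set(on_disk))
```

Files that are present on disk would still need their checksums verified; this only covers the completeness part of validation.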
I guess we could consider validation as a separate issue from this PR though.
Adds a new method Bag.fetch_files_to_be_fetched() that fetches files listed in fetch.txt, c.f. #118. If this is useful for someone, it can be further refined (CLI, overrideable fetch implementation, anti-hammering interval).
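For readers who have not worked with fetch.txt: each line holds a URL, a length (or "-" when unknown) and a payload path, separated by whitespace. A minimal parsing sketch follows; the function name is hypothetical and this is not the parser used in the PR.

```python
def parse_fetch_lines(lines):
    """Parse fetch.txt lines into (url, length, path) tuples.

    Splits on the first two runs of whitespace only, so a path that
    contains spaces stays in one piece. Blank lines are skipped.
    """
    entries = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        url, length, path = line.split(None, 2)
        entries.append((url, length, path))
    return entries
```

The length is kept as a string here because it may be "-"; converting it to an integer is left to the caller.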