LibraryOfCongress / bagit-python

Work with BagIt packages from Python.
http://libraryofcongress.github.io/bagit-python
218 stars 85 forks source link

Download files in fetch.txt #119

Open kba opened 5 years ago

kba commented 5 years ago

Adds a new method Bag.fetch_files_to_be_fetched() that fetches files listed in fetch.txt, c.f. #118.

If this is useful for someone, can be further refined (CLI, overrideable fetch implementation, anti-hammering interval).

coveralls commented 5 years ago

Coverage Status

Coverage decreased (-1.7%) to 81.842% when pulling 1de38c481bcb88ff2a4bf1258f44532f60bcb8a7 on kba:fetch-files into c39b650a80837a1d32314599143d0b5b04159248 on LibraryOfCongress:master.

coveralls commented 5 years ago

Coverage Status

Coverage increased (+0.09%) to 83.595% when pulling 6fda25c0103ec38c040052b57f5aec83f3dc1840 on kba:fetch-files into 8a8263e02d4b2bb95de4835e8d49fc70fd964f97 on LibraryOfCongress:master.

edsu commented 5 years ago

Thanks for diving in to add the fetch functionality @kba. I wonder if it might be a bit more readable to rename fetch_files_to_be_fetched() to fetch() and have it take an optional force parameter that would re-download things that are already present?

edsu commented 5 years ago

Also, I haven't worked with fetch.txt files much before. But I'm kind of surprised that the test suite considers a bag valid if it has a fetch.txt containing an item that is not present in the payload directory.

See: https://github.com/LibraryOfCongress/bagit-python/blob/master/test.py#L1023

kba commented 5 years ago

I wonder if it might be a bit more readable to rename fetch_files_to_be_fetched() to fetch()

Fine with me, I wanted to avoid confusion with fetch_entries.

and have it take an optional force parameter that would re-download things that are already present?

Sure.

I'm kind of surprised that the test suite considers a bag valid if it has a fetch.txt containing an item that is not present in the payload directory.

These tests break the "Every file listed in the fetch file MUST be listed in every payload manifest" rule and it isn't validated. fetch_entries should not just check for unsafe filenames but ensure files is also listed in payload_entries. The validation only checks data on disk and manifest entries. That is a bug.

Since the manifests determines the number and size of files, it could make sense to allow "bag with holes" validation against only the files not mentioned in fetch.txt with a special parameter though, if you don't want to fetch the whole thing. By default,

if it has a fetch.txt containing an item that is not present in the payload directory

should not be valid, you're right.

edsu commented 5 years ago

I guess we could consider validation as a separate issue from this PR though.