fatiando / pooch

A friend to fetch your data files
https://www.fatiando.org/pooch

Delete archives, retain the extracted files, and don't re-download #158

Open matthewturk opened 4 years ago

matthewturk commented 4 years ago

Description of the desired feature

For big archives, it is sometimes desirable to delete the initial archive file (against which the hash is computed), retain the extracted files, and not require a full re-download the next time the data is used.

Granted, this opens up a corruption vector, since the extracted files might be modified, but I think this is unlikely to be a problem in practice.

It might be possible to do this by having a zero-size stamp or something that says, "verified," but I don't really know what would fit best.

A common pattern in yt is:

ds = yt.load_sample("IsolatedGalaxy")

and this extracts a .tar.gz file. But it's a couple hundred megabytes, so sometimes we might want to delete the intermediate archive once it has been extracted, so that it doesn't double up the storage requirements. (One could even imagine a remove_intermediate option!) We'd have to record that the file doesn't need to be re-downloaded, though.
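For reference, the equivalent pattern written directly against Pooch looks roughly like this (the URL, file name, and cache path are made up for illustration):

```python
import pooch

# Illustration only: the URL, file name, and cache path are hypothetical.
# After this call, the cache holds both IsolatedGalaxy.tar.gz and the
# unpacked IsolatedGalaxy.tar.gz.untar/ directory, doubling disk usage.
fnames = pooch.retrieve(
    url="https://example.org/samples/IsolatedGalaxy.tar.gz",
    known_hash=None,  # would normally be the sha256 from a registry
    fname="IsolatedGalaxy.tar.gz",
    path=pooch.os_cache("yt"),
    processor=pooch.Untar(),
)
```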

Does this seem like a possibility?

Are you willing to help implement and maintain this feature? Yes

welcome[bot] commented 4 years ago

👋 Thanks for opening your first issue here! Please make sure you filled out the template with as much detail as possible.

You might also want to take a look at our Contributing Guide and Code of Conduct.

leouieda commented 4 years ago

Hi @matthewturk, I understand the motivation and have thought a bit about this before. If I understand correctly, you want to:

  1. Download the archive and check the hash against the registry
  2. Unpack the files
  3. Delete the archive
  4. Next time, don't check the hash anymore

This seems doable, but it creates a couple of problems that would have to be resolved:

  1. If the registry is updated, Pooch wouldn't catch that and re-download the archive. This is not a problem if you're setting version, since the data would be downloaded again when you make a new release (guaranteeing that you get the updated version).
  2. Right now, the download part has no knowledge of what the processor does. For this to work, we would have to introduce a mechanism for that, since we would need to check whether the archive has been downloaded in the past.
  3. The major difficulty is checking the hash once and then not again. Not checking at all would be straightforward: after deleting the archive, leave a placeholder text file with the same name so that Pooch thinks it's already there. But checking the hash once and then not again requires carrying that knowledge from one session to another (which would have to be done through files somehow).

Right now, the only part of Pooch that knows about the unpacked files is the processor. So it would make sense to have it interfere with the hash checking somehow. It would then check if the unpacked files exist and bypass the checking.
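For concreteness, the "placeholder" variant from item 3 could be sketched as a custom processor (processors are just callables taking the file name, the action, and the Pooch instance; the class name is made up):

```python
import os
import pooch

# Rough sketch, not a working solution: unpack with Untar, delete the
# archive, and leave an empty file with the same name so the archive
# appears to exist on later calls. The caveat from item 3 applies: the
# placeholder's hash will NOT match the registry, so this only helps if
# the hash check on already-existing files can be skipped.
class UntarAndDelete:
    def __init__(self):
        self.untar = pooch.Untar()

    def __call__(self, fname, action, pup):
        fnames = self.untar(fname, action, pup)
        if action in ("download", "update"):
            os.remove(fname)
            open(fname, "w").close()  # zero-size placeholder
        return fnames
```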

I'm not entirely sure how to implement this. Any suggestions would be welcome!

It's good that this came up right now since I'm refactoring the downloading code. I'll keep this in mind when going forward.

leouieda commented 4 years ago

One idea to make this work:

Add a new argument to fetch that disables the check for updates, so we can do if allow_updates and hash_matches(...). This would bypass the hash check if the file already exists while retaining the check right after download. We can then add an option to the processors to delete the archive and leave a text file placeholder. This would only work if the two options are used together, so the processor should issue a warning/exception if the action is "update" (indicating that the hash check failed).
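Roughly, the fetch-side decision would then become something like this (a sketch of the proposed logic only, not Pooch's actual code):

```python
import os
import pooch

# Sketch of the proposed allow_updates behaviour: an existing file is
# only re-checked against the registry hash when updates are allowed;
# a missing file is always downloaded (and verified right after).
def needs_download(fname, registry_hash, allow_updates=True):
    if not os.path.exists(fname):
        return True
    if allow_updates:
        # Re-download if the local copy no longer matches the registry.
        return pooch.file_hash(fname) != registry_hash
    # allow_updates=False: trust whatever is on disk, e.g. a placeholder
    # left behind by a processor that deleted the archive.
    return False
```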

drammock commented 3 years ago

This is something we also want over at MNE-Python. Right now we have wrapper code that does this but it is inelegant / error-prone.

An idea that occurs to me that might (?) be fairly easy to implement: have the processor write a text file (called .pooch_hash.txt or similar?) containing the hash of the archive before the archive is deleted. Then on subsequent requests, pooch.retrieve or pooch.Pooch.fetch could first check for the presence of the archive file; if it's missing, they could look for .pooch_hash.txt and check the hash stored in it (and if it matches, don't download).

Questions:

  1. where to store .pooch_hash.txt? in the same location as the archive, or inside the extract directory?
  2. if .pooch_hash.txt is in the same location as the archive, what happens when multiple archives are fetched to the same folder? Should there be a single .pooch_hashes.txt that has filename: hash mappings?

NB: This assumes that either the users haven't altered the actual files after extraction, or that they're OK with the files changing... but that is also true in the case where the user keeps the original archive file on disk so I'm not sure there's any new risk there.

leouieda commented 3 years ago

@drammock using a file with the hash of the archive would work and you're right that this wouldn't introduce new issues with unpacked content changing (we can think about that if it becomes an issue later).

I would prefer to have one text file per archive that it replaces (e.g., myarchive.zip -> myarchive.zip.pooch_hash). That makes the checking easier in the fetch code (if the archive is missing, check for f"{fname}.pooch_hash") and we don't have to keep a database file updated. If the archive is re-downloaded, the hash file is overwritten with a new one and everything is up to date. This would also avoid race conditions when multiple downloaders try to access the same text file at the same time (for parallel downloads).
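The fetch-side check could then look something along these lines (a sketch only, assuming the .pooch_hash suffix convention above and plain sha256 strings in the registry):

```python
import os
import pooch

# Sketch, not Pooch's actual code: if the archive is present, hash it as
# usual; if only the per-archive sidecar file is left, trust the hash
# that was recorded in it when the archive was deleted.
def archive_matches(fname, registry_hash):
    if os.path.exists(fname):
        return pooch.file_hash(fname) == registry_hash
    sidecar = f"{fname}.pooch_hash"
    if os.path.exists(sidecar):
        with open(sidecar) as f:
            return f.read().strip() == registry_hash
    return False
```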

Looking at the code, this would require:

With this implementation, Pooch.fetch and retrieve don't need to know whether the archive was deleted, so the user code is a lot simpler (a single option passed to the processor).
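A minimal sketch of what that single processor option could do (delete_archive is a hypothetical name, not something the processors accept today):

```python
import os
import pooch

# Hypothetical sketch: an Untar subclass with a delete_archive option
# that, after unpacking, records the archive's hash in a .pooch_hash
# sidecar file and removes the archive itself. fetch/retrieve would only
# need to consult the sidecar when the archive is gone.
class Untar(pooch.Untar):
    def __init__(self, delete_archive=False, **kwargs):
        super().__init__(**kwargs)
        self.delete_archive = delete_archive

    def __call__(self, fname, action, pup):
        fnames = super().__call__(fname, action, pup)
        if self.delete_archive and os.path.exists(fname):
            with open(fname + ".pooch_hash", "w") as f:
                f.write(pooch.file_hash(fname))
            os.remove(fname)
        return fnames
```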