Open matthewturk opened 4 years ago
Hi @matthewturk, I understand the motivation and have thought a bit about this before. If I understand correctly, you want to:
This seems doable but it creates a couple of problems that would have to be resolved:

1. Pooch uses the hash of the archive to check whether the file changed on the server, so deleting the archive means losing the ability to detect updates. This might be fine if you use a `version`, since it would be downloaded again when you make a new release (guaranteeing that you get the updated version).
2. Pooch itself doesn't know about the unpacked files; only the `processor` does. For this to work, we would have to introduce a mechanism for that since we would have to check if the archive has been downloaded in the past.

Right now, the only part of Pooch that knows about the unpacked files is the processor. So it would make sense to have it interfere with the hash checking somehow. It would then check if the unpacked files exist and bypass the checking.
I'm not entirely sure how to implement this. Any suggestions would be welcome!
It's good that this came up right now since I'm refactoring the downloading code. I'll keep this in mind when going forward.
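The processor-side bypass described above could look roughly like this. This is a pure-Python sketch, not Pooch's actual API: the function name `processor_bypass` and its contract (return `True` when the unpacked files already exist, so the hash check can be skipped) are hypothetical.

```python
from pathlib import Path


def processor_bypass(extract_dir):
    """Hypothetical check a processor could run: if the extraction
    directory exists and is non-empty, assume the archive was already
    downloaded and unpacked, and signal fetch to skip the hash check."""
    extract_dir = Path(extract_dir)
    return extract_dir.is_dir() and any(extract_dir.iterdir())
```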
One idea to make this work: add a new argument to `fetch` that disables the check for updates, so we can do `if allow_updates and hash_matches(...)`. This would bypass the hash check if the file already exists while retaining the check right after download. We can then add an option to the processors to delete the archive and maintain a text file placeholder. This would only work if the two options are used together, so the processor should issue a warning/exception if the `action` is `"update"` (indicating that the hash check failed).
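A minimal sketch of that gating logic, assuming a hypothetical `allow_updates` flag (this is not Pooch's real `fetch` signature, just the decision it would make):

```python
import hashlib
from pathlib import Path


def needs_download(path, known_hash, allow_updates=True):
    """Decide whether a file must be (re)downloaded.

    Hypothetical sketch of the proposed `allow_updates` argument:
    when False, an existing file is trusted without re-checking its
    hash; the hash check right after a fresh download would still
    happen elsewhere.
    """
    path = Path(path)
    if not path.exists():
        return True
    if not allow_updates:
        # Bypass the update check: the file is there, trust it.
        return False
    actual = hashlib.sha256(path.read_bytes()).hexdigest()
    return actual != known_hash
```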
This is something we also want over at MNE-Python. Right now we have wrapper code that does this but it is inelegant / error-prone.
An idea that occurs to me that might (?) be fairly easy to implement is for the processor to write a text file (called `.pooch_hash.txt` or similar?) containing the hash of the archive before it is deleted. Then on subsequent requests, `pooch.retrieve` or `pooch.Pooch.fetch` could first check for the presence of the archive file; if it's missing, they could look for `.pooch_hash.txt` and check the hash stored in it (and if it matches, don't download).
Questions:

- Where should `.pooch_hash.txt` go? In the same location as the archive, or inside the extract directory?
- If `.pooch_hash.txt` is in the same location as the archive, what happens when multiple archives are fetched to the same folder? Should there be a single `.pooch_hashes.txt` that has filename: hash mappings?

NB: This assumes that either the users haven't altered the actual files after extraction, or that they're OK with the files changing... but that is also true in the case where the user keeps the original archive file on disk, so I'm not sure there's any new risk there.
@drammock using a file with the hash of the archive would work and you're right that this wouldn't introduce new issues with unpacked content changing (we can think about that if it becomes an issue later).
I would prefer to have one text file per archive that is replaced (e.g., `myarchive.zip` -> `myarchive.zip.pooch_hash`). That makes the checking easier in the fetch code (if the archive is missing, check for `f"{fname}.pooch_hash"`) and we don't have to keep a database file updated. If the archive is re-downloaded, the hash file is overwritten with a new one and everything is up to date. This would also avoid race conditions when multiple downloaders try to access the text file at the same time (for parallel downloads).
Looking at the code, this would require changes to:

- `download_action`: https://github.com/fatiando/pooch/blob/bc32d4eecec115e1fdf9bd4e306df5a6c22661fd/pooch/core.py#L663
- `ExtractorProcessor`: https://github.com/fatiando/pooch/blob/bc32d4eecec115e1fdf9bd4e306df5a6c22661fd/pooch/processors.py#L22

With this implementation, `Pooch.fetch` and `retrieve` don't need to know if the archive is deleted or not, so the user code is a lot simpler (a single option passed to the processor).
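How the stamp could fold into the decision logic, as a rough sketch. This is not the real signature or body of `pooch.core.download_action`; only the three action strings ("download", "update", "fetch") mirror the discussion above, and the stamp handling is the hypothetical addition.

```python
import hashlib
from pathlib import Path


def download_action(path, known_hash):
    """Hypothetical sketch of extending the download decision:
    return "download" (never fetched), "update" (hash mismatch),
    or "fetch" (local copy, or a trusted stamp, is up to date)."""
    path = Path(path)
    stamp = Path(f"{path}.pooch_hash")
    if not path.exists():
        if stamp.exists() and stamp.read_text().strip() == known_hash:
            # Archive was deleted on purpose; the stamp vouches for it.
            return "fetch"
        return "download"
    if hashlib.sha256(path.read_bytes()).hexdigest() != known_hash:
        return "update"
    return "fetch"
```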
Description of the desired feature
For big archives, it is sometimes desirable to delete the initial archive file (against which the hash is computed), retain the extracted files, and not require a full re-download the next time it's used.
Granted, this opens up a vector of corruption, where the uncompressed files might be modified, but I think this is unlikely to be a problem.
It might be possible to do this by having a zero-size stamp or something that says, "verified," but I don't really know what would fit best.
A common pattern in yt is to fetch and then extract a `.tar.gz` file. But it's a couple hundred megs, so sometimes we might want to kill the intermediate archive once it's extracted, so that it doesn't double up on storage requirements. (One could even imagine a `remove_intermediate` option!) We'd have to record that the file doesn't need to be re-obtained, though. Does this seem like a possibility?
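An end-to-end, Pooch-free sketch of that workflow, assuming the hypothetical `remove_intermediate` option and a `{archive}.pooch_hash` stamp convention (none of this is existing Pooch API):

```python
import hashlib
import tarfile
from pathlib import Path


def extract_and_cleanup(archive, out_dir, remove_intermediate=False):
    """Extract a .tar.gz archive; optionally delete the archive
    afterwards, leaving a stamp file with its hash so a later run
    can tell the data is already present."""
    archive = Path(archive)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(out_dir)
    if remove_intermediate:
        digest = hashlib.sha256(archive.read_bytes()).hexdigest()
        Path(f"{archive}.pooch_hash").write_text(digest + "\n")
        archive.unlink()  # free the couple-hundred-meg archive
```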
Are you willing to help implement and maintain this feature? Yes