Closed leouieda closed 5 years ago
@santisoler what do you think?
Let me see if I understood. You're saying that instead of having our local path
as:
mypackage
|--> v0.1
| |-> data_file.dat
|--> v0.2
| |-> data_file.dat
|--> v0.3
|-> data_file.dat
We should just need to have this?
mypackage
|--> data_file.dat
If that is the question, it's kind of tricky. On one side, we don't want to download all datasets just because we made a new release of the package. But on the other hand, we may want to keep track of different versions of the data in case it changes from one release to the other, in which it may be a good idea to store both versions in case we want to go back and do something with the old one.
I think the second scenario is very rare, and I would choose to don't version the data cache. Considering that decision, I would recommend adding a warning in case the local file hash does not match the one on the registry, i.e. the file that will be fetched is overriding the one in the local directory.
For example, say we build two Pooch
s: one downloads Pooch's README.rst
and the other one fetches the Harmonica's README.rst
:
import pooch
POOCH = pooch.create(
path=pooch.os_cache("readmes"),
base_url="https://github.com/fatiando/pooch/raw/master/",
registry={
"README.rst": "196942bef248512ee54bf4d78a23d156acb411f85054c4cf7a7f786d3e81f710"
}
)
HARMONICA = pooch.create(
path=pooch.os_cache("readmes"),
base_url="https://github.com/fatiando/harmonica/raw/master/",
registry={
"README.rst": "2a9b6a411d92670187642d4edd4cb2ff12eec4511f5c928ac7403262101ad3dc"
}
)
Let's download first the Pooch's README.rst
:
POOCH.fetch("README.rst")
This creates the readmes
directory in our cache directory and fetches the desired README.md
.
Then we download the Harmonica's README.rst
:
HARMONICA.fetch("README.rst")
This overrides the Pooch's README.rst
by default without notifying the user.
Maybe we should add a warning and/or an option for overriding the local file (and maybe making a backup of the old file) to Pooch.fetch()
.
We should just need to have this?
@santisoler that's it exactly.
I think the second scenario is very rare, and I would choose to don't version the data cache.
That was my thinking. I think it will be a lot more common for people to have problems with multiple copies of the same big files than the files updating without changing the file name (which is the only way that this would break).
I would recommend adding a warning in case the local file hash does not match the one on the registry, i.e. the file that will be fetched is overriding the one in the local directory.
Pooch already does this: https://github.com/fatiando/pooch/blob/master/pooch/core.py#L264
Maybe we should add a warning and/or an option for overriding the local file (and maybe making a backup of the old file)
That's an interesting idea. The warning is already there but there could be a more explicit message saying that file will be overwritten (which is implied by the word "Updating"). I like the idea of telling Pooch to raise an exception instead of updating a file. This could be an option Pooch.fetch(..., update=False)
or something that is set globally on Pooch.__init__
and create
. If that's the case, we could add a message saying "To download a new version of this file, delete the existing one at {}." I think I like the global option better because it would be easy to forget to add update=False
and it would be hard to catch this bug in the tests.
This is such a brilliant idea that I apparently already implemented it 5 months ago: https://github.com/fatiando/rockhound/blob/master/rockhound/registry.py#L10
Description of the desired feature
Maybe we shouldn't be versioning the data cache location. This triggers new downloads of all datasets every time we get a new release. Since these datasets aren't likely to change without releasing a new version, we probably don't need this.