fatiando / rockhound

NOTICE: This library is no longer being developed. Use Ensaio instead (https://www.fatiando.org/ensaio). -- Download geophysical models/datasets and load them in Python
BSD 3-Clause "New" or "Revised" License
34 stars 15 forks source link

Stop versioning the data cache directory #18

Closed leouieda closed 5 years ago

leouieda commented 5 years ago

Description of the desired feature

Maybe we shouldn't be versioning the data cache location. This triggers new downloads of all datasets every time we get a new release. Since these datasets aren't likely to change without releasing a new version, we probably don't need this.

leouieda commented 5 years ago

@santisoler what do you think?

santisoler commented 5 years ago

Let me see if I understood. You're saying that instead of having our local path as:

mypackage
  |--> v0.1
  |      |-> data_file.dat
  |--> v0.2
  |      |-> data_file.dat
  |--> v0.3
         |-> data_file.dat

We should just need to have this?

mypackage
  |--> data_file.dat

If that is the question, it's kind of tricky. On one side, we don't want to download all datasets just because we made a new release of the package. But on the other hand, we may want to keep track of different versions of the data in case it changes from one release to the other, in which it may be a good idea to store both versions in case we want to go back and do something with the old one.

I think the second scenario is very rare, and I would choose to don't version the data cache. Considering that decision, I would recommend adding a warning in case the local file hash does not match the one on the registry, i.e. the file that will be fetched is overriding the one in the local directory.

For example, say we build two Poochs: one downloads Pooch's README.rst and the other one fetches the Harmonica's README.rst:

import pooch

POOCH = pooch.create(
    path=pooch.os_cache("readmes"),
    base_url="https://github.com/fatiando/pooch/raw/master/",
    registry={
        "README.rst": "196942bef248512ee54bf4d78a23d156acb411f85054c4cf7a7f786d3e81f710"
    }
)

HARMONICA = pooch.create(
    path=pooch.os_cache("readmes"),
    base_url="https://github.com/fatiando/harmonica/raw/master/",
    registry={
        "README.rst": "2a9b6a411d92670187642d4edd4cb2ff12eec4511f5c928ac7403262101ad3dc"
    }
)

Let's download first the Pooch's README.rst:

POOCH.fetch("README.rst")

This creates the readmes directory in our cache directory and fetches the desired README.md.

Then we download the Harmonica's README.rst:

HARMONICA.fetch("README.rst")

This overrides the Pooch's README.rst by default without notifying the user. Maybe we should add a warning and/or an option for overriding the local file (and maybe making a backup of the old file) to Pooch.fetch().

leouieda commented 5 years ago

We should just need to have this?

@santisoler that's it exactly.

I think the second scenario is very rare, and I would choose to don't version the data cache.

That was my thinking. I think it will be a lot more common for people to have problems with multiple copies of the same big files than the files updating without changing the file name (which is the only way that this would break).

I would recommend adding a warning in case the local file hash does not match the one on the registry, i.e. the file that will be fetched is overriding the one in the local directory.

Pooch already does this: https://github.com/fatiando/pooch/blob/master/pooch/core.py#L264

Maybe we should add a warning and/or an option for overriding the local file (and maybe making a backup of the old file)

That's an interesting idea. The warning is already there but there could be a more explicit message saying that file will be overwritten (which is implied by the word "Updating"). I like the idea of telling Pooch to raise an exception instead of updating a file. This could be an option Pooch.fetch(..., update=False) or something that is set globally on Pooch.__init__ and create. If that's the case, we could add a message saying "To download a new version of this file, delete the existing one at {}." I think I like the global option better because it would be easy to forget to add update=False and it would be hard to catch this bug in the tests.

leouieda commented 5 years ago

This is such a brilliant idea that I apparently already implemented it 5 months ago: https://github.com/fatiando/rockhound/blob/master/rockhound/registry.py#L10