hogru closed this issue 1 year ago
Hi @hogru! Thanks for opening this issue.
The problem is that the URL you are using doesn't point to the file on GitHub, but to the GitHub page that lets you download it. If you open the downloaded file with a text editor (or use cat from the terminal), you'll see that you downloaded an HTML file.
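As a quick way to run the same check from Python, here is a minimal sketch that looks at the first two bytes of the downloaded file. It relies only on the fact that gzip files start with the bytes 0x1f 0x8b; the function names are made up for this illustration and are not part of pooch.

```python
# Hypothetical helpers (not part of pooch): check whether a download is
# really gzip data rather than, for example, an HTML page.

GZIP_MAGIC = b"\x1f\x8b"  # every gzip stream starts with these two bytes


def starts_with_gzip_magic(head: bytes) -> bool:
    """Return True if the given leading bytes look like a gzip header."""
    return head[:2] == GZIP_MAGIC


def file_is_gzip(path: str) -> bool:
    """Read the first two bytes of a file and test them for the gzip magic."""
    with open(path, "rb") as f:
        return starts_with_gzip_magic(f.read(2))
```

An HTML page saved under a .gz name fails this check, which would match the symptom described in this issue.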
You can easily tell by looking at the URL: if it contains blob, it points to the page and not to the file. You want it to say raw instead. For example, this URL actually downloads the file: https://github.com/ml-jku/mhn-react/raw/de0fda32f76f866835aa65a6ff857964302b2178/data/USPTO_50k_MHN_prepro.csv.gz
When creating your pooch, you need to use the URL that contains raw. See one of the first example snippets in our docs: https://www.fatiando.org/pooch/latest/sample-data.html#basic-setup
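If you have several blob links to convert, the swap can be sketched in a few lines. This is only an illustration: github_blob_to_raw is a hypothetical helper, not a pooch function, and it assumes the usual https://github.com/OWNER/REPO/blob/REF/PATH layout.

```python
# Hypothetical helper (not part of pooch): rewrite a GitHub "blob" page URL
# into the matching "raw" URL that serves the file itself.

def github_blob_to_raw(url: str) -> str:
    prefix = "https://github.com/"
    if not url.startswith(prefix):
        raise ValueError(f"not a github.com URL: {url}")
    parts = url[len(prefix):].split("/")
    # Assumed layout: <owner>/<repo>/blob/<ref>/<path...>
    if len(parts) >= 4 and parts[2] == "blob":
        parts[2] = "raw"
    return prefix + "/".join(parts)


print(github_blob_to_raw("https://github.com/ml-jku/mhn-react/blob/main/data/"))
```

Only the third path segment is replaced, so a file or folder that happens to be named blob deeper in the path is left alone.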
The following snippet should work for you:
    import pooch

    odie = pooch.create(
        path="./testdata",
        # base_url="https://github.com/ml-jku/mhn-react/blob/main/data/",
        base_url="https://github.com/ml-jku/mhn-react/raw/de0fda32f76f866835aa65a6ff857964302b2178/data/",
        registry={
            "USPTO_50k_MHN_prepro.csv.gz": None,  # Downloads from github change the hash code every time
        },
    )
    # Fetch every registered file and decompress it after download
    for file in odie.registry:
        odie.fetch(file, processor=pooch.Decompress())
Let me know if that works for you and I will close this issue. Thanks for reaching out!
Hi @santisoler,
a big thank you for such a quick and thorough response and being kind :-) despite me having the wrong url, also an instance of RTFM ;-) This works of course and also solves the "issue" of the changing hash codes.
I will add a check before fetch() about the file extension (['.xz', '.gz', '.bz2']) to decide whether I need the Decompress().
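A rough sketch of what such a check could look like, using only the standard library. The name needs_decompress is invented for this example, and the extension list is the one mentioned above.

```python
from pathlib import Path

# Extensions mentioned in this thread as candidates for Decompress()
COMPRESSED_SUFFIXES = {".xz", ".gz", ".bz2"}


def needs_decompress(fname: str) -> bool:
    """Return True if the file name ends in one of the compressed suffixes."""
    return Path(fname).suffix.lower() in COMPRESSED_SUFFIXES


print(needs_decompress("USPTO_50k_MHN_prepro.csv.gz"))
print(needs_decompress("notes.txt"))
```

Inside the fetch loop this could be used as odie.fetch(file, processor=pooch.Decompress() if needs_decompress(file) else None), since None is the default value of the processor argument.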
Thanks again!
Description of the problem:
So, this might be totally on me since I only found pooch today. I want to download a data file from a public GitHub repository (not mine) and decompress it. The issue is that the fetched file is much smaller (137KB) than the file on GitHub (2.69MB). When I download the file in a browser, I can easily decompress it. So my guess is that I should fetch the file in a different way, but I can't figure out how. I hope there is an easy fix, assuming it's not a bug?
Full code that generated the error
There are more files in reality, but this way I can reproduce the issue. The hash code of the fetched file changes after each download (after deleting the file locally). But this might be a GitHub issue or work as intended.
Full error message
System information
conda for Python install only, poetry for package install
conda list below:
Output of poetry show