fatiando / pooch

A friend to fetch your data files
https://www.fatiando.org/pooch

Fetching a `gzip` file from github results in a corrupted local file #338

Closed by hogru 1 year ago

hogru commented 1 year ago

Description of the problem:

This might be entirely on me, since I only discovered pooch today. I want to download a data file from a public GitHub repository (not mine) and decompress it. The problem is that the fetched file is much smaller (137 KB) than the file on GitHub (2.69 MB). When I download the file in a browser, I can decompress it without trouble. My guess is that I should fetch the file in a different way, but I can't figure out how. I hope there is an easy fix, assuming it's not a bug.

Full code that generated the error

There are more files in reality, but this reduced example reproduces the issue. The hash of the fetched file changes after each download (after deleting the local copy), but that might be a GitHub quirk or intended behavior.

import pooch

odie = pooch.create(
    path="./testdata",
    # base_url="https://github.com/ml-jku/mhn-react/blob/main/data/",
    base_url="https://github.com/ml-jku/mhn-react/blob/de0fda32f76f866835aa65a6ff857964302b2178/data/",
    registry={
        "USPTO_50k_MHN_prepro.csv.gz": None,  # Downloads from GitHub change the hash every time
    },
)
# Fetch every registered file and decompress it after download
for file in odie.registry:
    odie.fetch(file, processor=pooch.Decompress())

Full error message

gzip.BadGzipFile: Not a gzipped file (b'\n\n')

System information

Output of poetry show

absl-py                  1.4.0       Abseil Python Common Libraries, see ht...
aiohttp                  3.7.4.post0 Async http client/server framework (as...
anyio                    3.6.2       High level compatibility layer for mul...
appdirs                  1.4.4       A small Python module for determining ...
appnope                  0.1.3       Disable App Nap on macOS >= 10.9
argon2-cffi              21.3.0      The secure Argon2 password hashing alg...
argon2-cffi-bindings     21.2.0      Low-level CFFI bindings for Argon2
arrow                    1.2.3       Better dates & times for Python
asttokens                2.2.1       Annotate AST trees with source code po...
async-timeout            3.0.1       Timeout context manager for asyncio pr...
attrs                    22.2.0      Classes Without Boilerplate
backcall                 0.2.0       Specifications for callback functions ...
beautifulsoup4           4.11.1      Screen-scraping library
black                    22.12.0     The uncompromising code formatter.
bleach                   6.0.0       An easy safelist-based HTML-sanitizing...
cachetools               5.3.0       Extensible memoizing collections and d...
certifi                  2022.12.7   Python package for providing Mozilla's...
cffi                     1.15.1      Foreign Function Interface for Python ...
chardet                  4.0.0       Universal encoding detector for Python...
charset-normalizer       3.0.1       The Real First Universal Charset Detec...
click                    8.1.3       Composable command line interface toolkit
codetiming               1.4.0       A flexible, customizable timer for you...
colorama                 0.4.6       Cross-platform colored terminal text.
comm                     0.1.2       Jupyter Python Comm implementation, fo...
contourpy                1.0.7       Python library for calculating contour...
coverage                 7.1.0       Code coverage measurement for Python
cycler                   0.11.0      Composable style cycles
datasets                 2.8.0       HuggingFace community-driven open-sour...
debugpy                  1.6.6       An implementation of the Debug Adapter...
decorator                5.1.1       Decorators for Humans
defusedxml               0.7.1       XML bomb protection for Python stdlib ...
dill                     0.3.6       serialize all of python
docker-pycreds           0.4.0       Python bindings for the docker credent...
entrypoints              0.4         Discover and load entry points from in...
evaluate                 0.4.0       HuggingFace community-driven open-sour...
exceptiongroup           1.1.0       Backport of PEP 654 (exception groups)
executing                1.2.0       Get the currently executing AST node o...
fastjsonschema           2.16.2      Fastest Python implementation of JSON ...
filelock                 3.9.0       A platform independent file lock.
flake8                   6.0.0       the modular source code checker: pep8 ...
fonttools                4.38.0      Tools to manipulate font files
fqdn                     1.5.1       Validates fully-qualified domain names...
fsspec                   2023.1.0    File-system specification
fuzzywuzzy               0.18.0      Fuzzy string matching in python
gitdb                    4.0.10      Git Object Database
gitpython                3.1.30      GitPython is a python library used to ...
google-auth              2.16.0      Google Authentication Library
google-auth-oauthlib     0.4.6       Google Authentication Library
grpcio                   1.51.1      HTTP/2-based RPC framework
huggingface-hub          0.11.1      Client library to download and publish...
humanfriendly            10.0        Human friendly output for text interfa...
idna                     3.4         Internationalized Domain Names in Appl...
importlib-metadata       6.0.0       Read metadata from Python packages
iniconfig                2.0.0       brain-dead simple config-ini parsing
ipykernel                6.20.2      IPython Kernel for Jupyter
ipython                  8.8.0       IPython: Productive Interactive Computing
ipython-genutils         0.2.0       Vestigial utilities from IPython
ipywidgets               8.0.4       Jupyter interactive widgets
isoduration              20.11.0     Operations with ISO 8601 durations
isort                    5.11.4      A Python utility / library to sort Pyt...
jedi                     0.18.2      An autocompletion tool for Python that...
jinja2                   3.1.2       A very fast and expressive template en...
joblib                   1.2.0       Lightweight pipelining with Python fun...
jsonpointer              2.3         Identify specific nodes in a JSON docu...
jsonschema               4.17.3      An implementation of JSON Schema valid...
jupyter                  1.0.0       Jupyter metapackage. Install all the J...
jupyter-client           7.4.9       Jupyter protocol implementation and cl...
jupyter-console          6.4.4       Jupyter terminal console
jupyter-core             5.1.5       Jupyter core package. A base package o...
jupyter-events           0.6.3       Jupyter Event System library
jupyter-server           2.1.0       The backend—i.e. core services, APIs, ...
jupyter-server-terminals 0.4.4       A Jupyter Server Extension Providing T...
jupyterlab-pygments      0.2.2       Pygments theme using JupyterLab CSS va...
jupyterlab-widgets       3.0.5       Jupyter interactive widgets for Jupyte...
kiwisolver               1.4.4       A fast implementation of the Cassowary...
loguru                   0.6.0       Python logging made (stupidly) simple
markdown                 3.4.1       Python implementation of Markdown.
markdown-it-py           2.1.0       Python port of markdown-it. Markdown p...
markupsafe               2.1.2       Safely add untrusted strings to HTML/X...
matplotlib               3.6.3       Python plotting package
matplotlib-inline        0.1.6       Inline Matplotlib backend for Jupyter
mccabe                   0.7.0       McCabe checker, plugin for flake8
mdurl                    0.1.2       Markdown URL utilities
mistune                  2.0.4       A sane Markdown parser with useful plu...
multidict                6.0.4       multidict implementation
multiprocess             0.70.14     better multiprocessing and multithread...
mypy                     0.991       Optional static typing for Python
mypy-extensions          0.4.3       Experimental type system extensions fo...
nbclassic                0.4.8       A web-based notebook environment for i...
nbclient                 0.7.2       A client library for executing noteboo...
nbconvert                7.2.9       Converting Jupyter Notebooks
nbformat                 5.7.3       The Jupyter Notebook format
nest-asyncio             1.5.6       Patch asyncio to allow nested event loops
notebook                 6.5.2       A web-based notebook environment for i...
notebook-shim            0.2.2       A shim layer for notebook traits and c...
numpy                    1.24.1      Fundamental package for array computin...
oauthlib                 3.2.2       A generic, spec-compliant, thorough im...
packaging                23.0        Core utilities for Python packages
pandas                   1.5.3       Powerful data structures for data anal...
pandocfilters            1.5.0       Utilities for writing pandoc filters i...
parso                    0.8.3       A Python Parser
pathspec                 0.11.0      Utility library for gitignore style pa...
pathtools                0.1.2       File system general utilities
pexpect                  4.8.0       Pexpect allows easy control of interac...
pickleshare              0.7.5       Tiny 'shelve'-like database with concu...
pillow                   9.4.0       Python Imaging Library (Fork)
platformdirs             2.6.2       A small Python package for determining...
pluggy                   1.0.0       plugin and hook calling mechanisms for...
pooch                    1.6.0       "Pooch manages your Python library's s...
prometheus-client        0.16.0      Python client for the Prometheus monit...
prompt-toolkit           3.0.36      Library for building powerful interact...
protobuf                 3.20.3      Protocol Buffers
psutil                   5.9.4       Cross-platform lib for process and sys...
ptyprocess               0.7.0       Run a subprocess in a pseudo terminal
pure-eval                0.2.2       Safely evaluate AST nodes without side...
pyarrow                  10.0.1      Python library for Apache Arrow
pyasn1                   0.4.8       ASN.1 types and codecs
pyasn1-modules           0.2.8       A collection of ASN.1-based protocols ...
pycodestyle              2.10.0      Python style guide checker
pycparser                2.21        C parser in Python
pyflakes                 3.0.1       passive checker of Python programs
pygments                 2.14.0      Pygments is a syntax highlighting pack...
pyparsing                3.0.9       pyparsing module - Classes and methods...
pyrsistent               0.19.3      Persistent/Functional/Immutable data s...
pytdc                    0.3.8       Therapeutics Data Commons
pytest                   7.2.1       pytest: simple powerful testing with P...
pytest-cov               4.0.0       Pytest plugin for measuring coverage.
pytest-mock              3.10.0      Thin-wrapper around the mock package f...
python-dateutil          2.8.2       Extensions to the standard Python date...
python-json-logger       2.0.4       A python library adding a json log for...
pytz                     2022.7.1    World timezone definitions, modern and...
pyyaml                   6.0         YAML parser and emitter for Python
pyzmq                    25.0.0      Python bindings for 0MQ
qtconsole                5.4.0       Jupyter Qt console
qtpy                     2.3.0       Provides an abstraction layer on top o...
rdchiral                 1.1.0       Wrapper for RDKit's RunReactants to im...
rdkit-pypi               2022.9.4    A collection of chemoinformatics and m...
regex                    2022.10.31  Alternative regular expression module,...
requests                 2.28.2      Python HTTP for Humans.
requests-oauthlib        1.3.1       OAuthlib authentication support for Re...
responses                0.18.0      A utility library for mocking out the ...
rfc3339-validator        0.1.4       A pure python RFC3339 validator
rfc3986-validator        0.1.1       Pure python rfc3986 validator
rich                     13.2.0      Render rich text, tables, progress bar...
rsa                      4.9         Pure-Python RSA implementation
scikit-learn             1.2.1       A set of python modules for machine le...
scipy                    1.10.0      Fundamental algorithms for scientific ...
seaborn                  0.12.2      Statistical data visualization
send2trash               1.8.0       Send file to trash natively under Mac ...
sentry-sdk               1.14.0      Python client for Sentry (https://sent...
setproctitle             1.3.2       A Python module to customize the proce...
setuptools               66.1.1      Easily download, build, install, upgra...
six                      1.16.0      Python 2 and 3 compatibility utilities
smmap                    5.0.0       A pure Python implementation of a slid...
sniffio                  1.3.0       Sniff out which async library your cod...
soupsieve                2.3.2.post1 A modern CSS selector implementation f...
stack-data               0.6.2       Extract data from python stack frames ...
subset                   0.1.2       A cli-based word game
tensorboard              2.11.2      TensorBoard lets you watch Tensors Flow
tensorboard-data-server  0.6.1       Fast data loading for TensorBoard
tensorboard-plugin-wit   1.8.1       What-If Tool TensorBoard plugin.
terminado                0.17.1      Tornado websocket backend for the Xter...
threadpoolctl            3.1.0       threadpoolctl
tinycss2                 1.2.1       A tiny CSS parser
tokenizers               0.13.2      Fast and Customizable Tokenizers
tomli                    2.0.1       A lil' TOML parser
torch                    1.13.1      Tensors and Dynamic neural networks in...
torchmetrics             0.11.0      PyTorch native Metrics
tornado                  6.2         Tornado is a Python web framework and ...
tqdm                     4.64.1      Fast, Extensible Progress Meter
traitlets                5.8.1       Traitlets Python configuration system
transformers             4.26.0      State-of-the-art Machine Learning for ...
typing-extensions        4.4.0       Backported and Experimental Type Hints...
uri-template             1.2.0       RFC 6570 URI Template Processor
urllib3                  1.26.14     HTTP library with thread-safe connecti...
wandb                    0.13.9      A CLI and library for interacting with...
wcwidth                  0.2.6       Measures the displayed width of unicod...
webcolors                1.12        A library for working with color names...
webencodings             0.5.1       Character encoding aliases for legacy ...
websocket-client         1.4.2       WebSocket client for Python with low l...
werkzeug                 2.2.2       The comprehensive WSGI web application...
wheel                    0.38.4      A built-package format for Python
widgetsnbextension       4.0.5       Jupyter interactive widgets for Jupyte...
xxhash                   3.2.0       Python binding for xxHash
yarl                     1.8.2       Yet another URL library
zipp                     3.11.0      Backport of pathlib-compatible object ...

santisoler commented 1 year ago

Hi @hogru! Thanks for opening this issue.

The problem is that the URL you are using points not to the file itself but to the GitHub page that lets you download it. If you open the downloaded file in a text editor (or use `cat` in the terminal), you'll see that you downloaded an HTML file.
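A quick way to confirm this without opening the file is to look at its first two bytes. This is only a minimal sketch: `looks_like_gzip` is a hypothetical helper, not part of pooch. It relies on the fact that every gzip stream starts with the bytes `1f 8b`.

```python
GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def looks_like_gzip(path):
    """Return True if the file at `path` starts with the gzip magic number.

    Hypothetical helper for illustration; it only inspects the header and
    does not validate the rest of the file.
    """
    with open(path, "rb") as handle:
        return handle.read(2) == GZIP_MAGIC
```

For the file from the report above, this would return `False`, since the error message shows the file starting with `b'\n\n'` rather than the gzip header.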

You can tell by looking at the URL: if it contains `blob`, it points to the page and not to the file. You want it to say `raw` instead. For example, this URL actually downloads the file: https://github.com/ml-jku/mhn-react/raw/de0fda32f76f866835aa65a6ff857964302b2178/data/USPTO_50k_MHN_prepro.csv.gz
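If you have several such links, the swap can be scripted. The sketch below assumes the URL follows the usual `github.com/<owner>/<repo>/blob/<ref>/...` layout. `blob_to_raw` is a hypothetical helper, not something pooch provides.

```python
def blob_to_raw(url):
    """Replace the first '/blob/' path segment with '/raw/'.

    Hypothetical helper: it assumes '/blob/' first appears right after
    the repository name, as in standard GitHub file page URLs.
    """
    return url.replace("/blob/", "/raw/", 1)


page_url = (
    "https://github.com/ml-jku/mhn-react/blob/"
    "de0fda32f76f866835aa65a6ff857964302b2178/data/"
)
print(blob_to_raw(page_url))
```

Only the first occurrence is replaced, so a directory named `blob` deeper in the path is left alone.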

When creating your pooch, you need to use the URL that contains `raw`. See one of the first example snippets in our docs: https://www.fatiando.org/pooch/latest/sample-data.html#basic-setup

The following snippet should work for you:

import pooch

odie = pooch.create(
    path="./testdata",
    # base_url="https://github.com/ml-jku/mhn-react/blob/main/data/",
    base_url="https://github.com/ml-jku/mhn-react/raw/de0fda32f76f866835aa65a6ff857964302b2178/data/",
    registry={
        "USPTO_50k_MHN_prepro.csv.gz": None,  # None skips the hash check
    },
)
# Fetch every registered file and decompress it after download
for file in odie.registry:
    odie.fetch(file, processor=pooch.Decompress())

Let me know if that works for you and I will close this issue. Thanks for reaching out!

hogru commented 1 year ago

Hi @santisoler,

A big thank you for such a quick, thorough, and kind response :-) despite my having the wrong URL, a clear case of RTFM ;-) This works, of course, and it also solves the "issue" of the changing hashes.

I will add a check on the file extension (['.xz', '.gz', '.bz2']) before fetch() to decide whether I need Decompress().
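A rough sketch of what that check might look like. The names `needs_decompress` and `fetch_maybe_decompress` are made up for illustration, and `pup` is assumed to be a `pooch.Pooch` instance such as `odie` above.

```python
from pathlib import Path

# Extensions that should trigger decompression (the list from this comment)
COMPRESSED_SUFFIXES = {".xz", ".gz", ".bz2"}


def needs_decompress(fname):
    """Return True if the final extension of `fname` marks a compressed file."""
    return Path(fname).suffix.lower() in COMPRESSED_SUFFIXES


def fetch_maybe_decompress(pup, fname):
    """Fetch `fname` from `pup`, decompressing only when the extension calls for it."""
    import pooch  # imported here so needs_decompress() works without pooch installed

    processor = pooch.Decompress() if needs_decompress(fname) else None
    return pup.fetch(fname, processor=processor)
```

With this, the loop would become `for file in odie.registry: fetch_maybe_decompress(odie, file)`. Only the last extension is checked, so a name like `archive.tar.gz` would be decompressed to a tar file and not unpacked further.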

Thanks again!