EpistasisLab / pmlb

PMLB: A large, curated repository of benchmark datasets for evaluating supervised machine learning algorithms.
https://epistasislab.github.io/pmlb/
MIT License
805 stars · 135 forks

Datasets no longer accessible. #168

Closed CharlesEdisonTripp closed 1 year ago

CharlesEdisonTripp commented 1 year ago

Hello,

It appears that pmlb datasets are no longer accessible. Most likely this is because the repository ran out of LFS bandwidth quota.

Here are three ways to reproduce the error:

git clone https://github.com/EpistasisLab/pmlb.git pmlb
Cloning into 'pmlb'...
remote: Enumerating objects: 15906, done.
remote: Counting objects: 100% (2171/2171), done.
remote: Compressing objects: 100% (863/863), done.
remote: Total 15906 (delta 1273), reused 2124 (delta 1246), pack-reused 13735
Receiving objects: 100% (15906/15906), 238.71 MiB | 7.40 MiB/s, done.
Resolving deltas: 100% (9915/9915), done.
Downloading datasets/1027_ESL/1027_ESL.tsv.gz (1.5 KB)
Error downloading object: datasets/1027_ESL/1027_ESL.tsv.gz (cb678bb): Smudge error: Error downloading datasets/1027_ESL/1027_ESL.tsv.gz (cb678bbf79f2daf10d64ecbafab7019ead486dad2e925ba4bdab445a4886fc79): batch response: This repository is over its data quota. Account responsible for LFS bandwidth should purchase more data packs to restore access.

Errors logged to /home/ctripp/project/dmp/src/pmlb/.git/lfs/logs/20230114T012350.023068407.log
Use `git lfs logs last` to view the log.
error: external filter 'git-lfs filter-process' failed
fatal: datasets/1027_ESL/1027_ESL.tsv.gz: smudge filter lfs failed
warning: Clone succeeded, but checkout failed.
You can inspect what was checked out with 'git status'
and retry with 'git restore --source=HEAD :/'
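For context on what a failed smudge leaves behind: when the LFS quota is exhausted, the checked-out `.tsv.gz` files are actually small Git LFS pointer text files, not gzipped data. A sketch (the pointer contents below are illustrative; the oid is taken from the clone log above, and the size is an assumption):

```shell
# A Git LFS pointer file is plain text that looks like this:
cat > 1027_ESL.tsv.gz <<'EOF'
version https://git-lfs.github.com/spec/v1
oid sha256:cb678bbf79f2daf10d64ecbafab7019ead486dad2e925ba4bdab445a4886fc79
size 1536
EOF

# Its first two bytes are "ve" (from "version"), not the gzip magic 1f 8b --
# exactly the b've' that gzip.BadGzipFile reports further down this thread:
head -c 2 1027_ESL.tsv.gz

# One way to let a clone complete despite the quota error is to skip the
# smudge step entirely (pointer files are checked out instead of data):
#   GIT_LFS_SKIP_SMUDGE=1 git clone https://github.com/EpistasisLab/pmlb.git
# then run `git lfs pull` once the quota is restored.
```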
>>> raw_inputs, raw_outputs = pmlb.fetch_data('201_pol', return_X_y=True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ctripp/project/dmp/env/dmp/lib/python3.9/site-packages/pmlb/pmlb.py", line 68, in fetch_data
    dataset = pd.read_csv(dataset_url, sep='\t', compression='gzip')
  File "/home/ctripp/project/dmp/env/dmp/lib/python3.9/site-packages/pandas/util/_decorators.py", line 311, in wrapper
    return func(*args, **kwargs)
  File "/home/ctripp/project/dmp/env/dmp/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 680, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/home/ctripp/project/dmp/env/dmp/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 575, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/home/ctripp/project/dmp/env/dmp/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 933, in __init__
    self._engine = self._make_engine(f, self.engine)
  File "/home/ctripp/project/dmp/env/dmp/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 1231, in _make_engine
    return mapping[engine](f, **self.options)
  File "/home/ctripp/project/dmp/env/dmp/lib/python3.9/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 75, in __init__
    self._reader = parsers.TextReader(src, **kwds)
  File "pandas/_libs/parsers.pyx", line 544, in pandas._libs.parsers.TextReader.__cinit__
  File "pandas/_libs/parsers.pyx", line 633, in pandas._libs.parsers.TextReader._get_header
  File "pandas/_libs/parsers.pyx", line 847, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas/_libs/parsers.pyx", line 1952, in pandas._libs.parsers.raise_parser_error
  File "/home/ctripp/project/dmp/env/dmp/lib/python3.9/_compression.py", line 68, in readinto
    data = self.read(len(byte_view))
  File "/home/ctripp/project/dmp/env/dmp/lib/python3.9/gzip.py", line 487, in read
    if not self._read_gzip_header():
  File "/home/ctripp/project/dmp/env/dmp/lib/python3.9/gzip.py", line 435, in _read_gzip_header
    raise BadGzipFile('Not a gzipped file (%r)' % magic)
gzip.BadGzipFile: Not a gzipped file (b've')
>>> pandas.read_csv('https://github.com/EpistasisLab/pmlb/raw/master/datasets/201_pol/201_pol.tsv.gz')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ctripp/project/dmp/env/dmp/lib/python3.9/site-packages/pandas/util/_decorators.py", line 311, in wrapper
    return func(*args, **kwargs)
  File "/home/ctripp/project/dmp/env/dmp/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 680, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/home/ctripp/project/dmp/env/dmp/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 575, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/home/ctripp/project/dmp/env/dmp/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 933, in __init__
    self._engine = self._make_engine(f, self.engine)
  File "/home/ctripp/project/dmp/env/dmp/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 1231, in _make_engine
    return mapping[engine](f, **self.options)
  File "/home/ctripp/project/dmp/env/dmp/lib/python3.9/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 75, in __init__
    self._reader = parsers.TextReader(src, **kwds)
  File "pandas/_libs/parsers.pyx", line 544, in pandas._libs.parsers.TextReader.__cinit__
  File "pandas/_libs/parsers.pyx", line 633, in pandas._libs.parsers.TextReader._get_header
  File "pandas/_libs/parsers.pyx", line 847, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas/_libs/parsers.pyx", line 1952, in pandas._libs.parsers.raise_parser_error
  File "/home/ctripp/project/dmp/env/dmp/lib/python3.9/_compression.py", line 68, in readinto
    data = self.read(len(byte_view))
  File "/home/ctripp/project/dmp/env/dmp/lib/python3.9/gzip.py", line 487, in read
    if not self._read_gzip_header():
  File "/home/ctripp/project/dmp/env/dmp/lib/python3.9/gzip.py", line 435, in _read_gzip_header
    raise BadGzipFile('Not a gzipped file (%r)' % magic)
gzip.BadGzipFile: Not a gzipped file (b've')
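The `b've'` in both tracebacks is the tell: GitHub is serving the LFS pointer text (which begins `version https://git-lfs...`) instead of the real gzip file (which begins with the magic bytes `1f 8b`). A minimal sketch of how a caller could detect this before handing the bytes to gzip; `fetch_tsv_gz` is a hypothetical helper, not part of the pmlb API:

```python
import gzip
import urllib.request

LFS_POINTER_PREFIX = b"version https://git-lfs"
GZIP_MAGIC = b"\x1f\x8b"

def fetch_tsv_gz(url):
    """Download a .tsv.gz and fail with a clear message if the server
    returned a Git LFS pointer (quota exceeded) instead of the file."""
    data = urllib.request.urlopen(url).read()
    if data.startswith(LFS_POINTER_PREFIX):
        raise RuntimeError(
            "Received a Git LFS pointer, not the dataset; the repository "
            "is likely over its LFS bandwidth quota.")
    if not data.startswith(GZIP_MAGIC):
        raise RuntimeError(f"Unexpected magic bytes {data[:2]!r}")
    return gzip.decompress(data)
```

A pointer response starts with `b've'`, which is precisely what `BadGzipFile` echoes back above.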

Thank you. This repository has played a starring role in my research efforts.

463539713 commented 1 year ago

I also encountered this problem. When I cleared the cache and tried to run again, I couldn't get the data anymore 😂

trangdata commented 1 year ago

@JDRomano2 @lacava Could you two take a look, please? I'm away till the end of the month. Thank you both! 🌈

dhimmel commented 1 year ago

We might need to renew the education vouchers. Someone with billing access to @EpistasisLab probably needs to look at the current quota and this issue.

PaulWangCS commented 1 year ago

@CharlesEdisonTripp Problem solved.

dhimmel commented 1 year ago

> Problem solved

Nice. Out of curiosity, what was the solution? (in case the problem arises again)

PaulWangCS commented 1 year ago

> Problem solved
>
> Nice. Out of curiosity, what was the solution? (in case the problem arises again)

We had one LFS data pack, which only allows 50 GB of data traffic. Now we have raised it to 3 packs.