EleutherAI / the-pile

MIT License

Cannot download data, error #108

Open infokng opened 1 year ago

infokng commented 1 year ago

Hi team,

Please see the trace below for the datasets that fail to download. I am commenting out each one that fails and trying the next, but each attempt throws the error below. The download links work fine, yet the script is not able to find a working source, which is very strange.

Finding source for components/stackexchange/stackexchange_dataset.tar
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████| 36.8G/36.8G [5:36:23<00:00, 1.82Mbyte/s]
 85%|█████████████████████████████████████████████████████████████████████████████████████████████████▏                | 31.4G/36.8G [5:45:25<59:40, 1.51Mbyte
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 36.8G/36.8G [15:41<00:00, 39.1Mbyte
Traceback (most recent call last):
  File "/mnt/the_pile/utils.py", line 50, in download
    tar_xf(fname)
  File "/mnt/the_pile/utils.py", line 72, in tar_xf
    tf = tarfile.open(x)
  File "/usr/lib/python3.8/tarfile.py", line 1603, in open
    return func(name, "r", fileobj, **kwargs)
  File "/usr/lib/python3.8/tarfile.py", line 1667, in gzopen
    fileobj = GzipFile(name, mode + "b", compresslevel, fileobj)
  File "/usr/lib/python3.8/gzip.py", line 173, in __init__
    fileobj = self.myfileobj = builtins.open(filename, mode or 'rb')
FileNotFoundError: [Errno 2] No such file or directory: 'components/stackexchange/stackexchange_dataset.tar'
Download method [direct] https://the-eye.eu/public/AI/pile_preliminary_components/stackexchange_dataset.tar failed, trying next option
0.00byte [00:07, ?byte/s]
0.00byte [00:07, ?byte/s]
0.00byte [00:09, ?byte/s]
Traceback (most recent call last):
  File "/mnt/the_pile/utils.py", line 50, in download
    tar_xf(fname)
  File "/mnt/the_pile/utils.py", line 72, in tar_xf
    tf = tarfile.open(x)
  File "/usr/lib/python3.8/tarfile.py", line 1603, in open
    return func(name, "r", fileobj, **kwargs)
  File "/usr/lib/python3.8/tarfile.py", line 1667, in gzopen
    fileobj = GzipFile(name, mode + "b", compresslevel, fileobj)
  File "/usr/lib/python3.8/gzip.py", line 173, in __init__
    fileobj = self.myfileobj = builtins.open(filename, mode or 'rb')
FileNotFoundError: [Errno 2] No such file or directory: 'components/stackexchange/stackexchange_dataset.tar'
Download method [direct] http://eaidata.bmk.sh/data/stackexchange_dataset.tar failed, trying next option
Traceback (most recent call last):
  File "the_pile/pile.py", line 360, in <module>
    dset._download()
  File "/mnt/the_pile/datasets.py", line 456, in _download
    download('components/stackexchange/stackexchange_dataset.tar', 'f64f31d20db8d8692c1a019314a14974b4911a34ffef126feaf42da88860c666', [
  File "/mnt/the_pile/utils.py", line 67, in download
    raise Exception('Failed to download {} from any source'.format(fname))
Exception: Failed to download components/stackexchange/stackexchange_dataset.tar from any source

Finding source for components/bookcorpus/books1.tar.gz
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2.40G/2.40G [06:38<00:00, 6.03Mbyte/s]
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2.40G/2.40G [06:03<00:00, 6.62Mbyte/s]
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2.40G/2.40G [05:41<00:00, 7.05Mbyte/s]
Traceback (most recent call last):
  File "/mnt/the_pile/utils.py", line 50, in download
    tar_xf(fname)
  File "/mnt/the_pile/utils.py", line 72, in tar_xf
    tf = tarfile.open(x)
  File "/usr/lib/python3.8/tarfile.py", line 1603, in open
    return func(name, "r", fileobj, **kwargs)
  File "/usr/lib/python3.8/tarfile.py", line 1667, in gzopen
    fileobj = GzipFile(name, mode + "b", compresslevel, fileobj)
  File "/usr/lib/python3.8/gzip.py", line 173, in __init__
    fileobj = self.myfileobj = builtins.open(filename, mode or 'rb')
FileNotFoundError: [Errno 2] No such file or directory: 'components/bookcorpus/books1.tar.gz'
Download method [direct] https://the-eye.eu/public/AI/pile_preliminary_components/books1.tar.gz failed, trying next option
2.40Gbyte [03:22, 11.9Mbyte/s]
2.40Gbyte [03:20, 12.0Mbyte/s]
2.40Gbyte [03:20, 12.0Mbyte/s]
Traceback (most recent call last):
  File "/mnt/the_pile/utils.py", line 50, in download
    tar_xf(fname)
  File "/mnt/the_pile/utils.py", line 72, in tar_xf
    tf = tarfile.open(x)
  File "/usr/lib/python3.8/tarfile.py", line 1603, in open
    return func(name, "r", fileobj, **kwargs)
  File "/usr/lib/python3.8/tarfile.py", line 1667, in gzopen
    fileobj = GzipFile(name, mode + "b", compresslevel, fileobj)
  File "/usr/lib/python3.8/gzip.py", line 173, in __init__
    fileobj = self.myfileobj = builtins.open(filename, mode or 'rb')
FileNotFoundError: [Errno 2] No such file or directory: 'components/bookcorpus/books1.tar.gz'
Download method [direct] http://battle.shawwn.com/sdb/books1/books1.tar.gz failed, trying next option
Traceback (most recent call last):
  File "the_pile/pile.py", line 360, in <module>
    dset._download()
  File "/mnt/the_pile/datasets.py", line 106, in _download
    download('components/bookcorpus/books1.tar.gz', 'e3c993cc825df2bdf0f78ef592f5c09236f0b9cd6bb1877142281acc50f446f9', [
  File "/mnt/the_pile/utils.py", line 67, in download
    raise Exception('Failed to download {} from any source'.format(fname))
Exception: Failed to download components/bookcorpus/books1.tar.gz from any source
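
For anyone trying to reproduce this, here is a minimal diagnostic sketch. It checks whether each file actually exists at the relative path the script passes to `tarfile.open`, and whether its SHA-256 matches the hash shown in the `download(...)` call in the traceback. The helper names `sha256_of` and `check_download` are hypothetical and not part of the_pile. The sketch assumes it is run from the same working directory as `the_pile/pile.py`.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_download(fname, expected_sha256):
    """Hypothetical helper (not from the_pile): classify the state of fname."""
    path = Path(fname)
    if not path.is_file():
        return "missing"
    if sha256_of(path) != expected_sha256:
        return "checksum mismatch"
    return "ok"


if __name__ == "__main__":
    # File names and hashes copied from the tracebacks above.
    targets = {
        "components/stackexchange/stackexchange_dataset.tar":
            "f64f31d20db8d8692c1a019314a14974b4911a34ffef126feaf42da88860c666",
        "components/bookcorpus/books1.tar.gz":
            "e3c993cc825df2bdf0f78ef592f5c09236f0b9cd6bb1877142281acc50f446f9",
    }
    # Relative paths resolve against the current working directory,
    # so print it alongside each result.
    print("cwd:", Path.cwd())
    for fname, sha in targets.items():
        print(fname, "->", check_download(fname, sha))
```

If this reports `missing` even though the progress bar reached 100%, the file was probably written somewhere other than `components/...` under the current working directory. That is only one possible explanation and is not confirmed by the logs.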