EthanRosenthal / medium-data-bakeoff

A python library bakeoff for medium sized datasets
MIT License
23 stars, 7 forks

BadZipFile: Bad CRC-32 for file 'citi_bike_data_00001.csv' #10

Open stoneyv opened 1 year ago

stoneyv commented 1 year ago

The make-dataset command failed with a bad CRC-32 error. I will try to download the data manually and convert it.

(medium-data-bakeoff-py3.9) stoney@laptop2:~/Desktop/medium-data-bakeoff/src$ python -m medium_data_bakeoff make-dataset
Warning: Your Kaggle API key is readable by other users on this system! To fix this, you can run 'chmod 600 /home/stoney/.kaggle/kaggle.json'
2022-11-14 22:11:15.730 | INFO     | medium_data_bakeoff.data:construct_dataset:64 - Downloading 'rosenthal/citi-bike-stations' dataset from kaggle.
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /home/stoney/.cache/pypoetry/virtualenvs/medium-data-bakeoff-2erTjUlV-py3.9/lib/python3.9/site-p │
│ ackages/kaggle/api/kaggle_api_extended.py:1246 in dataset_download_files                         │
│                                                                                                  │
│   1243 │   │   │   if unzip:                                                                     │
│   1244 │   │   │   │   try:                                                                      │
│   1245 │   │   │   │   │   with zipfile.ZipFile(outfile) as z:                                   │
│ ❱ 1246 │   │   │   │   │   │   z.extractall(effective_path)                                      │
│   1247 │   │   │   │   except zipfile.BadZipFile as e:                                           │
│   1248 │   │   │   │   │   raise ValueError(                                                     │
│   1249 │   │   │   │   │   │   'Bad zip file, please report on '                                 │
│                                                                                                  │
│ ╭────────────────────────────────────────── locals ───────────────────────────────────────────╮  │
│ │        dataset = 'rosenthal/citi-bike-stations'                                             │  │
│ │   dataset_slug = 'citi-bike-stations'                                                       │  │
│ │   dataset_urls = ['rosenthal', 'citi-bike-stations']                                        │  │
│ │     downloaded = True                                                                       │  │
│ │ effective_path = PosixPath('/home/stoney/Desktop/medium-data-bakeoff/data/csv')             │  │
│ │          force = False                                                                      │  │
│ │        outfile = '/home/stoney/Desktop/medium-data-bakeoff/data/csv/citi-bike-stations.zip' │  │
│ │     owner_slug = 'rosenthal'                                                                │  │
│ │           path = PosixPath('/home/stoney/Desktop/medium-data-bakeoff/data/csv')             │  │
│ │          quiet = True                                                                       │  │
│ │       response = <urllib3.response.HTTPResponse object at 0x7fb5beb1f3a0>                   │  │
│ │           self = <kaggle.api.kaggle_api_extended.KaggleApi object at 0x7fb5bed08850>        │  │
│ │          unzip = True                                                                       │  │
│ │              z = <zipfile.ZipFile [closed]>                                                 │  │
│ ╰─────────────────────────────────────────────────────────────────────────────────────────────╯  │
│                                                                                                  │
│ /home/stoney/.pyenv/versions/3.9.5/lib/python3.9/zipfile.py:1633 in extractall                   │
│                                                                                                  │
│   1630 │   │   │   path = os.fspath(path)                                                        │
│   1631 │   │                                                                                     │
│   1632 │   │   for zipinfo in members:                                                           │
│ ❱ 1633 │   │   │   self._extract_member(zipinfo, path, pwd)                                      │
│   1634 │                                                                                         │
│   1635 │   @classmethod                                                                          │
│   1636 │   def _sanitize_windows_name(cls, arcname, pathsep):                                    │
│                                                                                                  │
│ ╭─────────────────────────── locals ────────────────────────────╮                                │
│ │ members = [                                                   │                                │
│ │           │   'citi_bike_data_00000.csv',                     │                                │
│ │           │   'citi_bike_data_00001.csv',                     │                                │
│ │           │   'citi_bike_data_00002.csv',                     │                                │
│ │           │   'citi_bike_data_00003.csv',                     │                                │
│ │           │   'citi_bike_data_00004.csv',                     │                                │
│ │           │   'citi_bike_data_00005.csv',                     │                                │
│ │           │   'citi_bike_data_00006.csv',                     │                                │
│ │           │   'citi_bike_data_00007.csv',                     │                                │
│ │           │   'citi_bike_data_00008.csv',                     │                                │
│ │           │   'citi_bike_data_00009.csv',                     │                                │
│ │           │   ... +40                                         │                                │
│ │           ]                                                   │                                │
│ │    path = '/home/stoney/Desktop/medium-data-bakeoff/data/csv' │                                │
│ │     pwd = None                                                │                                │
│ │    self = <zipfile.ZipFile [closed]>                          │                                │
│ │ zipinfo = 'citi_bike_data_00001.csv'                          │                                │
│ ╰───────────────────────────────────────────────────────────────╯                                │
│                                                                                                  │
│ /home/stoney/.pyenv/versions/3.9.5/lib/python3.9/zipfile.py:1688 in _extract_member              │
│                                                                                                  │
│   1685 │   │                                                                                     │
│   1686 │   │   with self.open(member, pwd=pwd) as source, \                                      │
│   1687 │   │   │    open(targetpath, "wb") as target:                                            │
│ ❱ 1688 │   │   │   shutil.copyfileobj(source, target)                                            │
│   1689 │   │                                                                                     │
│   1690 │   │   return targetpath                                                                 │
│   1691                                                                                           │
│                                                                                                  │
│ ╭─────────────────────────────────────────── locals ───────────────────────────────────────────╮ │
│ │            arcname = 'citi_bike_data_00001.csv'                                              │ │
│ │ invalid_path_parts = ('', '.', '..')                                                         │ │
│ │             member = <ZipInfo filename='citi_bike_data_00001.csv' compress_type=deflate      │ │
│ │                      file_size=383898052 compress_size=58998990>                             │ │
│ │                pwd = None                                                                    │ │
│ │               self = <zipfile.ZipFile [closed]>                                              │ │
│ │             source = <zipfile.ZipExtFile [closed]>                                           │ │
│ │             target = <_io.BufferedWriter                                                     │ │
│ │                      name='/home/stoney/Desktop/medium-data-bakeoff/data/csv/citi_bike_data… │ │
│ │         targetpath = '/home/stoney/Desktop/medium-data-bakeoff/data/csv/citi_bike_data_0000… │ │
│ │          upperdirs = '/home/stoney/Desktop/medium-data-bakeoff/data/csv'                     │ │
│ ╰──────────────────────────────────────────────────────────────────────────────────────────────╯ │
│                                                                                                  │
│ /home/stoney/.pyenv/versions/3.9.5/lib/python3.9/shutil.py:205 in copyfileobj                    │
│                                                                                                  │
│    202 │   fsrc_read = fsrc.read                                                                 │
│    203 │   fdst_write = fdst.write                                                               │
│    204 │   while True:                                                                           │
│ ❱  205 │   │   buf = fsrc_read(length)                                                           │
│    206 │   │   if not buf:                                                                       │
│    207 │   │   │   break                                                                         │
│    208 │   │   fdst_write(buf)                                                                   │
│                                                                                                  │
│ ╭─────────────────────────────────────────── locals ───────────────────────────────────────────╮ │
│ │        buf = b',043450N63715450N63715\\N,0,30406,04345027,715450N637151496218044,043450N637… │ │
│ │       fdst = <_io.BufferedWriter                                                             │ │
│ │              name='/home/stoney/Desktop/medium-data-bakeoff/data/csv/citi_bike_data_00001.c… │ │
│ │ fdst_write = <built-in method write of _io.BufferedWriter object at 0x7fb5beb2bb40>          │ │
│ │       fsrc = <zipfile.ZipExtFile [closed]>                                                   │ │
│ │  fsrc_read = <bound method ZipExtFile.read of <zipfile.ZipExtFile [closed]>>                 │ │
│ │     length = 65536                                                                           │ │
│ ╰──────────────────────────────────────────────────────────────────────────────────────────────╯ │
│                                                                                                  │
│ /home/stoney/.pyenv/versions/3.9.5/lib/python3.9/zipfile.py:922 in read                          │
│                                                                                                  │
│    919 │   │   self._readbuffer = b''                                                            │
│    920 │   │   self._offset = 0                                                                  │
│    921 │   │   while n > 0 and not self._eof:                                                    │
│ ❱  922 │   │   │   data = self._read1(n)                                                         │
│    923 │   │   │   if n < len(data):                                                             │
│    924 │   │   │   │   self._readbuffer = data                                                   │
│    925 │   │   │   │   self._offset = n                                                          │
│                                                                                                  │
│ ╭─────────────── locals ───────────────╮                                                         │
│ │  buf = b''                           │                                                         │
│ │  end = 65536                         │                                                         │
│ │    n = 65536                         │                                                         │
│ │ self = <zipfile.ZipExtFile [closed]> │                                                         │
│ ╰──────────────────────────────────────╯                                                         │
│                                                                                                  │
│ /home/stoney/.pyenv/versions/3.9.5/lib/python3.9/zipfile.py:1012 in _read1                       │
│                                                                                                  │
│   1009 │   │   self._left -= len(data)                                                           │
│   1010 │   │   if self._left <= 0:                                                               │
│   1011 │   │   │   self._eof = True                                                              │
│ ❱ 1012 │   │   self._update_crc(data)                                                            │
│   1013 │   │   return data                                                                       │
│   1014 │                                                                                         │
│   1015 │   def _read2(self, n):                                                                  │
│                                                                                                  │
│ ╭─────────────────────────────────────────── locals ───────────────────────────────────────────╮ │
│ │ data = b'0,304015,0,1,17,0,1,1,1,1605744509,Broadway & Battery                               │ │
│ │        Pl,18270463334,-74.0136170'+53620                                                     │ │
│ │    n = 65536                                                                                 │ │
│ │ self = <zipfile.ZipExtFile [closed]>                                                         │ │
│ ╰──────────────────────────────────────────────────────────────────────────────────────────────╯ │
│                                                                                                  │
│ /home/stoney/.pyenv/versions/3.9.5/lib/python3.9/zipfile.py:940 in _update_crc                   │
│                                                                                                  │
│    937 │   │   self._running_crc = crc32(newdata, self._running_crc)                             │
│    938 │   │   # Check the CRC if we're at the end of the file                                   │
│    939 │   │   if self._eof and self._running_crc != self._expected_crc:                         │
│ ❱  940 │   │   │   raise BadZipFile("Bad CRC-32 for file %r" % self.name)                        │
│    941 │                                                                                         │
│    942 │   def read1(self, n):                                                                   │
│    943 │   │   """Read up to n bytes with at most one read() system call."""                     │
│                                                                                                  │
│ ╭─────────────────────────────────────────── locals ───────────────────────────────────────────╮ │
│ │ newdata = b'0,304015,0,1,17,0,1,1,1,1605744509,Broadway & Battery                            │ │
│ │           Pl,18270463334,-74.0136170'+53620                                                  │ │
│ │    self = <zipfile.ZipExtFile [closed]>                                                      │ │
│ ╰──────────────────────────────────────────────────────────────────────────────────────────────╯ │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
BadZipFile: Bad CRC-32 for file 'citi_bike_data_00001.csv'

During handling of the above exception, another exception occurred:

╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /home/stoney/Desktop/medium-data-bakeoff/src/medium_data_bakeoff/cli.py:10 in make_dataset       │
│                                                                                                  │
│    7 def make_dataset() -> None:                                                                 │
│    8 │   from medium_data_bakeoff.data import construct_dataset                                  │
│    9 │                                                                                           │
│ ❱ 10 │   construct_dataset()                                                                     │
│   11                                                                                             │
│   12                                                                                             │
│   13 @app.command(                                                                               │
│                                                                                                  │
│ ╭────────────────────────────── locals ──────────────────────────────╮                           │
│ │ construct_dataset = <function construct_dataset at 0x7fb5bebcb820> │                           │
│ ╰────────────────────────────────────────────────────────────────────╯                           │
│                                                                                                  │
│ /home/stoney/Desktop/medium-data-bakeoff/src/medium_data_bakeoff/data.py:66 in construct_dataset │
│                                                                                                  │
│    63 │   # Download the dataset from Kaggle                                                     │
│    64 │   logger.info("Downloading {!r} dataset from kaggle.", config.KAGGLE_DATASET)            │
│    65 │   config.CSV_PATH.mkdir(parents=True, exist_ok=True)                                     │
│ ❱  66 │   kaggle.api.dataset_download_files(                                                     │
│    67 │   │   config.KAGGLE_DATASET, config.CSV_PATH, unzip=True                                 │
│    68 │   )                                                                                      │
│    69                                                                                            │
│                                                                                                  │
│ /home/stoney/.cache/pypoetry/virtualenvs/medium-data-bakeoff-2erTjUlV-py3.9/lib/python3.9/site-p │
│ ackages/kaggle/api/kaggle_api_extended.py:1248 in dataset_download_files                         │
│                                                                                                  │
│   1245 │   │   │   │   │   with zipfile.ZipFile(outfile) as z:                                   │
│   1246 │   │   │   │   │   │   z.extractall(effective_path)                                      │
│   1247 │   │   │   │   except zipfile.BadZipFile as e:                                           │
│ ❱ 1248 │   │   │   │   │   raise ValueError(                                                     │
│   1249 │   │   │   │   │   │   'Bad zip file, please report on '                                 │
│   1250 │   │   │   │   │   │   'www.github.com/kaggle/kaggle-api', e)                            │
│   1251                                                                                           │
│                                                                                                  │
│ ╭────────────────────────────────────────── locals ───────────────────────────────────────────╮  │
│ │        dataset = 'rosenthal/citi-bike-stations'                                             │  │
│ │   dataset_slug = 'citi-bike-stations'                                                       │  │
│ │   dataset_urls = ['rosenthal', 'citi-bike-stations']                                        │  │
│ │     downloaded = True                                                                       │  │
│ │ effective_path = PosixPath('/home/stoney/Desktop/medium-data-bakeoff/data/csv')             │  │
│ │          force = False                                                                      │  │
│ │        outfile = '/home/stoney/Desktop/medium-data-bakeoff/data/csv/citi-bike-stations.zip' │  │
│ │     owner_slug = 'rosenthal'                                                                │  │
│ │           path = PosixPath('/home/stoney/Desktop/medium-data-bakeoff/data/csv')             │  │
│ │          quiet = True                                                                       │  │
│ │       response = <urllib3.response.HTTPResponse object at 0x7fb5beb1f3a0>                   │  │
│ │           self = <kaggle.api.kaggle_api_extended.KaggleApi object at 0x7fb5bed08850>        │  │
│ │          unzip = True                                                                       │  │
│ │              z = <zipfile.ZipFile [closed]>                                                 │  │
│ ╰─────────────────────────────────────────────────────────────────────────────────────────────╯  │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
ValueError: ('Bad zip file, please report on www.github.com/kaggle/kaggle-api', BadZipFile("Bad CRC-32 for file 'citi_bike_data_00001.csv'"))

EthanRosenthal commented 1 year ago

Weird, I have not run into this before, and I honestly have no idea what's going on. I believe that make-dataset (through construct_dataset) first downloads a single large zip file from Kaggle into ./data/csv/ and then unzips that file. As you mentioned, maybe manually downloading and unzipping it yourself will work? Let me know what you find out.
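
If it helps with the manual route, here is a minimal sketch for checking whether the downloaded zip itself is damaged before extracting it. It only uses the standard library; the function names describe_zip_problem and extract_if_intact are made up for this example and are not part of medium_data_bakeoff or the Kaggle client. A failed check would suggest the download was corrupted in transit, though that is a guess rather than a diagnosis.

```python
import zipfile
import zlib
from pathlib import Path
from typing import Optional


def describe_zip_problem(zip_path: Path) -> Optional[str]:
    """Return None if every member passes its CRC-32 check, else a short description.

    Hypothetical helper for illustration only.
    """
    try:
        with zipfile.ZipFile(zip_path) as archive:
            # testzip() reads every member and returns the name of the
            # first one with a bad CRC or header, or None if all are fine.
            bad_member = archive.testzip()
    except (zipfile.BadZipFile, zlib.error) as exc:
        # Raised when the archive is truncated or the compressed stream is broken.
        return f"archive could not be read: {exc}"
    if bad_member is not None:
        return f"bad CRC-32 or header in member {bad_member!r}"
    return None


def extract_if_intact(zip_path: Path, dest: Path) -> bool:
    """Extract zip_path into dest only when the whole archive verifies."""
    problem = describe_zip_problem(zip_path)
    if problem is not None:
        print(f"{zip_path}: {problem}; try downloading the file again.")
        return False
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
    return True
```

With the paths from the traceback above, the call would look like extract_if_intact(Path("data/csv/citi-bike-stations.zip"), Path("data/csv")). If the check fails, deleting the zip and downloading it again is the obvious next step. The traceback locals show force = False, so passing force=True to kaggle.api.dataset_download_files may fetch a fresh copy, but I have not verified that.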