dask / dask

Parallel computing with task scheduling
https://dask.org
BSD 3-Clause "New" or "Revised" License

zipfile.BadZipFile: Overlapped entries (possible zip bomb) #11358

Open leonardozilli opened 2 weeks ago

leonardozilli commented 2 weeks ago

Describe the issue: I have a codebase I built months ago that uses Dask to process a dataset of dozens of .csv files, spread across a collection of zip archives. The code worked without problems on Python 3.11.5, but since upgrading to 3.12 I get the following error when trying to read the CSVs:

zipfile.BadZipFile: Overlapped entries: '2023-08-04T181624_0_1.csv' (possible zip bomb)

The dataset comes from here: https://figshare.com/articles/dataset/OpenCitations_Index_CSV_dataset_of_all_the_citation_data/24356626/2

Minimal Complete Verifiable Example:

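import os
from pathlib import Path
from zipfile import ZipFile

import dask.dataframe as dd
from tqdm import tqdm

# index_path is assumed to hold the path to the downloaded dataset
# (either the outer .zip or an already extracted directory)

# unzip the internal archives
if index_path.endswith('.zip'):
    extraction_dir = index_path.replace('.zip', '')
    with ZipFile(index_path, 'r') as zip_ref:
        zip_ref.extractall(extraction_dir)
    index_path = extraction_dir

file_names = [Path(index_path) / Path(archive) for archive in os.listdir(index_path)]

for archive in tqdm(file_names):
    zip_file = ZipFile(archive)

    # address each CSV member of the archive through the zip:// protocol
    csvs = ['zip://'+n for n in zip_file.namelist() if n.endswith('.csv')]

    ddf = dd.read_csv(csvs, storage_options={'fo': zip_file.filename}, usecols=['id', 'citing', 'cited'])
    # ... process the dataframe ...
    ddf.to_parquet(archive.stem, write_index=False)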

Environment:

hendrikmakait commented 2 weeks ago

Thanks for the report. At first glance, it looks like you might be facing a problem with zipfile and not with Dask.
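One way to check this is to read the same archive members with the standard library alone, leaving Dask out entirely. The following is a minimal sketch, not a definitive diagnostic: `check_archive` is a hypothetical helper name, and it assumes the error is raised when a member is opened or read through `zipfile`.

```python
import zipfile


def check_archive(archive_path):
    """Read every CSV member using only the standard library (no Dask).

    Returns a list of (member_name, error_message) pairs for the members
    that zipfile refused to read.
    """
    failures = []
    with zipfile.ZipFile(archive_path) as zf:
        for name in zf.namelist():
            if not name.endswith(".csv"):
                continue
            try:
                with zf.open(name) as member:
                    # Read in 1 MiB chunks so large members are not
                    # loaded into memory all at once.
                    while member.read(1 << 20):
                        pass
            except zipfile.BadZipFile as exc:
                failures.append((name, str(exc)))
    return failures
```

If calling `check_archive` on one of the inner archives returns the same "Overlapped entries" message, the failure reproduces without Dask and points at `zipfile` or at the archive itself.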

If your problem is related to Dask, please provide more information about what caused this:

This helps us reproduce the issue you're having and resolve the issue more quickly.