DBD-research-group / BirdSet

A benchmark dataset collection for bird sound classification
https://huggingface.co/datasets/DBD-research-group/BirdSet
BSD 3-Clause "New" or "Revised" License

Data download size discrepancy #267

Open sammlapp opened 3 days ago

sammlapp commented 3 days ago

I'm downloading XCL using the following:

from birdset.datamodule import DatasetConfig
from birdset.datamodule.birdset_datamodule import BirdSetDataModule

# download a complete xeno canto snapshot included in BirdSet
# https://huggingface.co/datasets/DBD-research-group/BirdSet

# initialize the data module
dm = BirdSetDataModule(
    dataset=DatasetConfig(
        data_dir=".../data_birdset/",
        hf_path="DBD-research-group/BirdSet",
        hf_name="XCL",
        n_workers=4,
        val_split=0.2,
        task="multilabel",
        classlimit=500,
        eventlimit=5,
        sampling_rate=32000,
    ),
)
# prepare the data (download dataset, ...)
dm.prepare_data()

Based on the description on Hugging Face, I expected 528,434 files totaling 484 GB. However, I eventually ran out of storage, with the downloaded content exceeding 700 GB.

Additionally, when I restarted the download, it did not resume but instead started re-downloading into subdirectories with different long random hexadecimal names.

Two questions: (1) what is the full size of the XCL download, and (2) is there a way to avoid duplicate downloads of the same files using this API? This applies not only when a download gets interrupted, but also when downloading multiple datasets such as XCL and PER: ideally they would reference the same set of files on disk rather than storing an additional copy of the Xeno-Canto files.

Edit: I was able to download the entire XCL dataset after clearing up space. The data_birdset/downloads folder is 986 GB and data_birdset/downloads/extracted is 502 GB. Should I now delete the files in the downloads folder? (Are they temporary archives that were extracted into downloads/extracted?) I'm also unclear on how to use the XCL/XCM datasets in general; is there a script somewhere that demonstrates training on XCL? After the download completes and the train/valid splits are created using the code above, I get KeyError: 'test_5s', which I guess is because this dataset (unlike HSN etc.) doesn't contain test data.

Traceback (most recent call last):
  File "/home/sml161/birdset_download/download_XCL.py", line 22, in <module>
    dm.prepare_data()
  File "/home/sml161/BirdSet/birdset/datamodule/base_datamodule.py", line 120, in prepare_data
    dataset = self._preprocess_data(dataset)
  File "/home/sml161/BirdSet/birdset/datamodule/birdset_datamodule.py", line 130, in _preprocess_data
    dataset = DatasetDict({split: dataset[split] for split in ["train", "test_5s"]})
  File "/home/sml161/BirdSet/birdset/datamodule/birdset_datamodule.py", line 130, in <dictcomp>
    dataset = DatasetDict({split: dataset[split] for split in ["train", "test_5s"]})
  File "/home/sml161/miniconda3/envs/birdset/lib/python3.10/site-packages/datasets/dataset_dict.py", line 75, in __getitem__
    return super().__getitem__(k)
KeyError: 'test_5s'
lurauch commented 1 day ago

Hey :)

Thanks for catching that! You're right: the file size displayed on Hugging Face (HF) reflects only the zipped files, not the extracted ones. We'll look into updating that to avoid confusion.

Download process: HF should recognize an already-started download and automatically resume it. If the download restarts from scratch, it may be because the download folder path changed.
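
If the path hasn't changed and it still restarts, here's a rough sketch of asking the datasets library to reuse already-downloaded files (assuming the standard datasets caching behavior; the cache_dir must match the one from the interrupted run):

from datasets import load_dataset, DownloadMode

# Reuse previously downloaded archives in the cache instead of starting over.
ds = load_dataset(
    "DBD-research-group/BirdSet",
    "XCL",
    cache_dir=".../data_birdset/",  # same path as the interrupted run
    download_mode=DownloadMode.REUSE_DATASET_IF_EXISTS,
    trust_remote_code=True,  # BirdSet uses a custom builder script
)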

Dataset size: After unpacking, the dataset should be around 993 GB total, with the extracted folder taking up about 510 GB. To be honest, we hadn't thought about deleting files outside /downloads/extracted, but a quick check suggests that all dataset paths point to the extracted files. This is a great point! Maybe we can add an automatic cleanup step to the builder script to remove unnecessary files; I'll explore this further.
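
In the meantime, if you've confirmed that every path in your loaded dataset points under downloads/extracted, a cautious manual cleanup could look like the following sketch (paths are assumptions based on your config; please verify before deleting anything):

from pathlib import Path

downloads = Path(".../data_birdset/downloads")  # root from your DatasetConfig

# Delete only the top-level archive files; keep the extracted/ tree intact.
for entry in downloads.iterdir():
    if entry.is_file():
        entry.unlink()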

HF dataset structure & downloads: Here are a few notes on how HF handles data structure and downloads:

Duplicate downloads: Duplicate downloads can be an issue, and while we've attempted to address this in our HF builder script, as far as we know HF doesn't offer a better solution right now. To help minimize duplicates, you could:

  1. Selective downloads: You can download only specific datasets, like HSN_scape, without pulling the XC files again.
  2. Subset creation: If you need the train subsets, a workaround is to first load XCL, then apply a custom mapping function to filter the specific eBird codes for each test set (we created our subsets this way), and save the results with save_to_disk; see the sketch after this list. We'll look into integrating a simpler subset-creation method in BirdSet. Thanks for the suggestion!
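
Roughly, that workaround could look like this (a sketch, not our exact script; the ebird_code column name and the placeholder codes are assumptions, so check the actual feature names and species lists):

from datasets import load_dataset

# Load the full XCL snapshot (served from the cache after the first download).
xcl = load_dataset(
    "DBD-research-group/BirdSet",
    "XCL",
    cache_dir=".../data_birdset/",
    trust_remote_code=True,
)

# Placeholder eBird codes; substitute the species list of the target test set.
per_codes = {"codeA", "codeB"}

subset = xcl["train"].filter(lambda row: row["ebird_code"] in per_codes)
subset.save_to_disk(".../data_birdset/XCL_PER_subset")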

Error in XCM/XCL test dataset: The error occurs because XCM/XCL require a different datamodule, as these sets don't include a dedicated test split. This is covered in the "Reproduce Baselines" section of the docs (which I accidentally removed). You can also refer to the configs in the repository; a generic stopgap is sketched below.
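
As a stopgap using plain datasets calls (not the BirdSet datamodule API), you can carve a validation split out of the train set yourself:

from datasets import load_dataset, DatasetDict

xcl = load_dataset(
    "DBD-research-group/BirdSet",
    "XCL",
    cache_dir=".../data_birdset/",
    trust_remote_code=True,
)

# XCL/XCM ship only a train split, so create a validation set manually.
splits = xcl["train"].train_test_split(test_size=0.2, seed=42)
dataset = DatasetDict({"train": splits["train"], "valid": splits["test"]})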

If you still need clarification, feel free to reach out here or by email!