DBD-research-group / BirdSet

A benchmark dataset collection for bird sound classification
https://huggingface.co/datasets/DBD-research-group/BirdSet
BSD 3-Clause "New" or "Revised" License

Data download size discrepancy #267

Open sammlapp opened 3 days ago

sammlapp commented 3 days ago

I'm downloading XCL using the following:

from birdset.datamodule import DatasetConfig
from birdset.datamodule.birdset_datamodule import BirdSetDataModule

# download a complete xeno canto snapshot included in BirdSet
# https://huggingface.co/datasets/DBD-research-group/BirdSet

# initialize the data module
dm = BirdSetDataModule(
    dataset=DatasetConfig(
        data_dir=".../data_birdset/",
        hf_path="DBD-research-group/BirdSet",
        hf_name="XCL",
        n_workers=4,
        val_split=0.2,
        task="multilabel",
        classlimit=500,
        eventlimit=5,
        sampling_rate=32000,
    ),
)
# prepare the data (download dataset, ...)
dm.prepare_data()

Based on the description on Hugging Face, I expected 528,434 files totaling 484 GB. However, I eventually ran out of storage, with the downloaded content exceeding 700 GB.

Additionally, when I restarted the download, it did not resume but instead started re-downloading into subdirectories with different long random hexadecimal names.

Two questions: (1) what is the full size of the XCL download, and (2) is there a way to avoid duplicate downloads of the same files using this API? This applies not only when a download gets interrupted, but also when downloading multiple datasets such as XCL and PER: ideally they would reference the same set of files on disk rather than storing an additional copy of the Xeno-Canto files.

Edit: I was able to download the entire XCL dataset after clearing up space. The data_birdset/downloads folder is 986 GB and data_birdset/downloads/extracted is 502 GB. Should I now delete the files in the downloads folder? (Are they temporary archives that were extracted into downloads/extracted?) I'm also unclear on how to use the XCL/XCM datasets in general; is there a script somewhere that demonstrates training on XCL? After the download completes and the train/valid splits are created using the code above, I get KeyError: 'test_5s', which I guess is because this dataset (unlike HSN etc.) doesn't contain test data.

Traceback (most recent call last):
  File "/home/sml161/birdset_download/download_XCL.py", line 22, in <module>
    dm.prepare_data()
  File "/home/sml161/BirdSet/birdset/datamodule/base_datamodule.py", line 120, in prepare_data
    dataset = self._preprocess_data(dataset)
  File "/home/sml161/BirdSet/birdset/datamodule/birdset_datamodule.py", line 130, in _preprocess_data
    dataset = DatasetDict({split: dataset[split] for split in ["train", "test_5s"]})
  File "/home/sml161/BirdSet/birdset/datamodule/birdset_datamodule.py", line 130, in <dictcomp>
    dataset = DatasetDict({split: dataset[split] for split in ["train", "test_5s"]})
  File "/home/sml161/miniconda3/envs/birdset/lib/python3.10/site-packages/datasets/dataset_dict.py", line 75, in __getitem__
    return super().__getitem__(k)
KeyError: 'test_5s'
lurauch commented 1 day ago

Hey :)

Thanks for catching that! You're right: the file size displayed on Hugging Face (HF) reflects only the zipped files, not the extracted ones. We'll look into updating that to avoid confusion.

Download process: HF should recognize an already-started download and automatically resume it. If the download restarts from scratch, it may be because the download folder path changed.
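
If the path hasn't changed and it still restarts, here's a rough sketch of asking the datasets library to reuse already-downloaded files (assuming the standard datasets caching behavior; the cache_dir must match the one from the interrupted run):

from datasets import load_dataset, DownloadMode

# Reuse previously downloaded archives in the cache instead of starting over.
ds = load_dataset(
    "DBD-research-group/BirdSet",
    "XCL",
    cache_dir=".../data_birdset/",  # same path as the interrupted run
    download_mode=DownloadMode.REUSE_DATASET_IF_EXISTS,
    trust_remote_code=True,  # BirdSet uses a custom builder script
)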

Dataset size: After unpacking, the dataset should be around 993 GB total, with the extracted folder taking up about 510 GB. To be honest, we hadn't thought about deleting files outside /downloads/extracted, but a quick check suggests that all dataset paths point to the extracted files. This is a great point! Maybe we can add an automatic cleanup step to the builder script to remove unnecessary files; I'll explore this further.
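
In the meantime, if you've confirmed that every path in your loaded dataset points under downloads/extracted, a cautious manual cleanup could look like the following sketch (paths are assumptions based on your config; please verify before deleting anything):

from pathlib import Path

downloads = Path(".../data_birdset/downloads")  # root from your DatasetConfig

# Delete only the top-level archive files; keep the extracted/ tree intact.
for entry in downloads.iterdir():
    if entry.is_file():
        entry.unlink()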

HF dataset structure & downloads: Here are a few notes on how HF handles data structure and downloads:

Duplicate downloads: Duplicate downloads can be an issue, and while we've attempted to address this in our HF builder script, as far as we know HF doesn't offer a better solution right now. To help minimize duplicates, you could:

  1. Selective downloads: You can download only specific datasets, like HSN_scape, without pulling the XC files again.
  2. Subset creation: If you need the train subsets, a workaround is to first load XCL, then apply a custom mapping function to filter the specific eBird codes for each test set (we created our subsets this way), and save the results with save_to_disk; see the sketch after this list. We'll look into integrating a simpler subset-creation method in BirdSet. Thanks for the suggestion!
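
Roughly, that workaround could look like this (a sketch, not our exact script; the ebird_code column name and the placeholder codes are assumptions, so check the actual feature names and species lists):

from datasets import load_dataset

# Load the full XCL snapshot (served from the cache after the first download).
xcl = load_dataset(
    "DBD-research-group/BirdSet",
    "XCL",
    cache_dir=".../data_birdset/",
    trust_remote_code=True,
)

# Placeholder eBird codes; substitute the species list of the target test set.
per_codes = {"codeA", "codeB"}

subset = xcl["train"].filter(lambda row: row["ebird_code"] in per_codes)
subset.save_to_disk(".../data_birdset/XCL_PER_subset")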

Error in XCM/XCL test dataset: The error occurs because XCM/XCL require a different datamodule, as these sets don't include a dedicated test split. This is covered in the "Reproduce Baselines" section of the docs (which I accidentally removed). You can also refer to the configs in the repository; a generic stopgap is sketched below.
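
As a stopgap using plain datasets calls (not the BirdSet datamodule API), you can carve a validation split out of the train set yourself:

from datasets import load_dataset, DatasetDict

xcl = load_dataset(
    "DBD-research-group/BirdSet",
    "XCL",
    cache_dir=".../data_birdset/",
    trust_remote_code=True,
)

# XCL/XCM ship only a train split, so create a validation set manually.
splits = xcl["train"].train_test_split(test_size=0.2, seed=42)
dataset = DatasetDict({"train": splits["train"], "valid": splits["test"]})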

If you still need clarification, feel free to reach out here or by email!