sammlapp opened 3 days ago
Hey :)
Thanks for catching that! You’re right—the file size displayed on Hugging Face (HF) only reflects the zipped files and not the extracted ones. We’ll look into updating that to avoid any confusion.
**Download process:** HF should recognize if a download has already started and automatically continue it. If the download restarts from scratch, it is usually because the download/cache folder path changed between runs.
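For example, keeping the cache directory fixed between runs lets HF pick up already-fetched archives instead of starting over. This is only an illustrative sketch with plain `datasets` calls; the paths are placeholders, and whether `trust_remote_code` is needed depends on your `datasets` version:

```python
from datasets import load_dataset

# Keep cache_dir identical across runs: downloads are keyed by URL inside this
# folder, so an interrupted run can reuse the already-fetched archives instead
# of starting over in a new hash-named subdirectory.
ds = load_dataset(
    "DBD-research-group/BirdSet",  # hf_path; adjust if yours differs
    "XCL",
    cache_dir="data_birdset/XCL",  # must stay the same between runs
    trust_remote_code=True,        # may be required depending on your datasets version
)
```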
**Dataset size:** After unpacking, the dataset should be around 993 GB in total, with the extracted folder taking up about 510 GB. TBH we haven't thought about deleting files outside `/downloads/extracted`, but a quick check suggests that all paths point to the extracted files. This is a great point! Maybe we can add an automatic cleanup step in the builder script to remove unnecessary files—I'll explore this further.
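Until something like that lands in the builder script, a manual cleanup along these lines should work. Treat it as a sketch: it assumes the default `data_birdset` layout and that your processed datasets really only reference files under `downloads/extracted`; double-check before deleting anything, and note that you lose the ability to re-extract from the original archives.

```python
from pathlib import Path

downloads = Path("data_birdset/downloads")  # adjust to your data_dir
extracted = downloads / "extracted"

# Delete the original (already unpacked) archives but keep the extracted audio,
# since the processed datasets still reference files under downloads/extracted.
freed = 0
for item in downloads.iterdir():
    if item == extracted:
        continue
    if item.is_file():
        freed += item.stat().st_size
        item.unlink()
print(f"Freed roughly {freed / 1e9:.0f} GB")
```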
**HF dataset structure & downloads:** Here are a few notes on how HF handles the data structure and downloads:

- The audio column is typed as `Audio(decode=False)`, which means preprocessing (using `map` in HF) without decoding will only update the metadata. Using our example code with `save_to_disk`, you'll need the `/downloads/extracted` folder path to load it properly with `ds = load_from_disk(dm.disk_save_path)`. Our `XCL_processed` folders end up around 6 GB, so those downloads need to stay in place (see the short sketch below).
- If you call `save_to_disk` after decoding, HF saves the data in Arrow format. This unpacked data might be larger than the original files. In this case, you could delete the entire `/downloads` folder, though you'd lose the ability to unpack any additional files from the original set.

**Duplicate downloads:** Duplicate downloads can be an issue, and while we've attempted to address this in our HF builder script, afaik HF doesn't allow a better solution right now. To help minimize duplicates, you could:

- use the soundscape-only configs (e.g. `HSN_scape`) without pulling XC files again, or
- build your own subsets from the full XCL download and store them with `save_to_disk`. We'll look into integrating a simpler subset creation method in BirdSet—thanks for the suggestion!

**Error in XCM/XCL test dataset:** The error is due to needing a different datamodule for XCM/XCL, as these sets don't include a dedicated test split. This is covered in the "Reproduce Baselines" section in the docs (that I accidentally removed). Also, you can refer to the XCM/XCL configs in the repository.
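To make the first note above concrete, here's a rough sketch of that flow using plain `datasets` calls. It is illustrative only: the cache and output paths are placeholders, and `trust_remote_code` may or may not be needed on your `datasets` version.

```python
from datasets import Audio, load_dataset, load_from_disk

# Illustrative paths only; adjust cache_dir and the save path to your setup.
ds = load_dataset(
    "DBD-research-group/BirdSet",
    "XCL",
    cache_dir="data_birdset/XCL",
    trust_remote_code=True,
)

# With Audio(decode=False) the audio column only carries path/bytes metadata,
# so map() calls that don't decode never touch the actual waveforms.
ds = ds.cast_column("audio", Audio(decode=False))
ds = ds.map(lambda ex: {"filepath": ex["audio"]["path"]})

# save_to_disk stores only the (small) Arrow metadata tables here; the audio
# paths still point into .../downloads/extracted, which is why that folder
# has to stay in place for load_from_disk to work.
ds.save_to_disk("data_birdset/XCL_processed")
reloaded = load_from_disk("data_birdset/XCL_processed")
```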
If you still need clarification, feel free to reach out here or by email!
I'm downloading XCL using the following:
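(Roughly the datamodule example from the BirdSet README; the exact import paths and argument names below are from memory and may differ slightly from the current API:)

```python
# approximately the BirdSet README example for downloading/preparing XCL;
# import paths and argument names may need adjusting to the installed version
from birdset.datamodule.base_datamodule import DatasetConfig
from birdset.datamodule.birdset_datamodule import BirdSetDataModule

dm = BirdSetDataModule(
    dataset=DatasetConfig(
        data_dir="data_birdset/XCL",           # download/extraction target
        hf_path="DBD-research-group/BirdSet",
        hf_name="XCL",
        n_workers=3,
        val_split=0.2,                         # train/valid split created here
        task="multilabel",
        sampling_rate=32000,
    ),
)

dm.prepare_data()  # downloads, extracts, and prepares the dataset
```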
Based on the description on HuggingFace I expected 528,434 files, 484 GB. However, I eventually ran out of storage with the downloaded content hitting over 700 GB.
Additionally, when I restarted the download, it did not resume but instead started re-downloading into subdirectories with different long random hexadecimal names.
Two questions: (1) what is the full size of the XCL download; and (2) is there a way to avoid duplicate downloads of the same files using this API? This applies not only when a download gets interrupted, but also when downloading multiple datasets like XCL and PER: ideally they would reference the same set of files on disk rather than store an additional copy of the Xeno-Canto files.
Edit: I was able to download the entire XCL after clearing up space. The `data_birdset/downloads` folder is 986 GB and `data_birdset/downloads/extracted` is 502 GB. Should I now delete the files in the `downloads` folder? (Are they temporary files that were extracted into `downloads/extracted`?) I'm also unclear on how to use the XCL/XCM datasets in general: is there a script somewhere that demonstrates training on XCL? After the download completes and the train/valid split is created using the code above, I get `KeyError: 'test_5s'`, which I guess is because this dataset (unlike HSN etc.) doesn't contain test data.