Open pranavguru opened 1 month ago
webdataset is looking pretty slick.
The specification is dead simple:
https://docs.google.com/document/d/18OdLjruFNX74ILmgrdiCI9J1fQZuhzzRBCHV9URWto0/edit
The tar format ensures compatibility across platforms and allows for easy
creation, manipulation, and extraction of WebDataset files using standard tools.
The file naming conventions enable the grouping of related files into individual
data samples, identified by unique prefixes within the archive. This addresses
the "small file problem" common in deep learning, optimizing I/O and storage
utilization.
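To make the grouping concrete, here is a minimal sketch using only Python's standard `tarfile` module. It assumes the convention that everything before the first dot in a member name is the sample key and the rest is the field extension. The helper names `write_shard` and `read_shard` and the sample keys are made up for illustration; this is not the webdataset library's API, and it assumes flat keys with no dots in directory names.

```python
import io
import os
import tarfile
import tempfile
from collections import defaultdict


def write_shard(path, samples):
    """Write samples to a plain tar file.

    `samples` maps a sample key to a dict of {extension: bytes}.
    Each field becomes one tar member named "<key>.<extension>".
    """
    with tarfile.open(path, "w") as tar:
        for key, fields in samples.items():
            for ext, payload in fields.items():
                info = tarfile.TarInfo(name=f"{key}.{ext}")
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))


def read_shard(path):
    """Read a tar file and regroup members into samples by shared prefix."""
    grouped = defaultdict(dict)
    with tarfile.open(path, "r") as tar:
        for member in tar:
            if not member.isfile():
                continue
            # Assumed convention: key is the text before the first dot.
            key, _, ext = member.name.partition(".")
            grouped[key][ext] = tar.extractfile(member).read()
    return dict(grouped)


if __name__ == "__main__":
    # Hypothetical samples: a caption and a class label per key.
    samples = {
        "sample000": {"txt": b"a cat", "cls": b"0"},
        "sample001": {"txt": b"a dog", "cls": b"1"},
    }
    with tempfile.TemporaryDirectory() as tmp:
        shard = os.path.join(tmp, "example-000000.tar")
        write_shard(shard, samples)
        assert read_shard(shard) == samples
```

Because the shard is an ordinary tar archive, `tar tf example-000000.tar` would list the same members, which is the "standard tools" point above.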
Benchmarking shows WebDataset (WDS) just slightly behind TensorFlow Datasets (TFDS):
https://github.com/huggingface/pytorch-image-models/discussions/1524#discussioncomment-4008520
But handling large datasets with TFDS seems a lot more complex. Does it require using Apache Beam? https://www.tensorflow.org/datasets/beam_datasets#implementing_a_beam_dataset
Investigate and come up with a justification for the best practice for managing large datasets like v0.