ManifoldRG / MultiNet

MIT License

Best practices for managing v0 #61

Open pranavguru opened 1 month ago

pranavguru commented 1 month ago

Investigate best practices for managing large datasets like v0, and come up with a justification for the approach we choose.

eihli commented 3 weeks ago

webdataset is looking pretty slick.

The specification is dead simple:

https://docs.google.com/document/d/18OdLjruFNX74ILmgrdiCI9J1fQZuhzzRBCHV9URWto0/edit

The tar format ensures compatibility across platforms and allows for easy
creation, manipulation, and extraction of WebDataset files using standard tools.
The file naming conventions enable the grouping of related files into individual
data samples, identified by unique prefixes within the archive. This addresses
the "small file problem" common in deep learning, optimizing I/O and storage
utilization.
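
To make the naming convention concrete, here is a minimal sketch using only Python's standard `tarfile` module rather than the webdataset library itself. The helper names (`write_shard`, `read_shard`) and the `episode_*` keys are hypothetical, chosen for illustration; the grouping rule (split each file's basename at its first dot, so files sharing a prefix form one sample) is my reading of the spec quoted above.

```python
import io
import json
import tarfile


def write_shard(fileobj, samples):
    """Write samples to a tar archive. Each sample is (key, {extension: bytes})."""
    with tarfile.open(fileobj=fileobj, mode="w") as tar:
        for key, fields in samples:
            for ext, payload in fields.items():
                # Files of one sample share the same prefix: "<key>.<ext>"
                info = tarfile.TarInfo(name=f"{key}.{ext}")
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))


def read_shard(fileobj):
    """Group archive members back into samples by their shared key prefix."""
    grouped = {}
    with tarfile.open(fileobj=fileobj, mode="r") as tar:
        for member in tar:
            if not member.isfile():
                continue
            # Split the basename at its first dot:
            # "dir/0001.cls.txt" -> key "dir/0001", extension "cls.txt"
            dirname, _, basename = member.name.rpartition("/")
            stem, _, ext = basename.partition(".")
            key = f"{dirname}/{stem}" if dirname else stem
            grouped.setdefault(key, {})[ext] = tar.extractfile(member).read()
    return grouped


# Hypothetical samples: a JSON record plus a text instruction per episode.
samples = [
    ("episode_0000", {"json": json.dumps({"action": [0, 1]}).encode(),
                      "txt": b"pick up the block"}),
    ("episode_0001", {"json": json.dumps({"action": [1, 0]}).encode(),
                      "txt": b"open the drawer"}),
]

buffer = io.BytesIO()          # stands in for a shard file on disk
write_shard(buffer, samples)
buffer.seek(0)
restored = read_shard(buffer)
```

Since the result is a plain tar file, the same shard could also be inspected or unpacked with the standard `tar` command-line tool, which is the portability point the spec makes.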

Benchmarking shows WebDataset (WDS) just slightly behind TensorFlow Datasets (TFDS).

https://github.com/huggingface/pytorch-image-models/discussions/1524#discussioncomment-4008520

But handling large datasets with TFDS seems a lot more complex. Does it require using Apache Beam? https://www.tensorflow.org/datasets/beam_datasets#implementing_a_beam_dataset