How are datasets organized in .tar format?

facebookresearch / MetaCLIP

ICLR2024 Spotlight: curation/training code, metadata, distribution and pre-trained models for MetaCLIP; CVPR 2024: MoDE: CLIP Data Experts via Clustering


How are datasets organized in .tar format? #59

Open shuo-yan20 opened 4 months ago

shuo-yan20 commented 4 months ago

Thank you for your excellent work!

In metaclip/pipeline.py, I found that the function shard_text_loader parses the .tar format data, including finding the .jpeg and .json files. I want to know how these .tar files are organized, and why the .jpeg image data has already been downloaded before sub_matching.

Thanks very much!

howardhsu commented 2 months ago

It's supposed to be similar to webdataset. For full transparency, our sample dataloader reads it with the regular Python tar API. The tar files are organized as <dataset_dir>/{shard_id % 100}/{shard_id}.tar.
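For readers who want to locate a shard on disk, the path pattern above could be computed roughly like this. This is only a sketch: `shard_path` is a hypothetical helper name, not a function from the MetaCLIP repo, and it assumes `shard_id` is a non-negative integer.

```python
import os


def shard_path(dataset_dir, shard_id):
    """Build <dataset_dir>/{shard_id % 100}/{shard_id}.tar (hypothetical helper)."""
    # Shards are spread over subdirectories keyed by shard_id modulo 100.
    bucket = str(shard_id % 100)
    return os.path.join(dataset_dir, bucket, f"{shard_id}.tar")
```

With this layout, shard 1234 would sit in the `34` subdirectory of the dataset directory.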

Each tar contains files in the following order:

     uuid1.json
     uuid1.jpeg
     uuid2.json
     uuid2.jpeg
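
A minimal sketch of reading a shard with this layout using only the standard `tarfile` module is below. It is not the repo's `shard_text_loader`; `iter_samples` is a hypothetical name, and the code assumes each `.json` member holds valid JSON and is immediately followed by the `.jpeg` with the same uuid.

```python
import json
import tarfile


def iter_samples(tar_path):
    """Yield (uuid, metadata, jpeg_bytes) for each json/jpeg pair in a shard."""
    with tarfile.open(tar_path) as tar:
        pending_uuid, pending_meta = None, None
        for member in tar:
            if not member.isfile():
                continue
            # Split "uuid1.json" into ("uuid1", "json").
            stem, _, ext = member.name.rpartition(".")
            payload = tar.extractfile(member).read()
            if ext == "json":
                # Remember the metadata until the matching image arrives.
                pending_uuid, pending_meta = stem, json.loads(payload)
            elif ext == "jpeg" and stem == pending_uuid:
                yield stem, pending_meta, payload
                pending_uuid, pending_meta = None, None
```

Because the members are read in stored order, the shard is only scanned once. A `.jpeg` with no preceding `.json` of the same uuid is skipped in this sketch.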