Open shuo-yan20 opened 4 months ago
The format is meant to be similar to WebDataset.
For full transparency, our sample dataloader reads it with Python's regular tar API. The tar files are organized as `<dataset_dir>/{shard_id % 100}/{shard_id}.tar`.
Each tar contains files in the following order:
uuid1.json
uuid1.jpeg
uuid2.json
uuid2.jpeg
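
For illustration, here is a minimal sketch of how a shard laid out this way could be read with Python's standard `tarfile` module. It is not the repository's `shard_text_loader`. The helper names `shard_path` and `iter_shard` are hypothetical, and the sketch assumes each `.json` entry is immediately followed by the `.jpeg` with the same uuid, as in the listing above.

```python
import json
import os
import tarfile


def shard_path(dataset_dir, shard_id):
    # Layout described above: <dataset_dir>/{shard_id % 100}/{shard_id}.tar
    return os.path.join(dataset_dir, str(shard_id % 100), f"{shard_id}.tar")


def iter_shard(tar_path):
    """Yield (uuid, metadata, jpeg_bytes) for each json/jpeg pair in a shard."""
    with tarfile.open(tar_path, "r") as tar:
        pending = None  # (uuid, metadata) waiting for its matching .jpeg
        for member in tar:
            if not member.isfile():
                continue
            stem, ext = os.path.splitext(member.name)
            data = tar.extractfile(member).read()
            if ext == ".json":
                pending = (stem, json.loads(data))
            elif ext == ".jpeg" and pending is not None and pending[0] == stem:
                yield stem, pending[1], data
                pending = None
```

A caller that only needs the text metadata could iterate `iter_shard(shard_path(dataset_dir, shard_id))` and ignore the image bytes.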
Thank you for your excellent work!
In `metaclip/pipeline.py`, I found that the function `shard_text_loader` parses `.tar` format data, including looking up the `.jpeg` and `.json` files. I would like to know how these `.tar` files are organized, and why the `.jpeg` image data has already been downloaded before sub_matching. Thanks very much!