Closed offchan42 closed 1 year ago
Hi, the 20M subset is just a random 20M subset of the full-set, nothing special. It's used for fast verification, as the full-set would require more time and resources to train.
Thanks. That arbitrariness is useful to know. Another question: is the dataset already decently shuffled? Can I train on only parts 0, 1, 2, 3 and expect them to be representative of the whole dataset, or do I need to randomly sample which parts to use? Is the need for random sampling the reason the 20M subset is not simply parts 0, 1, 2, 3, 4, ..., 19?
Those are just randomly chosen indices. I guess it would not affect the performance, but there are no experiments to prove that.
OK, thanks. I guess random sampling is good enough to make sure the subset is representative of the whole dataset, and it doesn't hurt to do it anyway.
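For anyone who wants to do the same, here is a minimal sketch of picking parts at random instead of taking the first few. The function name `sample_parts`, the part count of 32, and the seed are assumptions for illustration only and are not taken from the repository; check how many parts the full set actually has before using it.

```python
import random


def sample_parts(num_parts, k, seed=0):
    """Return k distinct part indices drawn uniformly from range(num_parts).

    num_parts and seed are hypothetical parameters; the real number of
    parts in the full set should be checked against the repository.
    """
    if not 0 <= k <= num_parts:
        raise ValueError("k must be between 0 and num_parts")
    # A private Random instance keeps the choice reproducible
    # without touching the global random state.
    rng = random.Random(seed)
    # random.sample draws without replacement, so no part repeats.
    return sorted(rng.sample(range(num_parts), k))


if __name__ == "__main__":
    # Assumed example: choose 4 parts out of a hypothetical 32.
    print(sample_parts(32, 4, seed=0))
```

Fixing the seed means the same parts are selected on every run, so the subset can be shared or reproduced later.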
I was checking out your LAION-Face repository and I'm really impressed with the amount of image-text pairs it contains. I had a quick question about the 20 million subset mentioned in the README.md file. Can you tell me more about what the subset contains and how it differs from the full-set? Also, I was curious if there were any specific criteria for selecting which images are included in the subset.
Thanks for sharing this amazing resource with the community!