Question about LAION-Face 20M subset and selection criteria

FacePerceiver / LAION-Face

The human face subset of LAION-400M for large-scale face pretraining.


Question about LAION-Face 20M subset and selection criteria #9

Closed: offchan42 closed this issue 1 year ago

offchan42 commented 1 year ago

I was checking out your LAION-Face repository and I'm really impressed with the number of image-text pairs it contains. I have a quick question about the 20 million subset mentioned in the README.md file. Can you tell me more about what the subset contains and how it differs from the full set? Also, were there any specific criteria for selecting which images are included in the subset?

Thanks for sharing this amazing resource with the community!

yinglinzheng commented 1 year ago

Hi, the 20M subset is just a random 20M subset of the full-set, nothing special. It's used for fast verification, as the full-set would require more time and resources to train.

offchan42 commented 1 year ago

Thanks. That arbitrariness is useful to know. Another question: is the dataset already decently shuffled? Can I train with only parts 0, 1, 2, 3 and expect them to be representative of the whole dataset, or do I need to randomly sample which parts to use? Is the need for random sampling the reason why the 20M subset is not simply parts 0, 1, 2, 3, 4, ..., 19?

yinglinzheng commented 1 year ago

Those are just randomly chosen indices. I guess it would not affect performance, but there are no experiments to prove that.

offchan42 commented 1 year ago

OK, thanks. I guess random sampling is good enough to make sure the selected parts are representative of the whole dataset, and it doesn't hurt to do it anyway.
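
For readers who want to pick parts at random as discussed above, here is a minimal Python sketch. It is not the maintainers' actual procedure: the function name `sample_parts`, the seed, and the part counts used in the example (50 total, 20 selected) are assumptions for illustration only, since the thread does not state how many parts the full set has.

```python
import random
from typing import List


def sample_parts(total_parts: int, n_parts: int, seed: int = 0) -> List[int]:
    """Pick n_parts distinct part indices uniformly at random.

    total_parts and n_parts are hypothetical parameters; set them to match
    the actual number of parts in the dataset you downloaded.
    """
    if not 0 <= n_parts <= total_parts:
        raise ValueError("n_parts must be between 0 and total_parts")
    # A seeded generator keeps the selection reproducible across runs.
    rng = random.Random(seed)
    # random.sample draws without replacement, so no part is chosen twice.
    return sorted(rng.sample(range(total_parts), n_parts))


if __name__ == "__main__":
    # Assumed numbers for illustration: 20 parts out of 50.
    print(sample_parts(total_parts=50, n_parts=20, seed=42))
```

Fixing the seed means the same parts are selected every time, which makes it easier to compare training runs on the same subset.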