BAAI-DCAI / Bunny

A family of lightweight multimodal models.
Apache License 2.0

A question about sampling pre-train data #10

Closed Today0223 closed 6 months ago

Today0223 commented 6 months ago

Hello, I have a question about data sampling.

According to the explanation in the technical report, during the second stage of sampling pretraining data, "sort the remaining samples by the cosine similarity between its text embedding and image embedding and keep samples ranking 40% - 60%".

Why keep the portion ranked between 40% and 60%? Shouldn't the data with higher cosine similarity between text and image embeddings be considered higher quality data?

Isaachhh commented 6 months ago

"We observe that the top 40% by CLIP score is worse than rank 15%-55%. This may be because image-text pairs with very high cosine similarity are 'cheaters': the image contains a text region that nearly duplicates the caption. Less is More: Removing Text-regions Improves CLIP Training Efficiency and Robustness and T-MARS: Improving Visual Representations by Circumventing Text Feature Learning also pay attention to this." (from the last paragraph of Part 2 here)

Here, we tried several ranking ranges; 40%-60% works relatively well.
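The ranking filter described above (sort by text-image cosine similarity, keep the 40%-60% rank band) can be sketched as follows. This is a hypothetical helper for illustration, not Bunny's actual pipeline code; the function name and band defaults are assumptions.

```python
import numpy as np

def keep_middle_band(text_emb, image_emb, lo=0.4, hi=0.6):
    """Return indices of samples whose text-image cosine similarity
    ranks in the [lo, hi) fraction, counted from the most similar.

    Hypothetical sketch: the real pipeline would use CLIP embeddings.
    """
    # Normalize so the row-wise dot product equals cosine similarity.
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    v = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    sims = (t * v).sum(axis=1)

    # Rank samples from highest similarity (rank 0) to lowest.
    order = np.argsort(-sims)

    # Keep only the middle band; this drops both the "cheater" pairs
    # at the top and the noisy, weakly aligned pairs at the bottom.
    n = len(sims)
    start, stop = int(lo * n), int(hi * n)
    return np.sort(order[start:stop])
```

For example, with 10 samples the defaults keep the two samples ranked 4th and 5th by similarity, discarding both the most and least similar pairs.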