BAAI-DCAI / Bunny

A family of lightweight multimodal models.
Apache License 2.0

A question about sampling pre-train data #10

Closed Today0223 closed 6 months ago

Today0223 commented 6 months ago

Hello, I have a question about data sampling.

According to the explanation in the technical report, during the second stage of sampling pretraining data, "sort the remaining samples by the cosine similarity between its text embedding and image embedding and keep samples ranking 40% - 60%".

Why keep the portion ranked between 40% and 60%? Shouldn't the data with higher cosine similarity between text and image embeddings be considered higher quality data?

Isaachhh commented 6 months ago

"We observe that the top 40% by CLIP score is worse than rank 15%-55%. This may be because image-text pairs with very high cosine similarity are 'cheaters': the image contains a text region that nearly duplicates the caption. Less is More: Removing Text-regions Improves CLIP Training Efficiency and Robustness and T-MARS: Improving Visual Representations by Circumventing Text Feature Learning also pay attention to this." (from the last paragraph of Part 2 here)

Here, we tried several ranking ranges; 40%-60% works relatively well.
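The ranking filter described above (sort by text-image cosine similarity, keep the 40%-60% rank band) can be sketched as follows. This is a hypothetical helper for illustration, not Bunny's actual pipeline code; the function name and band defaults are assumptions.

```python
import numpy as np

def keep_middle_band(text_emb, image_emb, lo=0.4, hi=0.6):
    """Return indices of samples whose text-image cosine similarity
    ranks in the [lo, hi) fraction, counted from the most similar.

    Hypothetical sketch: the real pipeline would use CLIP embeddings.
    """
    # Normalize so the row-wise dot product equals cosine similarity.
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    v = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    sims = (t * v).sum(axis=1)

    # Rank samples from highest similarity (rank 0) to lowest.
    order = np.argsort(-sims)

    # Keep only the middle band; this drops both the "cheater" pairs
    # at the top and the noisy, weakly aligned pairs at the bottom.
    n = len(sims)
    start, stop = int(lo * n), int(hi * n)
    return np.sort(order[start:stop])
```

For example, with 10 samples the defaults keep the two samples ranked 4th and 5th by similarity, discarding both the most and least similar pairs.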