Ucas-HaoranWei / Vary

[ECCV2024] Official code implementation of Vary: Scaling Up the Vision Vocabulary of Large Vision Language Models.

negative nature images #73

Open laserwave opened 4 months ago

laserwave commented 4 months ago

You use negative natural images when training the new vocabulary. Did you compare this against replacing them with image-caption data?

Ucas-HaoranWei commented 4 months ago

When using a large amount of image-caption data (including weakly annotated data), I found that CLIP-L is no longer needed: a single SAM encoder can reach about 30% on MMVet. I hope this is helpful to you.
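The two setups being compared, padding the vocabulary-training set with negative natural images versus with image-caption pairs, can be sketched as a simple data-mixing step. This is an illustrative sketch only: the function name, the 1:2 extra-to-document ratio, and the placeholder sample IDs are assumptions, not the repository's actual pipeline.

```python
import random

def build_training_mix(document_samples, negative_samples, caption_samples,
                       use_captions=False, extra_ratio=0.5, seed=0):
    """Compose a vocabulary-training set (illustrative sketch).

    Pads the document samples either with negative natural images
    (the setup discussed above) or with image-caption pairs (the
    suggested alternative). Ratio and sampling scheme are hypothetical.
    """
    rng = random.Random(seed)
    extras = caption_samples if use_captions else negative_samples
    n_extra = int(len(document_samples) * extra_ratio)
    mix = list(document_samples) + [rng.choice(extras) for _ in range(n_extra)]
    rng.shuffle(mix)
    return mix

# Toy usage with placeholder sample IDs
docs = [f"doc_{i}" for i in range(4)]
negs = [f"nat_{i}" for i in range(10)]
caps = [f"cap_{i}" for i in range(10)]

mix_neg = build_training_mix(docs, negs, caps, use_captions=False)
mix_cap = build_training_mix(docs, negs, caps, use_captions=True)
```

Swapping `use_captions` flips which extra data the vocabulary network sees, which is essentially the ablation the question above asks about.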

lucasjinreal commented 3 months ago

If so, wouldn't performance be even stronger with CLIP added?

And does a SAM encoder trained with Vary-tiny plus caption data achieve good general performance?