laserwave opened this issue 4 months ago
When training with a large amount of image-caption data (including weakly annotated data), I found that CLIP-L is no longer needed: SAM alone can reach 30% on MMVet. I hope this is helpful to you.
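For readers unfamiliar with the setup, here is a minimal sketch of what "SAM alone as the vision tower" could look like in a LLaVA-style pipeline. It assumes the official `segment_anything` package and its released ViT-L checkpoint; the projector and `llm_dim` are hypothetical illustrations, not this repo's actual code.

```python
# Sketch: SAM's image encoder as the vision tower, replacing CLIP-L.
# Assumes the official `segment_anything` package; `llm_dim` and
# `projector` are illustrative, not the repo's real configuration.
import torch
import torch.nn as nn
from segment_anything import sam_model_registry

sam = sam_model_registry["vit_l"](checkpoint="sam_vit_l_0b3195.pth")
vision_tower = sam.image_encoder  # ViT; expects (B, 3, 1024, 1024) input

llm_dim = 4096                        # hypothetical LLM hidden size
projector = nn.Linear(256, llm_dim)   # SAM's neck outputs 256 channels

images = torch.randn(1, 3, 1024, 1024)  # stand-in for preprocessed images
with torch.no_grad():
    feats = vision_tower(images)         # (B, 256, 64, 64) feature map
tokens = feats.flatten(2).transpose(1, 2)  # (B, 4096, 256) visual tokens
visual_embeds = projector(tokens)          # fed to the LLM alongside text
```

Note that the 64x64 feature grid yields 4096 visual tokens per image, so in practice some pooling or token-reduction step would likely sit between the encoder and the LLM.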
If so, wouldn't it get even stronger performance with CLIP added?
Can SAM trained with only a very tiny amount of caption data still achieve good general performance?
You use negative natural images when training the new vocabulary; did you compare this against replacing them with image-caption data?