Closed by Daming-W 5 months ago
Hi. On VQA benchmarks such as SQA and GQA, DINOv2 performs slightly worse than a CLIP ViT of the same model size. However, we haven't evaluated it on tasks that require fine-grained/dense information, such as RefCOCO.
We recommend combining DINOv2 with CLIP/SigLIP, because the combined visual encoder can take advantage of both the global, image-text-aligned information from CLIP/SigLIP and the fine-grained information from DINOv2.
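To make the idea of a combined visual encoder concrete, here is a minimal toy sketch of one common way to fuse two encoders: concatenating their per-patch features along the channel dimension. This is only an illustration under stated assumptions, not necessarily how tinyllava-factory implements it. `fake_encoder` and `concat_patch_features` are hypothetical names, the features are random placeholders rather than real CLIP/SigLIP or DINOv2 outputs, and both encoders are assumed to produce the same number of patch tokens.

```python
import random


def fake_encoder(num_patches, dim, seed):
    """Hypothetical stand-in for a vision encoder.

    Returns one random feature vector per image patch. A real setup would
    run CLIP/SigLIP or DINOv2 here instead.
    """
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(num_patches)]


def concat_patch_features(feats_a, feats_b):
    """Concatenate two per-patch feature lists along the channel axis."""
    if len(feats_a) != len(feats_b):
        # Assumption: both encoders yield the same number of patch tokens.
        raise ValueError("both encoders must yield the same number of patch tokens")
    return [a + b for a, b in zip(feats_a, feats_b)]


# Toy sizes: 4 patches, 3 "CLIP-like" channels and 5 "DINOv2-like" channels.
clip_like = fake_encoder(num_patches=4, dim=3, seed=0)
dino_like = fake_encoder(num_patches=4, dim=5, seed=1)

combined = concat_patch_features(clip_like, dino_like)
print(len(combined), len(combined[0]))  # 4 8
```

Each combined patch token carries both feature sets side by side, so the connector that follows would need an input width equal to the sum of the two encoder widths. If the two encoders use different patch grids, the token counts would have to be aligned (for example by interpolation) before concatenating.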
Hi team, thanks for your great work! I am trying to replace the vision tower with DINOv2, which is provided in the tinyllava-factory scripts. Has anyone evaluated its performance?