Replace vision tower with DINOv2

TinyLLaVA / TinyLLaVA_Factory

A Framework of Small-scale Large Multimodal Models

https://arxiv.org/abs/2402.14289

Apache License 2.0

Replace vision tower with DINOv2 #68

Closed Daming-W closed 5 months ago

Daming-W commented 5 months ago

Hi team, thanks for your great work! I am trying to replace the vision tower with DINOv2, which is provided in the TinyLLaVA_Factory scripts. Has anyone evaluated its performance?

YingHuTsing commented 5 months ago

Hi. On VQA benchmarks such as SQA and GQA, DINOv2 performs slightly worse than a CLIP ViT of the same model size. However, we haven't evaluated it on tasks that require fine-grained or dense information, such as RefCOCO.

We recommend combining DINOv2 with CLIP or SigLIP: the combined visual encoder benefits from both the global, image-text-aligned information of CLIP/SigLIP and the fine-grained information of DINOv2.
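
For illustration, here is a minimal PyTorch sketch of one way such a combination could work: patch features from the two encoders are concatenated along the channel dimension and passed through an MLP connector. This is an assumption about the general approach, not the TinyLLaVA_Factory implementation. `DualVisionTower` and `DummyPatchEncoder` are hypothetical names, the dummy encoders are stand-ins for real CLIP/SigLIP and DINOv2 models, and all dimensions are arbitrary.

```python
import torch
import torch.nn as nn


class DualVisionTower(nn.Module):
    """Concatenate patch features from two vision encoders, then project them.

    Hypothetical illustration; not the TinyLLaVA_Factory implementation.
    """

    def __init__(self, clip_encoder, dino_encoder, clip_dim, dino_dim, llm_dim):
        super().__init__()
        self.clip_encoder = clip_encoder
        self.dino_encoder = dino_encoder
        # Two-layer MLP connector mapping the fused features to the LLM width.
        self.projector = nn.Sequential(
            nn.Linear(clip_dim + dino_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, images):
        clip_feats = self.clip_encoder(images)  # (batch, patches, clip_dim)
        dino_feats = self.dino_encoder(images)  # (batch, patches, dino_dim)
        if clip_feats.shape[1] != dino_feats.shape[1]:
            raise ValueError("both encoders must return the same number of patch tokens")
        # Channel-wise fusion: each patch token carries both feature sets.
        fused = torch.cat([clip_feats, dino_feats], dim=-1)
        return self.projector(fused)


class DummyPatchEncoder(nn.Module):
    """Stand-in for a real encoder: embeds non-overlapping image patches."""

    def __init__(self, patch_size, out_dim):
        super().__init__()
        self.embed = nn.Conv2d(3, out_dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, images):
        x = self.embed(images)               # (batch, out_dim, H/p, W/p)
        return x.flatten(2).transpose(1, 2)  # (batch, patches, out_dim)


# Arbitrary sizes chosen for the demo only.
tower = DualVisionTower(
    clip_encoder=DummyPatchEncoder(patch_size=14, out_dim=64),
    dino_encoder=DummyPatchEncoder(patch_size=14, out_dim=48),
    clip_dim=64,
    dino_dim=48,
    llm_dim=128,
)
images = torch.randn(2, 3, 224, 224)
tokens = tower(images)
print(tokens.shape)  # torch.Size([2, 256, 128])
```

With real encoders, the token sequences would need to be aligned before concatenation: some models prepend a class token, and different patch sizes or input resolutions produce different numbers of patches.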