cambrian-mllm / cambrian

Cambrian-1 is a family of multimodal LLMs with a vision-centric design.
https://cambrian-mllm.github.io/
Apache License 2.0

RADIOv2.5 as vision encoder #65

Open gheinrich opened 4 months ago

gheinrich commented 4 months ago

Hello,

Congratulations on a great project! I enjoyed reading your paper, where you clearly articulate the motivation behind each design choice. Your results are amazing!

Have you considered using RADIO as a vision encoder? We recently released version 2.5 of this vision foundation model, and in our LLaVA 1.5 experiments it surpasses the other vision encoders we've tried by a good margin. We believe that RADIO would be an excellent addition to your blend of vision encoders. RADIOv2.5-L is a ViT-L/16 and is very flexible, supporting input resolutions up to 2048x2048.

You can pull RADIO using either TorchHub or HuggingFace. We believe it's easy to integrate, but if you need any help, @mranzinger and I are here to assist!
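For reference, a minimal sketch of the TorchHub route might look like the following. The hub repository `NVlabs/RADIO`, the entry point `radio_model`, the version string `radio_v2.5-l`, and the `(summary, spatial_features)` return value are assumptions based on the public RADIO repository, so please check them against the current README. `snap_resolution` is a hypothetical helper, not part of RADIO; it only reflects the ViT-L/16 patch size and the 2048x2048 limit mentioned above.

```python
PATCH_SIZE = 16   # RADIOv2.5-L is a ViT-L/16
MAX_SIDE = 2048   # largest input side mentioned above


def snap_resolution(height, width, patch=PATCH_SIZE, max_side=MAX_SIDE):
    """Hypothetical helper: clamp each side to [patch, max_side] and
    round it down to a multiple of the patch size."""
    def snap(value):
        value = max(patch, min(value, max_side))
        return (value // patch) * patch
    return snap(height), snap(width)


def load_radio(version="radio_v2.5-l"):
    """Load RADIO through TorchHub (downloads weights on first use).

    The repo name, entry point, and version string are assumptions.
    """
    import torch  # imported lazily so snap_resolution works without PyTorch
    model = torch.hub.load("NVlabs/RADIO", "radio_model",
                           version=version, progress=True)
    return model.eval()


def run_example(height=1000, width=768):
    """Encode one random image; assumes inputs are RGB values in [0, 1]."""
    import torch
    model = load_radio()
    h, w = snap_resolution(height, width)
    image = torch.rand(1, 3, h, w)  # stand-in for a real image tensor
    with torch.no_grad():
        # Assumed return value: a global summary and per-patch features.
        summary, spatial_features = model(image)
    return summary.shape, spatial_features.shape
```

Calling `run_example()` needs PyTorch and network access for the first download. The spatial features are the part that would presumably be passed to a multimodal projector alongside the other encoders.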

Thanks in advance!