Hello,
Congratulations on a great project! I enjoyed reading your paper, where you clearly articulate the motivation behind each design choice. Your results are amazing!
Have you considered using RADIO as a vision encoder? We recently released version 2.5 of this vision foundation model. In our LLaVA 1.5 experiments, it surpasses the other vision encoders we've tried by a good margin, and we believe it would be an excellent addition to your blend of vision encoders. RADIOv2.5-L is a ViT-L/16 and is very flexible, supporting input resolutions up to 2048x2048.
You can pull RADIO using either TorchHub or HuggingFace. We believe it's easy to integrate, but if you need any help, @mranzinger and I are here to assist!
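For reference, here is a minimal sketch of what loading might look like through either route. The TorchHub repo and entry-point names (`NVlabs/RADIO`, `radio_model`), the version string `radio_v2.5-l`, and the HuggingFace repo id `nvidia/RADIO-L` are assumptions here, so please check them against the RADIO README before relying on them. The `snap_resolution` helper is purely illustrative: it assumes input sides should be multiples of the 16-pixel patch size and stay within the 2048 limit mentioned above.

```python
PATCH_SIZE = 16   # ViT-L/16 patch size, as stated in the post
MAX_SIDE = 2048   # largest supported input side, as stated in the post


def snap_resolution(height, width, patch=PATCH_SIZE, max_side=MAX_SIDE):
    """Round (height, width) down to multiples of `patch`, clamped to [patch, max_side].

    Hypothetical helper: assumes the encoder wants patch-aligned input sizes.
    """
    def snap(side):
        side = min(side, max_side)
        return max(patch, (side // patch) * patch)

    return snap(height), snap(width)


def load_radio_torchhub(version="radio_v2.5-l"):
    """Load RADIO via TorchHub. Repo, entry point, and version are assumed names."""
    import torch  # imported lazily so snap_resolution works without PyTorch

    model = torch.hub.load("NVlabs/RADIO", "radio_model", version=version, progress=True)
    return model.eval()


def load_radio_huggingface(repo_id="nvidia/RADIO-L"):
    """Load RADIO via HuggingFace. The repo id is an assumed name."""
    from transformers import AutoModel  # lazy import, same reason as above

    # trust_remote_code is needed when the model class lives in the hub repo.
    return AutoModel.from_pretrained(repo_id, trust_remote_code=True).eval()


if __name__ == "__main__":
    # Pick a patch-aligned size for an arbitrary image before resizing it.
    print(snap_resolution(1000, 3000))
```

Both loaders download weights on first use, so they are defined here but not called; only the resolution helper runs without network access.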
Thanks in advance!