Why use InternVL2 as the caption model?

VectorSpaceLab / OmniGen

OmniGen: Unified Image Generation. https://arxiv.org/pdf/2409.11340

MIT License


Why use InternVL2 as the caption model? #126

Closed: JoshonSmith closed this issue 1 week ago

JoshonSmith commented 1 week ago

Great work! Why did you use InternVL2 as the caption model? Did InternVL2 perform best in your experiments?

staoxiao commented 1 week ago

Thanks for your interest in our work! When we started this project, InternVL was one of the top-ranked models on multimodal understanding benchmarks, so we chose it.