When I fine-tune with LLaVA, the result is worse than with SigLIP, which is unexpected. What's more, the model does not gain any Chinese OCR ability, even when trained with Chinese TextVQA data.
Why?
In my experiment (the same data as LLaVA 1.5, the same dynamic image-splitting method as InternVL, but a different LLM), the results surpassed SigLIP on TextVQA and MME, yet fell short on GQA, MMBench CN, and MMStar.