TinyLLaVA / TinyLLaVA_Factory

A Framework of Small-scale Large Multimodal Models
https://arxiv.org/abs/2402.14289
Apache License 2.0

Reproduced results #82

Open Fantasy1120 opened 1 month ago

Fantasy1120 commented 1 month ago

I tried to reproduce the results under the base recipe. I basically match the paper's numbers on VQAv2, GQA, ScienceQA, and POPE, but there is almost a 1% gap on TextVQA, MMMU, and MM-Vet, and the gap on MME seems to be larger. Is this gap acceptable? What could potentially cause it?

YingHuTsing commented 1 month ago

Hi, I think this is acceptable. A different number of GPUs requires a different gradient_accumulation_steps setting, and different GPU types introduce randomness. Btw, the phi-2-siglip-base performance we listed here was trained on 8 A100-40G GPUs.
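The coupling between GPU count and gradient_accumulation_steps can be made concrete: to keep the effective (global) batch size fixed when the GPU count changes, the accumulation steps must scale inversely with the number of devices. A minimal sketch, with hypothetical batch sizes (not the repo's actual recipe values):

```python
def grad_accum_steps(global_batch: int, per_device_batch: int, num_gpus: int) -> int:
    """Accumulation steps needed so that
    global_batch == per_device_batch * num_gpus * grad_accum_steps.
    Numbers below are illustrative, not taken from the TinyLLaVA recipe."""
    per_step = per_device_batch * num_gpus
    assert global_batch % per_step == 0, "global batch must divide evenly"
    return global_batch // per_step

# e.g. a hypothetical global batch of 256 with per-device batch 16:
print(grad_accum_steps(256, 16, 8))  # 8 GPUs -> 2 accumulation steps
print(grad_accum_steps(256, 16, 4))  # 4 GPUs -> 4 accumulation steps
```

With fewer GPUs the optimizer still sees the same global batch, but gradients are averaged over more micro-batches per step, which (together with data-order and kernel nondeterminism) can shift results slightly.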

Fantasy1120 commented 1 month ago

> Hi, I think this is acceptable. A different number of GPUs requires a different gradient_accumulation_steps setting, and different GPU types introduce randomness. Btw, the phi-2-siglip-base performance we listed here was trained on 8 A100-40G GPUs.

Thanks for your reply. I see that the training script uses fp16 by default, but A100 supports bf16. May I ask whether you used bf16 in your training?

YingHuTsing commented 1 month ago

No, we haven't tried bf16 thoroughly, but we encourage the open-source community to give it a try. We can update the performance table accordingly and would welcome open-source contributors to this code repository.
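The practical difference behind the fp16-vs-bf16 question is dynamic range: bfloat16 keeps float32's 8-bit exponent (so large activations and loss-scale values rarely overflow), while float16 overflows above 65504. A small stdlib-only sketch that emulates bf16 by truncating a float32 to its top 16 bits; `to_bf16` and `fits_fp16` are illustrative helpers, not part of the TinyLLaVA codebase:

```python
import struct

def to_bf16(x: float) -> float:
    # bfloat16 is the top 16 bits of an IEEE-754 float32: same 8-bit
    # exponent (same dynamic range), only 7 mantissa bits of precision.
    bits = struct.unpack('>I', struct.pack('>f', x))[0]
    return struct.unpack('>f', struct.pack('>I', bits & 0xFFFF0000))[0]

def fits_fp16(x: float) -> bool:
    # float16's largest finite value is 65504; struct's 'e' (half) format
    # raises OverflowError for magnitudes beyond that.
    try:
        struct.pack('e', x)
        return True
    except OverflowError:
        return False

print(fits_fp16(60000.0))          # True: within float16 range
print(fits_fp16(70000.0))          # False: overflows float16
print(to_bf16(70000.0))            # finite, within ~0.8% of 70000
```

This is why fp16 training typically needs loss scaling to avoid overflow/underflow, whereas bf16 usually does not, at the cost of coarser precision per value.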