TinyLLaVA / TinyLLaVA_Factory

A Framework of Small-scale Large Multimodal Models
https://arxiv.org/abs/2402.14289
Apache License 2.0

The result about llava-phi #9

Closed xushilin1 closed 8 months ago

xushilin1 commented 8 months ago

Hi, I observed that the results in Fig. 7(C) were obtained from training with the LLaVA dataset using the base recipe. However, these results are notably higher than those reported in this paper (https://arxiv.org/pdf/2401.02330.pdf), and I have been unable to replicate your findings.

I am curious if there are any differences between our approaches.

baichuanzhou commented 8 months ago

LLaVA-Phi's training setup differs from ours in the following ways:

  1. They used LLaVA-Instruct-150K instead of the full LLaVA-665K data during SFT.
  2. The hyper-parameters of the finetuning stage are different from ours: our batch size during SFT is 128, while theirs is 256, and we also did not apply weight decay.
  3. They finetuned Phi-2, and we did not.

Training scripts and descriptions are expected to come out this weekend, but if you have more questions or concerns, feel free to leave a comment here (or email). 🙂
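For readers trying to reproduce the base recipe before the official scripts land, here is a minimal sketch of what the SFT settings described above could look like with Hugging Face `TrainingArguments`. Only the global batch size of 128 and the absence of weight decay come from the maintainers; the per-device/accumulation split, learning rate, epoch count, and output path are assumptions.

```python
# Minimal sketch (not the authors' actual script) of the SFT hyper-parameters
# discussed above, expressed with Hugging Face TrainingArguments.
from transformers import TrainingArguments

sft_args = TrainingArguments(
    output_dir="./checkpoints/tinyllava-phi-2-sft",  # hypothetical path
    per_device_train_batch_size=16,   # 16 x 8 GPUs = 128 global batch (assumed split)
    gradient_accumulation_steps=1,
    weight_decay=0.0,                 # no weight decay during SFT (per point 2)
    learning_rate=2e-5,               # assumed, typical LLaVA-style SFT value
    num_train_epochs=1,               # assumed
)
```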

xushilin1 commented 8 months ago

Thank you for your response and the excellent open-source work. Could you please send me the pre-training and fine-tuning scripts for LLaVA-Phi under the base training recipe? You can reach me at xsl.cmd@gmail.com. Thanks.

xushilin1 commented 8 months ago

I retrained TinyLlama and Phi-2 using the official LLaVA-1.5 pre-training and fine-tuning scripts, following the conversation mode you provided. However, I achieved TextVQA scores of only 42.8 for TinyLlama and 35.3 for Phi-2, which are lower than the scores of 45.8 and 51.4 reported in your paper.

xushilin1 commented 8 months ago

What is the version of transformers you used for training?

baichuanzhou commented 8 months ago

> I retrained TinyLlama and Phi-2 using the official LLaVA-1.5 pre-training and fine-tuning scripts, following the conversation mode you provided. However, I achieved TextVQA scores of only 42.8 for TinyLlama and 35.3 for Phi-2, which are lower than the scores of 45.8 and 51.4 reported in your paper.

That is very odd.

baichuanzhou commented 8 months ago

We trained our model using transformers 4.37.2. Make sure you are using our codebase for better reproducibility.
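For anyone reproducing these runs, a quick environment sanity check (the version number is from the comment above; the check itself is just a suggestion):

```python
# Verify the installed transformers version matches the one used for training.
import transformers

assert transformers.__version__ == "4.37.2", (
    f"Expected transformers 4.37.2, found {transformers.__version__}"
)
```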

baichuanzhou commented 8 months ago

> Thank you for your response and the excellent open-source work. Could you please send me the pre-training and fine-tuning scripts for LLaVA-Phi under the base training recipe? You can reach me at xsl.cmd@gmail.com. Thanks.

I've emailed you the raw training scripts. Please let me know how it goes.

xushilin1 commented 8 months ago

Using the provided scripts, I obtained the TextVQA result of 51.4 reported in your paper. Thanks a lot. A few settings differed between my earlier runs and yours during pre-training and fine-tuning, and they account for the large performance drop (36.1 vs. 51.4):

  1. You use zero3 for both pretraining and finetuning, while I used zero2 for pretraining and zero3 for finetuning.
  2. You set tf32 to false, while I set it to true for both stages.
  3. You use fp16, while I used bf16 for both stages.

I don't understand why these three differences have such a big impact on the results. Have you ever encountered the same thing?
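To summarize the comparison, the three differences map onto launch flags roughly as follows (names follow Hugging Face Trainer / DeepSpeed conventions; the JSON file names are illustrative, not the repo's actual paths):

```python
# The two flag sets being compared in this thread.
provided_scripts = {          # reproduced 51.4 on TextVQA
    "deepspeed": "zero3.json",            # ZeRO-3 for both pretraining and finetuning
    "tf32": False,
    "fp16": True,
    "bf16": False,
}

my_original_runs = {          # gave only 36.1
    "deepspeed_pretrain": "zero2.json",   # ZeRO-2 for pretraining
    "deepspeed_finetune": "zero3.json",   # ZeRO-3 for finetuning
    "tf32": True,
    "fp16": False,
    "bf16": True,
}
```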

xushilin1 commented 8 months ago

After experimental verification, I found that using fp16 instead of bf16 has a significant impact on Phi-2.

baichuanzhou commented 8 months ago

We observed similar results when using bf16. I think it's because Phi-2's original weights are released in fp16, and converting them to bf16 causes some unexpected behavior under the VLM framework. For example, during pretraining, we observed that it was very hard for Phi-2 to converge when using bf16.
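As a practical takeaway, a minimal sketch of loading Phi-2 in the dtype its weights were released in, rather than casting to bf16 (this is standard Hugging Face usage, not the repo's exact loading code):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/phi-2"  # Phi-2 weights are distributed in fp16
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # keep fp16; casting to bf16 hurt convergence in the VLM setup above
)
```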

luca-medeiros commented 8 months ago

@xushilin1 @baichuanzhou Could you share the scripts as well? lucamedeiros@outlook.com