LLaVA-Phi's training setup is different from ours in the following way:
Training scripts and descriptions are expected to come out this weekend, but if you have more questions or concerns, feel free to leave a comment here (or email). 🙂
Thank you for your response and the excellent open-source work. Could you please send me the pre-training and fine-tuning scripts for LLaVA-Phi under the base training recipe? You can reach me at xsl.cmd@gmail.com. Thanks.
I retrained TinyLlama and Phi-2 using the official LLaVA-1.5 pre-training and fine-tuning scripts, following the conversation mode you provided. However, I achieved TextVQA scores of only 42.8 for TinyLlama and 35.3 for Phi-2, which are lower than the scores of 45.8 and 51.4 reported in your paper.
What version of transformers did you use for training?
That is very odd.
We trained our model using transformers 4.37.2. Make sure you are using our codebase for better reproducibility.
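If it helps, here is a minimal sketch of how you could verify that pin at the top of a training entry point; the guard itself is just an illustration, not something taken from our codebase:

```python
# Illustrative version guard only; install the pinned dependency first, e.g.:
#   pip install transformers==4.37.2
import transformers

EXPECTED_VERSION = "4.37.2"  # version we trained with (see above)

if transformers.__version__ != EXPECTED_VERSION:
    raise RuntimeError(
        f"Expected transformers {EXPECTED_VERSION}, found {transformers.__version__}; "
        "results may not reproduce exactly."
    )
```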
I've emailed you the raw training scripts. Please let me know how it goes.
I obtained the TextVQA result of 51.4 reported in your paper using the provided scripts. Thank you very much. There are a few settings that differ between my pre-training/fine-tuning setup and yours, and they account for the large performance drop (36.1 vs. 51.4).
I don't understand why these three differences have such a big impact on the results. Have you ever encountered the same thing?
After experimental verification, I found that using fp16 instead of bf16 has a significant impact on Phi-2.
We observed similar results using bf16. I think it's because Phi-2's original weights are released in fp16, and converting them to bf16 causes some unexpected behaviors under the VLM framework. For example, during pretraining, we observed that it was very hard for Phi-2 to converge when using bf16.
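To make the dtype point concrete, here is a minimal sketch of loading Phi-2 in its released fp16 precision versus casting it to bf16; the checkpoint id and loading calls below are illustrative assumptions, not an excerpt from our training code:

```python
import torch
from transformers import AutoModelForCausalLM

# Phi-2's released weights are stored in fp16, so loading in fp16 keeps the
# original values unchanged ("microsoft/phi-2" is assumed to be the public
# Hugging Face checkpoint id; adjust if you use a local copy).
model_fp16 = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-2",
    torch_dtype=torch.float16,  # matches the released weight dtype
)

# Casting the same checkpoint to bf16 trades mantissa bits for dynamic range
# (bf16 keeps 8 mantissa bits vs. fp16's 10), which is one plausible reason
# for the convergence issues we saw during pretraining under the VLM setup.
model_bf16 = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-2",
    torch_dtype=torch.bfloat16,
)
```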
@xushilin1 @baichuanzhou could you share the scripts? lucamedeiros@outlook.com
Hi, I observed that the results in Fig. 7(C) were obtained from training with the LLaVA dataset using the base recipe. However, these results are notably higher than those reported in this paper (https://arxiv.org/pdf/2401.02330.pdf), and I have been unable to replicate your findings.
I am curious if there are any differences between our approaches.