OpenGVLab / InternVL

[CVPR 2024 Oral] InternVL Family: A Pioneering Open-Source Alternative to GPT-4o. An open-source multimodal chat model approaching the performance of GPT-4o.
https://internvl.readthedocs.io/en/latest/
MIT License

Fine-tuning with the LLaVA code: the loss decreases abnormally #20

Closed · binwang777 closed this issue 4 months ago

binwang777 commented 8 months ago

I used the code you provided to train LLaVA based on InternViT-6B. With the script you provided, the first (pre-training) stage runs normally, but when I use the fine-tuning script, the loss behaves strangely.

As shown in the screenshot below; I have not modified any of the LLaVA code:

[image]

I debugged EVA_clip_vit and found the problem: it happens with ZeRO-2. After switching to ZeRO-3, the loss was normal. However, InternViT-6B runs into the following problems under ZeRO-3:

[image]
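For anyone debugging a similar "loss stuck at 0.0" symptom under ZeRO, one way to narrow it down is to check whether the vision encoder's parameters actually receive non-zero gradients after the backward pass. The sketch below is not from this thread: it assumes a DeepSpeed build that exposes `deepspeed.utils.safe_get_full_grad` and a model whose vision-tower parameter names start with `vision_model`; both names are assumptions to adapt to your setup.

```python
import torch
from deepspeed.utils import safe_get_full_grad  # assumes a DeepSpeed release that provides this helper

def check_vision_grads(engine, prefix="vision_model"):
    """Call after engine.backward(loss) and before engine.step().

    Prints a warning for every trainable vision-encoder parameter whose
    (gathered) gradient is missing or exactly zero, which would explain a
    training loss that never moves. `prefix` is a hypothetical module name;
    adjust it to the actual vision-tower path in your model.
    """
    for name, param in engine.module.named_parameters():
        if not name.startswith(prefix) or not param.requires_grad:
            continue
        grad = safe_get_full_grad(param)  # gathers the sharded gradient under ZeRO-2/3
        if grad is None or not torch.any(grad != 0):
            print(f"[warn] zero or missing gradient for {name}")
```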

czczup commented 8 months ago

Hello, I'm sorry. I may have introduced some new bugs while reorganizing the code. I am currently checking and will fix it as soon as possible.

binwang777 commented 8 months ago

Hi, regarding the problem of the ZeRO-2 training loss being 0.0, I have fixed it on my side: after switching the DeepSpeed version to 0.9.5, the loss converges normally. However, there are still problems with ZeRO-3 support, and the reason seems to be here:

[image]
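A tiny start-up check can save a long run if the environment drifts from the setup reported to work here. This is just a sketch; the 0.9.5 pin comes from the comment above, not from an official requirements file.

```python
from importlib.metadata import version

# 0.9.5 is the DeepSpeed version reported to converge in this thread; treat it
# as a working assumption rather than an official requirement.
installed = version("deepspeed")
if installed != "0.9.5":
    print(f"[warn] deepspeed=={installed} installed; the ZeRO-2 loss was only "
          "observed to converge with 0.9.5 in this issue.")
```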

czczup commented 8 months ago

> Hi, regarding the problem of the ZeRO-2 training loss being 0.0, I have fixed it on my side: after switching the DeepSpeed version to 0.9.5, the loss converges normally. However, there are still problems with ZeRO-3 support, and the reason seems to be here:
>
> [image]

OK, I get it. Thanks for your feedback.

czczup commented 8 months ago

With DeepSpeed ZeRO-3, it seems that resizing the position encodings on the fly is not feasible. I am looking for a solution; I used to pre-resize them and save them in advance, but I am not fond of doing it that way.
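One way to sidestep the ZeRO-3 limitation is the pre-resize approach mentioned above: interpolate the position embedding in the checkpoint's state dict before the model is wrapped and partitioned by DeepSpeed. The sketch below assumes a standard ViT layout of `(1, 1 + grid*grid, dim)` with a leading [CLS] token; the checkpoint key is hypothetical and must be adapted to the actual InternViT-6B naming.

```python
import torch
import torch.nn.functional as F

def pre_resize_pos_embed(state_dict, key, new_grid):
    """Interpolate a ViT position embedding before DeepSpeed partitions it.

    Assumes state_dict[key] has shape (1, 1 + old_grid**2, dim) with the
    [CLS] token first; `key` is a placeholder, not the real InternViT name.
    """
    pe = state_dict[key]
    cls_tok, grid_tok = pe[:, :1], pe[:, 1:]
    old_grid = int(grid_tok.shape[1] ** 0.5)
    dim = grid_tok.shape[-1]
    # (1, N, dim) -> (1, dim, H, W) so the grid can be interpolated spatially.
    grid_tok = grid_tok.reshape(1, old_grid, old_grid, dim).permute(0, 3, 1, 2)
    grid_tok = F.interpolate(grid_tok.float(), size=(new_grid, new_grid),
                             mode="bicubic", align_corners=False)
    grid_tok = grid_tok.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, dim)
    state_dict[key] = torch.cat([cls_tok, grid_tok.to(pe.dtype)], dim=1)
    return state_dict
```

The resized checkpoint can then be loaded normally before `deepspeed.initialize`, so ZeRO-3 only ever sees a parameter of its final shape.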

czczup commented 8 months ago

This bug should now be fixed. I have tried both pre-training and fine-tuning, and the loss curves are normal. The benchmark results of the newly trained models are also normal.

binwang777 commented 8 months ago

Hi, I have trained based on Vicuna-7B using the updated code. This time I only changed the dataset loading and the weight loading, but my results are quite different from what you posted. Is there still something wrong in the code?

[image]

I suspected that my flash_attn not being version 0.2.8 was causing the problem, so I tested the weights you provided; the results are as follows:

[image]

These results fluctuate slightly, but are still reasonable.
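If a mismatched flash_attn build is the suspect, a one-line version print before training makes the comparison with the reference setup (0.2.8, per the comment above) unambiguous. This assumes the installed package exposes `flash_attn.__version__`.

```python
# Pre-flight check (a sketch): 0.2.8 is the version suspected above, not a hard requirement.
try:
    import flash_attn
    print("flash-attn:", getattr(flash_attn, "__version__", "unknown"))
except ImportError:
    print("flash-attn is not installed")
```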