OpenGVLab / Ask-Anything

[CVPR2024 Highlight][VideoChatGPT] ChatGPT with video understanding! And many more supported LMs such as miniGPT4, StableLM, and MOSS.
https://vchat.opengvlab.com/

nan loss for stage3 training of videochat2_mistral #190

Closed LiJiaqi96 closed 3 weeks ago

LiJiaqi96 commented 3 weeks ago

Hi, I tried to run the stage 3 training script of videochat2_mistral, but I got nan loss after the first iteration. I tried using a smaller lr, but the loss remains nan.

2024-06-05T14:28:35 | utils.basic_utils: Train Epoch: [0]  [     0/153075]  eta: 6 days, 4:16:39  lr: 0.000003  image-loss: No data  video-loss: 3.1493  time: 3.4872  data: 0.0566  max mem: 62365 res mem: 75028
2024-06-05T14:28:51 | utils.basic_utils: Train Epoch: [0]  [    10/153075]  eta: 3 days, 5:05:00  lr: 0.000003  image-loss: nan  video-loss: nan  time: 1.8130  data: 0.0076  max mem: 66172 res mem: 75440
2024-06-05T14:29:08 | utils.basic_utils: Train Epoch: [0]  [    20/153075]  eta: 3 days, 1:27:56  lr: 0.000003  image-loss: nan  video-loss: nan  time: 1.6400  data: 0.0018  max mem: 66173 res mem: 75440

Do you have any idea about this issue? Thanks!

Andy1621 commented 3 weeks ago

Interesting. It runs normally for me. Could you try using bf16?

LiJiaqi96 commented 3 weeks ago

Thanks for your timely reply. How should I change to bf16? Is it correct to change the following places?

Andy1621 commented 3 weeks ago

  1. Change lines 128 and 135 in videochat2_it_mistral.py to torch_dtype=torch.bfloat16.
  2. Change with torch.cuda.amp.autocast(enabled=config.fp16): to with torch.cuda.amp.autocast(enabled=config.fp16, dtype=torch.bfloat16): in train_pt.py or train_it.py (a sketch of both changes follows below).
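
A minimal sketch of those two changes, with a tiny stand-in model instead of the real VideoChat2 code (the exact contents of lines 128/135 are not shown in this thread):

```python
# Minimal sketch of the bf16 switch; the Linear layer is only a stand-in
# for the real model, and the loop body is illustrative.
import torch
import torch.nn as nn

# (1) videochat2_it_mistral.py: load the LLM weights in bfloat16, e.g.
#     AutoModelForCausalLM.from_pretrained(..., torch_dtype=torch.bfloat16)

# (2) train_pt.py / train_it.py: ask autocast for bf16 explicitly.
model = nn.Linear(16, 16).cuda()                 # stand-in model
x = torch.randn(4, 16, device="cuda")

with torch.cuda.amp.autocast(enabled=True, dtype=torch.bfloat16):
    out = model(x)                               # matmuls run in bf16
    loss = out.float().mean()

loss.backward()
```
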
LiJiaqi96 commented 3 weeks ago

Ok, many thanks, let me have a try.
BTW, what is the use of fp16=True in config_7b_stage3.py? In the experiment mentioned above, setting fp16=True resulted in an error, which I worked around by changing it to fp16=False. What value should I set here?

Andy1621 commented 3 weeks ago

fp16=True enables mixed precision via with torch.cuda.amp.autocast(enabled=config.fp16).
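
In other words, the flag just switches the autocast context on or off. A minimal sketch of that pattern (illustrative only; the Config class and tiny model are stand-ins, and the repo's real training loop does more):

```python
# Sketch of how a config.fp16 flag typically gates mixed precision in one step.
import torch
import torch.nn as nn

class Config:
    fp16 = True                                   # the config_7b_stage3.py flag

config = Config()
model = nn.Linear(16, 1).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler(enabled=config.fp16)   # loss scaling for fp16

x = torch.randn(8, 16, device="cuda")
with torch.cuda.amp.autocast(enabled=config.fp16):         # fp16=False -> plain fp32
    loss = model(x).mean()

scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
```
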

LiJiaqi96 commented 3 weeks ago

Using bfloat16 solved my problem, thanks!
Is it correct to set fp16=False when using bfloat16?

Andy1621 commented 3 weeks ago

It may not be correct to set fp16=False, since that disables mixed-precision training and requires more GPU memory.

However, both bf16 and fp32 give stable training, so it's okay if you have enough GPU memory.

LiJiaqi96 commented 3 weeks ago

Thanks for clarifying that for me! The code throws another error if I set fp16=True together with bfloat16:

RuntimeError: "_amp_foreach_non_finite_check_and_unscale_cuda" not implemented for 'BFloat16'

Is it the same case on your side?

Andy1621 commented 3 weeks ago

Can you provide the full log? Mixed-precision training with bf16 runs normally for me.
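
For context, one common cause of that _amp_foreach_non_finite_check_and_unscale_cuda error in general PyTorch AMP code (not necessarily what happened here) is letting the GradScaler unscale gradients that are themselves stored in bfloat16. Since bf16 has the same exponent range as fp32, loss scaling is usually unnecessary and the scaler can simply be disabled. A hedged sketch, not the repo's exact setup:

```python
# Sketch: bf16 autocast with the GradScaler disabled (bf16 rarely needs loss
# scaling). Keeping the parameters in fp32 also keeps gradients in fp32.
import torch
import torch.nn as nn

model = nn.Linear(16, 1).cuda()                      # params stay in fp32
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler(enabled=False)    # no-op scaler for bf16

x = torch.randn(8, 16, device="cuda")
with torch.cuda.amp.autocast(enabled=True, dtype=torch.bfloat16):
    loss = model(x).mean()

scaler.scale(loss).backward()                        # passes through unchanged
scaler.step(optimizer)
scaler.update()
```
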

LiJiaqi96 commented 3 weeks ago

train_log.txt: please refer to the attached log. It might not be easy to read due to the use of DDP.

Andy1621 commented 3 weeks ago

Can you try adding model.bfloat16() in setup_model() in share_utils.py? Please refer to the code here, since it works for me.
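
A rough sketch of that suggestion, with setup_model() reduced to just the relevant cast (the real function in share_utils.py does much more, and a small module stands in for the real model):

```python
# Illustrative only: cast the whole model to bfloat16 inside setup_model().
import torch
import torch.nn as nn

def setup_model(model: nn.Module) -> nn.Module:
    model = model.cuda()
    model = model.bfloat16()      # casts all parameters and buffers to bf16
    return model

model = setup_model(nn.Linear(16, 16))
print(next(model.parameters()).dtype)   # torch.bfloat16
```
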

LiJiaqi96 commented 3 weeks ago

Thanks for your suggestions. I tried that, but it didn't work.
Finally, I found that the versions of the peft and transformers packages had been changed by installing other packages. Using the correct package versions solved all the errors mentioned above.
Sorry for the effort and time you spent on my issue, and thank you so much for your help.
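
If it helps anyone hitting the same problem, a quick way to confirm which peft / transformers versions are actually installed (the versions the repo expects are whatever its requirements pin; they are not listed in this thread):

```python
# Print the installed versions of the two packages that caused the trouble here.
import peft
import transformers

print("peft:", peft.__version__)
print("transformers:", transformers.__version__)
```
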

BTW, I tested the memory usage with fp16=True and fp16=False and found that the GPU memory usage is similar. Is it better to keep fp16=True while training?

Andy1621 commented 3 weeks ago

Thanks for trying that! I think both are okay, because the model calls with self.maybe_autocast(): in its forward function for mixed-precision training.
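
The maybe_autocast() helper mentioned there generally follows the pattern below: enable autocast only when the model sits on a GPU, otherwise fall back to a no-op context. This is a sketch under that assumption; the repo's actual implementation and default dtype may differ.

```python
# Sketch of the maybe_autocast pattern: autocast on CUDA, no-op context on CPU.
# TinyModel is a stand-in; the bf16 default dtype is an assumption.
import contextlib
import torch
import torch.nn as nn

class TinyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(16, 16)

    def maybe_autocast(self, dtype=torch.bfloat16):
        if next(self.parameters()).is_cuda:
            return torch.cuda.amp.autocast(dtype=dtype)
        return contextlib.nullcontext()

    def forward(self, x):
        with self.maybe_autocast():
            return self.proj(x)

model = TinyModel().cuda()
out = model(torch.randn(2, 16, device="cuda"))
print(out.dtype)   # torch.bfloat16 under autocast
```
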

LiJiaqi96 commented 3 weeks ago

Thanks for your suggestions :)