Unable to Reproduce Training Process

Toneyaya commented 4 months ago

Hello, thank you for your outstanding open-source work! I ran into a problem while trying to reproduce the second stage of training: the loss drops to zero within the first few iterations and stays there. This happens whether I use my own trained mm_projector.bin or the weights you released. I have followed your instructions precisely.

(If anyone has successfully reproduced the training process, let's discuss this issue together.)

jpthu17 commented 4 months ago

A loss of 0 may be caused by numerical overflow. What kind of GPUs are you using? You can check whether your GPU supports bf16 by calling torch.cuda.is_bf16_supported(). If it does not support bf16, that may be what is causing the overflow.
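For anyone who wants to run that check quickly, here is a minimal sketch. It only relies on torch.cuda.is_available() and torch.cuda.is_bf16_supported(); the helper names choose_precision and detect_precision are hypothetical and not part of this repo, and falling back to fp32 when bf16 is missing is just a conservative assumption, not the project's recommended setting.

```python
import importlib.util


def choose_precision(cuda_available: bool, bf16_supported: bool) -> str:
    """Hypothetical helper: pick a precision name from two capability flags.

    Falls back to fp32 (an assumption, chosen as the conservative option)
    whenever bf16 cannot be used.
    """
    if cuda_available and bf16_supported:
        return "bf16"
    return "fp32"


def detect_precision() -> str:
    """Query PyTorch for bf16 support, if PyTorch is installed."""
    if importlib.util.find_spec("torch") is None:
        # PyTorch is not installed, so nothing can be queried.
        return "fp32"
    import torch

    cuda = torch.cuda.is_available()
    # Only ask about bf16 when a CUDA device is actually present.
    bf16 = cuda and torch.cuda.is_bf16_supported()
    return choose_precision(cuda, bf16)


if __name__ == "__main__":
    print("suggested precision:", detect_precision())
```

If this prints fp32 on your training machine, the bf16 flag in the training script is a likely suspect for the zero loss.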

annopackage commented 1 month ago

Same problem here. Have you figured it out? Update: solved by following https://github.com/haotian-liu/LLaVA/issues/1231. However, training with DeepSpeed ZeRO-3 hangs.

Toneyaya commented 1 month ago

Yes. Thanks for your help!

PKU-YuanGroup / Chat-UniVi

Unable to Reproduce Training Process #46