[Closed] Toneyaya closed this issue 3 months ago
A loss of 0 may be caused by numerical overflow. What kind of GPU do you use? You can confirm whether your GPU supports bf16 with `torch.cuda.is_bf16_supported()`.
If the GPU does not support bf16 and you fall back to fp16, value overflow can occur.
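A minimal sketch of the check suggested above; it only reports whether the current CUDA device supports bfloat16 (the device name is printed purely for context):

```python
import torch

# If bf16 is unsupported, training typically falls back to fp16,
# whose narrower dynamic range can overflow and zero out the loss.
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))
    print("bf16 supported:", torch.cuda.is_bf16_supported())
else:
    print("No CUDA device detected.")
```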
Same problem here — have you figured it out? Solved by: https://github.com/haotian-liu/LLaVA/issues/1231. However, ZeRO-3 with DeepSpeed would hang.
Yes. Thanks for your help!
Hello, thank you for your outstanding open-source work! I encountered a problem while attempting to reproduce the second stage of training: the loss drops to zero within the first few iterations and stays there. This happens regardless of whether I use my own trained mm_projector.bin or the weights you released, even though I followed your instructions precisely.
(If anyone has successfully reproduced the training process, let's discuss this issue together.)