I used a large model (>170B) as my reward model. At the very beginning, the loss is normal, but after training for one step it becomes NaN. This did not happen when I used a smaller base model (e.g., 30B) to train the RM. Do you have any suggestions?
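For context, here is a minimal sketch of how I check where the NaN first appears, assuming a standard PyTorch training loop (`reward_model`, `batch`, and the `.loss` attribute are illustrative placeholders, not my exact code):

```python
import torch

def training_step(reward_model, batch, optimizer):
    optimizer.zero_grad()
    # Illustrative forward pass; in my run the loss is finite on step 0
    # and becomes NaN from step 1 onward.
    loss = reward_model(**batch).loss
    if not torch.isfinite(loss):
        # Distinguish a NaN loss from NaN/Inf weights produced by the
        # previous optimizer step.
        bad = [name for name, p in reward_model.named_parameters()
               if not torch.isfinite(p).all()]
        raise RuntimeError(f"non-finite loss; non-finite params: {bad}")
    loss.backward()
    optimizer.step()
```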