Sally-SH / VSP-LLM

The loss becomes NaN when batch size > 1 #5

Open ReflectionL opened 1 month ago

ReflectionL commented 1 month ago

During training, I found that if I set the batch size > 1, the loss sometimes becomes NaN:

tensor(nan, device='cuda:0', grad_fn=<NllLossBackward0>)

and some logits are also NaN:

(Pdb) p llm_out.logits
tensor([[[ -1.8848,   2.0195,   1.3193,  ...,   0.4946,  -0.0728,  -0.1754],
         [ -7.1992,  -6.4062,   1.8340,  ...,  -0.9331,  -3.5195,  -2.0039],
         [  5.8125,   8.0703,  -0.6016,  ...,   9.7422,   8.6641,   8.7109],
         ...,
         [-14.9766, -12.4609,  -0.7085,  ...,  -8.5781,  -9.7344,  -7.9766],
         [-14.3125, -11.6562,   1.1104,  ...,  -8.6172,  -9.7812,  -7.9336],
         [ -3.6602,  20.9219,   6.5430,  ...,  -1.7139,  -2.1250,  -0.8105]],

        [[ -1.8848,   2.0195,   1.3193,  ...,   0.4946,  -0.0728,  -0.1754],
         [ -7.1992,  -6.4062,   1.8340,  ...,  -0.9331,  -3.5195,  -2.0039],
         [  5.8125,   8.0703,  -0.6016,  ...,   9.7422,   8.6641,   8.7109],
         ...,
         [     nan,      nan,      nan,  ...,      nan,      nan,      nan],
         [     nan,      nan,      nan,  ...,      nan,      nan,      nan],
         [     nan,      nan,      nan,  ...,      nan,      nan,      nan]]],
       device='cuda:0', grad_fn=<ToCopyBackward0>)

I checked the padding of the labels and features, and everything looks fine. If I set the batch size to 1, this problem doesn't happen, so I don't think the issue is caused by quantization. Any help would be appreciated.
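
For reference, this is roughly how I'm locating the bad batch elements inside the pdb session above (a minimal sketch; llm_out is the output of the forward pass, assumed to have logits of shape [batch, seq_len, vocab]):

import torch

# Flag the batch elements whose logits contain any NaN.
nan_mask = torch.isnan(llm_out.logits)
bad = nan_mask.flatten(start_dim=1).any(dim=1)
print("batch elements with NaN logits:", bad.nonzero(as_tuple=True)[0].tolist())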

JeongHun0716 commented 1 month ago

In the VSP-LLM GitHub project, batch training and inference have not yet been implemented. To support batch training, we recommend aligning the padding for the instructions, visual features, and labels to the left side of each llm_input. Additionally, an attention mask corresponding to this llm_input should be passed to the LLM.

For instance:

instruction    = [x, x, pad]
visual feature = [x, x, pad]
labels         = [x, x, x, pad, pad]

-> llm_input = [pad, pad, pad, pad, x, x, x, x, x, x, x]

By implementing this process, you can efficiently train the VSP-LLM model using batches.
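
A minimal sketch of this left-padding and masking step in PyTorch (the function name, tensor shapes, and pad handling here are illustrative assumptions, not the repo's actual implementation):

import torch

def left_pad_llm_input(segments, lengths, pad_embed):
    # segments:  list of right-padded embedding tensors [seq_len, hidden]
    #            for the instruction, visual features, and labels
    # lengths:   number of real (non-pad) positions in each segment
    # pad_embed: embedding of the pad token, shape [hidden]
    real = torch.cat([seg[:n] for seg, n in zip(segments, lengths)], dim=0)
    total = sum(seg.size(0) for seg in segments)
    n_pad = total - real.size(0)
    pads = pad_embed.unsqueeze(0).expand(n_pad, -1)
    llm_input = torch.cat([pads, real], dim=0)  # all padding on the left
    attention_mask = torch.cat([
        torch.zeros(n_pad, dtype=torch.long),   # 0 = masked pad positions
        torch.ones(real.size(0), dtype=torch.long),
    ])
    return llm_input, attention_mask

With every sample padded to a common total length, the per-sample (llm_input, attention_mask) pairs can be stacked into a batch and the mask passed alongside the embeddings (e.g., model(inputs_embeds=..., attention_mask=...) if the backbone is a Hugging Face causal LM), so the pad positions do not contribute to attention.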