Sally-SH / VSP-LLM

The loss becomes NaN when batch size > 1 #5

Open ReflectionL opened 1 month ago

ReflectionL commented 1 month ago

During training, I found that if I set the batch size > 1, the loss sometimes becomes NaN:

tensor(nan, device='cuda:0', grad_fn=<NllLossBackward0>)

and some logits are also NaN:

(Pdb) p llm_out.logits
tensor([[[ -1.8848,   2.0195,   1.3193,  ...,   0.4946,  -0.0728,  -0.1754],
         [ -7.1992,  -6.4062,   1.8340,  ...,  -0.9331,  -3.5195,  -2.0039],
         [  5.8125,   8.0703,  -0.6016,  ...,   9.7422,   8.6641,   8.7109],
         ...,
         [-14.9766, -12.4609,  -0.7085,  ...,  -8.5781,  -9.7344,  -7.9766],
         [-14.3125, -11.6562,   1.1104,  ...,  -8.6172,  -9.7812,  -7.9336],
         [ -3.6602,  20.9219,   6.5430,  ...,  -1.7139,  -2.1250,  -0.8105]],

        [[ -1.8848,   2.0195,   1.3193,  ...,   0.4946,  -0.0728,  -0.1754],
         [ -7.1992,  -6.4062,   1.8340,  ...,  -0.9331,  -3.5195,  -2.0039],
         [  5.8125,   8.0703,  -0.6016,  ...,   9.7422,   8.6641,   8.7109],
         ...,
         [     nan,      nan,      nan,  ...,      nan,      nan,      nan],
         [     nan,      nan,      nan,  ...,      nan,      nan,      nan],
         [     nan,      nan,      nan,  ...,      nan,      nan,      nan]]],
       device='cuda:0', grad_fn=<ToCopyBackward0>)

I checked the padding of the labels and features, and everything looks fine. If I set the batch size to 1, this problem doesn't happen, so I don't think the issue is caused by quantization. Any help would be appreciated.
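
For reference, this is roughly how I'm locating the bad batch elements inside the pdb session above (a minimal sketch; llm_out is the output of the forward pass, assumed to have logits of shape [batch, seq_len, vocab]):

import torch

# Flag the batch elements whose logits contain any NaN.
nan_mask = torch.isnan(llm_out.logits)
bad = nan_mask.flatten(start_dim=1).any(dim=1)
print("batch elements with NaN logits:", bad.nonzero(as_tuple=True)[0].tolist())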

JeongHun0716 commented 1 month ago

In the VSP-LLM GitHub project, batch training and inference have not yet been implemented. To support batch training, we recommend aligning the padding for the instructions, visual features, and labels to the left side of each llm_input. Additionally, an attention mask corresponding to this llm_input should be passed to the LLM.

For instance:

instruction    = [x, x, pad]
visual feature = [x, x, pad]
labels         = [x, x, x, pad, pad]

-> llm_input = [pad, pad, pad, pad, x, x, x, x, x, x, x]

By implementing this process, you can efficiently train the VSP-LLM model using batches.
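
A minimal sketch of this left-padding and masking step in PyTorch (the function name, tensor shapes, and pad handling here are illustrative assumptions, not the repo's actual implementation):

import torch

def left_pad_llm_input(segments, lengths, pad_embed):
    # segments:  list of right-padded embedding tensors [seq_len, hidden]
    #            for the instruction, visual features, and labels
    # lengths:   number of real (non-pad) positions in each segment
    # pad_embed: embedding of the pad token, shape [hidden]
    real = torch.cat([seg[:n] for seg, n in zip(segments, lengths)], dim=0)
    total = sum(seg.size(0) for seg in segments)
    n_pad = total - real.size(0)
    pads = pad_embed.unsqueeze(0).expand(n_pad, -1)
    llm_input = torch.cat([pads, real], dim=0)  # all padding on the left
    attention_mask = torch.cat([
        torch.zeros(n_pad, dtype=torch.long),   # 0 = masked pad positions
        torch.ones(real.size(0), dtype=torch.long),
    ])
    return llm_input, attention_mask

With every sample padded to a common total length, the per-sample (llm_input, attention_mask) pairs can be stacked into a batch and the mask passed alongside the embeddings (e.g., model(inputs_embeds=..., attention_mask=...) if the backbone is a Hugging Face causal LM), so the pad positions do not contribute to attention.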