Open CBZhao2021 opened 1 year ago
I think you missed the `<image>` tokens; the format should be:
{"image": ["image1.jpg","image2.jpg"], "text": "The following is a conversation between a curious human and AI assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.\nHuman: <image>\nHuman: <image>\nHuman: Can you compare the difference between these two images?\nAI: xxxx", "task_type": "xxx"}
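For reference, a quick way to catch this kind of malformed sample before training is to check that each line's `<image>` placeholder count matches the length of its image list. This is a hypothetical helper sketched against the JSONL format shown above, not part of the mPLUG-Owl codebase:

```python
import json

def validate_sample(line):
    """Check that a JSONL training sample has exactly one <image>
    placeholder in its text for every file in its image list.
    (Hypothetical helper based on the format shown above.)"""
    sample = json.loads(line)
    images = sample.get("image", [])
    n_placeholders = sample["text"].count("<image>")
    return len(images) == n_placeholders

good = '{"image": ["a.jpg", "b.jpg"], "text": "Human: <image>\\nHuman: <image>\\nHuman: compare them\\nAI: ok", "task_type": "qa"}'
bad  = '{"image": ["a.jpg", "b.jpg"], "text": "Human: compare them\\nAI: ok", "task_type": "qa"}'
print(validate_sample(good))  # True
print(validate_sample(bad))   # False
```

Running this over every line of a training file will flag samples where a placeholder was dropped or duplicated.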
Thank you for your answer!
I'm sorry for the typo; I do have the `<image>` token in my custom data, otherwise the program would report an error. But I don't think that is what caused the NaN. Is it possible that data from a specific task causes the non-convergence? The custom dataset is about urban landscape estimation and comparison.
I first tried to convert the case data in OwlEval to the train and val data as:
{"image": ["./OwlEval/cases/5.jpg"], "text": "The following is a conversation between a curious human and AI assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.\nHuman: <image>
Then I tried my dataset as:
{"image": ["image1.jpg","image2.jpg"], "text": "The following is a conversation between a curious human and AI assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.\nHuman: <image>
To figure out the problem, I mixed the case data and the custom data: with 82 lines of case data + 10 lines of custom data, training converged. When I changed the mix to 82 lines of case data + 40 lines of custom data, the loss was NaN.
I trained our data only in the second stage, on 4x A100 40G GPUs with batch size 4x4. So I want to ask whether there is any additional requirement or restriction on the dataset, or whether there are any training tricks?
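One dataset property worth auditing in a case like this is whether some samples' prompts are so long that the answer would be truncated away at the maximum sequence length. A rough sketch of such a check, using whitespace splitting as a stand-in for the real tokenizer and a hypothetical maximum length of 2048 tokens (both are assumptions, not the actual mPLUG-Owl preprocessing):

```python
def answer_survives_truncation(text, max_len=2048):
    """Rough check: does the answer (everything after the last 'AI:')
    begin before max_len tokens? Whitespace splitting stands in for the
    real tokenizer; max_len=2048 is an assumed limit."""
    prompt, sep, _answer = text.rpartition("AI:")
    if not sep:
        return False  # sample has no answer turn at all
    return len(prompt.split()) < max_len

print(answer_survives_truncation("Human: hi\nAI: hello"))       # True
print(answer_survives_truncation("word " * 3000 + "AI: hi"))    # False
```

Samples flagged `False` would end up with every label masked out after truncation, which matches the NaN-loss failure mode discussed in this thread.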
I have run into a similar case. I think there is some overflow during training. I recommend taking a look at the validation loss, which is more reliable.
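One way to make such overflow visible (and survivable) is to guard each batch loss: if it comes back NaN or Inf, replace it with zero so that step becomes a no-op. A plain-float sketch of the idea; actual training code would apply the same check to the tensor loss:

```python
import math

def guard_loss(loss):
    """Replace a NaN/Inf batch loss with 0.0 so that a single
    overflowing batch does not poison the optimizer state.
    (Plain-float sketch; real code would operate on the tensor loss.)"""
    if not math.isfinite(loss):
        return 0.0
    return loss

print(guard_loss(1.25))          # 1.25
print(guard_loss(float("nan")))  # 0.0
print(guard_loss(float("inf")))  # 0.0
```

Logging whenever the guard fires also tells you which batches (and hence which samples) trigger the overflow.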
Has this been solved? I also ran into the loss=NaN problem.
I ran into the same problem as him: on your dataset the loss is not NaN, but on my own dataset the loss is NaN.
There is a high possibility that the issue is caused by the prompt being too long, so that the completion part is cut off during preprocessing. As a result, the `label_mask` is all -100 and `CrossEntropy` will return a loss of NaN. There are a few potential solutions to this problem:

1. Set `reduction='none'` in the `CrossEntropyLoss` of Llama and reduce the loss with an epsilon value, e.g. `outputs.loss = (outputs.loss * loss_mask.view(-1)).sum() / (loss_mask.sum() + 1e-5)`.
2. Set `output.loss = output.loss * 0` if you get a loss of NaN.
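The epsilon-guarded reduction can be illustrated with plain Python floats (a sketch of the idea only, not the actual Llama/PyTorch code): when every token is masked out, the denominator becomes `eps` instead of 0, so the loss degrades to 0.0 rather than NaN.

```python
def masked_mean_loss(token_losses, loss_mask, eps=1e-5):
    """Epsilon-guarded masked mean of per-token losses.
    If every entry of loss_mask is 0 (all labels were -100), the
    denominator is eps instead of 0, so this returns 0.0, not NaN.
    (Plain-Python sketch of the reduction='none' + epsilon trick.)"""
    total = sum(l * m for l, m in zip(token_losses, loss_mask))
    return total / (sum(loss_mask) + eps)

print(masked_mean_loss([2.0, 1.0, 3.0], [1, 1, 0]))  # ≈ 1.5
print(masked_mean_loss([2.0, 1.0, 3.0], [0, 0, 0]))  # 0.0, not NaN
```

In real training code the same arithmetic would run on the per-token loss tensor returned by `CrossEntropyLoss(reduction='none')`.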
Thank you for your contribution!
I tried to make a custom image pair dataset as:
{"image": ["image1.jpg","image2.jpg"], "text": "The following is a conversation between a curious human and AI assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.\nHuman:\nHuman: \nHuman: Can you compare the difference between these two images?\nAI: xxxx", "task_type": "xxx"}
However, the training loss is always NaN. How can I train on a custom image-pair dataset, or how did you train on your video data?
Thank you so much!