X-PLUG / mPLUG-Owl

mPLUG-Owl: The Powerful Multi-modal Large Language Model Family
https://www.modelscope.cn/studios/damo/mPLUG-Owl
MIT License

How to do the training on multiple images or image pair data? #61

Open CBZhao2021 opened 1 year ago

CBZhao2021 commented 1 year ago

Thank you for your contribution!

I tried to make a custom image pair dataset as:

{"image": ["image1.jpg","image2.jpg"], "text": "The following is a conversation between a curious human and AI assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.\nHuman: \nHuman: \nHuman: Can you compare the different between these two images?\nAI: xxxx", "task_type": "xxx"}

However, the training loss is always NaN. How can I train on a custom image-pair dataset, and how did you train on your video data?

Thank you so much!

MAGAer13 commented 1 year ago

I think you are missing the <image> token as the placeholder for the image inputs. You may try this:

{"image": ["image1.jpg","image2.jpg"], "text": "The following is a conversation between a curious human and AI assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.\nHuman: <image>\nHuman: <image>\nHuman: Can you compare the different between these two images?\nAI: xxxx", "task_type": "xxx"}
CBZhao2021 commented 1 year ago

Thank you for your answer!

I'm sorry, that was a typo in my post: I do have the <image> token in the custom data; otherwise the program would report an error. But I don't think that is what caused the NaN. Is it possible that data for a specific task causes non-convergence? The custom dataset is about urban landscape estimation and comparison.

I first tried converting the case data in OwlEval into train and val data like this (82 lines in total):

{"image": ["./OwlEval/cases/5.jpg"], "text": "The following is a conversation between a curious human and AI assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.\nHuman: <image>\nHuman: Where is the frisbee in the image? \nAI: The frisbee is in the air, being caught by the man in red.\n", "task_type": "gpt4instruct_sft"}

Then I trained on it, and the loss converged (0.7644, 0.7428, ... down to 0.0154) in about 1000 iterations.

Then I tried my own dataset (4000 lines in total):

{"image": ["image1.jpg","image2.jpg"], "text": "The following is a conversation between a curious human and AI assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.\nHuman: <image>\nHuman: <image>\nHuman: Can you compare the different between these two images?\nAI: xxxx", "task_type": "xxx"}

and the loss was NaN at every step.

To figure out the problem, I mixed the case data and the custom data: with 82 lines of case data + 10 lines of custom data, training converged. Then I changed the setting to 82 lines of case data + 40 lines of custom data, and the loss was NaN.

I trained our data only on the second stage, with 4x A100 40G GPUs and a batch size of 4x4. So I want to ask: is there any additional requirement or restriction on the dataset, or are there any training tricks?

MAGAer13 commented 1 year ago

I have met a similar case. I think there is some overflow during training. I recommend having a look at the validation loss, which is more reliable.
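
One generic way to catch such overflows early is to guard the training loop and skip any step whose loss is not finite. This is only a sketch: the model, dataloader, and optimizer arguments are hypothetical placeholders, not the actual objects built by the mPLUG-Owl training script.

```python
import torch

def train_with_nan_guard(model, dataloader, optimizer):
    """Sketch of a training loop that skips any step whose loss is not finite.
    All arguments are placeholders for whatever the real training script builds."""
    torch.autograd.set_detect_anomaly(True)  # slower, but reports the op that produced NaN
    for step, batch in enumerate(dataloader):
        outputs = model(**batch)
        loss = outputs.loss
        if not torch.isfinite(loss):
            print(f"step {step}: non-finite loss ({loss.item()}), skipping update")
            optimizer.zero_grad()
            continue
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```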

hangzeli08 commented 1 year ago

Has this been solved? I'm also running into the loss = NaN problem.

hangzeli08 commented 1 year ago

I'm running into the same problem as him: on your dataset the loss is not NaN, but on my own dataset the loss is NaN.

LukeForeverYoung commented 1 year ago

> I'm running into the same problem as him: on your dataset the loss is not NaN, but on my own dataset the loss is NaN.

There is a high possibility that the issue is caused by the prompt being too long, so the completion part is cut off during preprocessing. As a result, every label is masked to -100 and CrossEntropy averages over zero tokens (0/0), which returns a loss of NaN. There are three potential solutions to this problem:

  1. Prevent feeding input_ids that do not contain any completion tokens.
  2. Set reduction='none' in the CrossEntropyLoss of Llama and reduce the loss with an epsilon value, e.g. outputs.loss = (outputs.loss * loss_mask.view(-1)).sum() / (loss_mask.sum() + 1e-5) (see the sketch below this list).
  3. Simply set output.loss = output.loss * 0 if you get a loss of NaN.
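
For illustration, a minimal sketch of option 2. The function name masked_lm_loss and the shape conventions are assumptions, not the repository's actual code; with the epsilon in the denominator, a sample whose labels are all -100 yields 0 instead of 0/0 = NaN.

```python
import torch
import torch.nn.functional as F

def masked_lm_loss(logits, labels, eps=1e-5):
    """Option 2 (sketch): token-level cross entropy with reduction='none',
    averaged only over unmasked positions (label != -100). The epsilon keeps
    the denominator non-zero when every label in the batch is masked."""
    vocab_size = logits.size(-1)
    per_token = F.cross_entropy(
        logits.view(-1, vocab_size),
        labels.view(-1),
        ignore_index=-100,
        reduction="none",
    )
    loss_mask = (labels.view(-1) != -100).float()
    return (per_token * loss_mask).sum() / (loss_mask.sum() + eps)

# Example: the second sequence is fully masked, yet the loss stays finite.
logits = torch.randn(2, 8, 32000)
labels = torch.randint(0, 32000, (2, 8))
labels[1, :] = -100
print(masked_lm_loss(logits, labels))
```

Option 1 can be handled while building the dataset (drop or re-truncate samples whose answer does not survive the max-length cut), and option 3 amounts to the NaN guard sketched earlier in the thread.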