DAMO-NLP-SG / Video-LLaMA

[EMNLP 2023 Demo] Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding
BSD 3-Clause "New" or "Revised" License

inf value occurs during forwarding process when fine-tuning VL branch with LLAVA-150K+MiniGPT4-3.5K+webvid-instruct #138

Open xuboshen opened 9 months ago

xuboshen commented 9 months ago

Great work! I've run into a problem and hope someone has ideas.

When I fine-tune only the VL branch with LLaMA-2 on image/video instruction data, inf values occur, and the magnitudes of torch.max(hidden_states) and torch.min(hidden_states) keep growing.

Several attempts have been made:

Preparations:

My platform: 8 × A6000 (48 GB). The environment is set up exactly following the environment.yml in this repository. The data is prepared following LLaVA (COCO), WebVid-10M, and MiniGPT-4. The 7B LLaMA-2 pretrained weights are from this repo as well.

The demo runs correctly on the remote platform, and the training process appears to start correctly. I did not modify any code.

Problem

I found that some samples produce 'inf' values at the last layer of LLaMA-2, i.e. at decoder layer index 31 in the loop over decoder layers. The error does not occur immediately. Instead, torch.max(hidden_states) grows larger and larger on the positive side while torch.min(hidden_states) grows more and more negative, until inf appears.

[Image: -inf of hidden_states training]
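For anyone who wants to reproduce the per-layer tracking described above, here is a minimal sketch using PyTorch forward hooks. It is only an illustration under assumptions, not code from this repository: `attach_range_hooks`, `first_non_finite`, and the toy `Scale` module are hypothetical names, and the float16 cast in the toy module is an assumption used to force an overflow (the post does not state which precision is used). With a real model, the list of decoder layers would be passed instead of the toy layers; where that list lives in the model object is also an assumption to check against your checkpoint.

```python
import torch
import torch.nn as nn


def attach_range_hooks(layers, stats):
    """Register forward hooks that record min/max and finiteness per layer."""
    handles = []
    for idx, layer in enumerate(layers):
        def hook(module, inputs, output, idx=idx):
            # Some decoder layers return a tuple; hidden states come first.
            hidden = output[0] if isinstance(output, tuple) else output
            hidden = hidden.detach().float()
            stats.append({
                "layer": idx,
                "min": hidden.min().item(),
                "max": hidden.max().item(),
                "finite": bool(torch.isfinite(hidden).all()),
            })
        handles.append(layer.register_forward_hook(hook))
    return handles


def first_non_finite(stats):
    """Return the index of the first layer whose output has inf/nan, else None."""
    for entry in stats:
        if not entry["finite"]:
            return entry["layer"]
    return None


class Scale(nn.Module):
    """Hypothetical stand-in for a decoder layer: multiplies by a constant."""

    def __init__(self, factor):
        super().__init__()
        self.factor = factor

    def forward(self, x):
        # Compute in float32, then cast back so float16 overflow shows up as inf.
        return (x.float() * self.factor).to(x.dtype)


# Toy "model": three layers that each scale activations by 300.
layers = nn.ModuleList([Scale(300.0) for _ in range(3)])
stats = []
handles = attach_range_hooks(layers, stats)

x = torch.ones(2, 4, dtype=torch.float16)
for layer in layers:
    x = layer(x)

for handle in handles:
    handle.remove()  # detach hooks once the check is done

for entry in stats:
    print(entry)
print("first non-finite layer:", first_non_finite(stats))
```

In the toy run, the values grow from layer to layer until they exceed the float16 range (about 65504), which mirrors the "larger and larger" pattern reported here. Logging the same statistics per training step, together with the sample IDs in the batch, may help narrow down which inputs trigger the blow-up.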

Does anyone have an idea why this problem occurs and how to solve it? Thanks in advance for your time and help.

xuboshen commented 9 months ago

I tried setting batchsize=1 and training proceeds as expected, while batchsize=4 produces inf values and training fails.

Could anyone explain this phenomenon?