DAMO-NLP-SG / Video-LLaMA

[EMNLP 2023 Demo] Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding
BSD 3-Clause "New" or "Revised" License

visionbranch stage2 finetune with LLaMA-2 13B, A100 80GB is out of GPU memory. #90

Closed joeking11829 closed 1 year ago

joeking11829 commented 1 year ago

Hi guys,

Thanks for your great work.

I am using the 'visionbranch_stage2_finetune.yaml' configuration to fine-tune 'VL_LLaMA_2_13B_Pretrained.pth' on an A100 80GB with the following settings:

  max_epoch: 3
  iters_per_epoch: 1000
  batch_size_train: 4
  batch_size_eval: 4
  num_workers: 4

I found that the training program runs out of GPU memory.

I only managed to start training when I reduced the batch size to 3.

Do you have any suggestions? Thanks!

joeking11829 commented 1 year ago

By the way, since "iters_per_epoch" is set to 1000, each epoch only uses batch_size * num_GPUs * iters_per_epoch samples from the total training set. So with 8 GPUs, the model will see 32,000 (4 * 8 * 1000) random samples per epoch. Is that correct?

  max_epoch: 3
  iters_per_epoch: 1000
  batch_size_train: 4
  batch_size_eval: 4
  num_workers: 4

Thanks!

hangzhang-nlp commented 1 year ago

Q1: To reduce GPU memory consumption, you can try the following:

- Reduce the batch size.
- Enable BF16 (bfloat16) mixed-precision training (see the sketch below).
- Shorten the maximum input sequence length for the language model.
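
For the BF16 point, here is a minimal, generic PyTorch sketch of a bfloat16 mixed-precision training step. It is only an illustration under the assumption of a standard autocast setup; model, optimizer, and batch are placeholder names, not Video-LLaMA's actual training loop.

  import torch

  def train_step(model, optimizer, batch):
      optimizer.zero_grad()
      # Run the forward pass in bfloat16 to roughly halve activation memory.
      with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
          loss = model(batch)
      # bfloat16 keeps the fp32 exponent range, so no GradScaler is needed.
      loss.backward()
      optimizer.step()
      return loss.item()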

Q2: No, the total number of samples you will see during training is calculated as 4 (batch_size) * 8 (num_GPUs) * 1000 (iters_per_epoch) * 3 (max_epoch) = 96,000 samples. The iters_per_epoch parameter is mainly used to determine how often a checkpoint is saved.
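
As a quick sanity check of that arithmetic (assuming 8 GPUs, as in the calculation above):

  batch_size_train = 4
  num_gpus = 8
  iters_per_epoch = 1000
  max_epoch = 3

  samples_per_epoch = batch_size_train * num_gpus * iters_per_epoch  # 32000
  total_samples = samples_per_epoch * max_epoch                      # 96000
  print(samples_per_epoch, total_samples)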