PKU-YuanGroup / Video-LLaVA

Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
https://arxiv.org/pdf/2311.10122.pdf
Apache License 2.0

Instruction tuning on A100 (40G)? #32

Closed · JosephPai closed this issue 9 months ago

JosephPai commented 9 months ago

Dear author,

Thanks for releasing the amazing code. I'm trying to train the model using A100 (40G).

I loaded the pre-trained mm_projector.bin and ran the finetune.sh script with video data. However, even after I decreased per_device_train_batch_size to 1, I still got CUDA out-of-memory errors. I noticed that the default setting is 16, so I wonder whether something is wrong on my side?

Looking forward to hearing back from you.

Thanks!

LinB203 commented 9 months ago

I actually ran into this problem too, but it was solved when I increased the number of GPUs (keeping the batch size per GPU the same). Maybe you can try zero3_offload.json; I think it has something to do with DeepSpeed.
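
For reference, here is a minimal sketch of what a ZeRO-3 config with CPU offload typically looks like, written as a heredoc. The keys follow the standard DeepSpeed schema, but the file name and exact values are illustrative; the repo's actual scripts/zero3_offload.json may differ.

# Illustrative ZeRO-3 + CPU offload config (standard DeepSpeed keys; the
# file name and values here are examples, not the repo's shipped file).
cat > ./scripts/zero3_offload_example.json <<'EOF'
{
  "bf16": { "enabled": "auto" },
  "train_micro_batch_size_per_gpu": "auto",
  "train_batch_size": "auto",
  "gradient_accumulation_steps": "auto",
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": { "device": "cpu", "pin_memory": true },
    "offload_param": { "device": "cpu", "pin_memory": true },
    "overlap_comm": true,
    "contiguous_gradients": true
  }
}
EOF
# Then pass it to the launcher: --deepspeed ./scripts/zero3_offload_example.json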

JosephPai commented 9 months ago

How many GPUs are you using? I'm using 4 GPUs (40G) with your default settings. I would really appreciate it if you could help with this issue. The script is here:

HF_DATASETS_OFFLINE=1 TRANSFORMERS_OFFLINE=1 CUDA_VISIBLE_DEVICES=0,1,2,3 deepspeed llava/train/train_mem.py \
    --deepspeed ./scripts/zero2.json \
    --model_name_or_path lmsys/vicuna-7b-v1.5 \
    --version v1 \
    --data_path VideoLLAVA/train_json/videochatgpt_conv_tune.json \
    --video_folder VideoChatGPT/Activity_Videos \
    --image_folder None \
    --X "Video" \
    --video_tower LanguageBind/LanguageBind_Video_merge \
    --pretrain_mm_mlp_adapter checkpoints/Video-LLaVA-Pretrain-7B/mm_projector.bin \
    --mm_projector_type mlp2x_gelu \
    --mm_vision_select_layer -2 \
    --mm_use_x_start_end False \
    --mm_use_x_patch_token False \
    --image_aspect_ratio pad \
    --group_by_modality_length True \
    --bf16 True \
    --output_dir ./checkpoints/Video-LLaVA-7B \
    --num_train_epochs 1 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 32 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 50000 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --dataloader_num_workers 8 \
    --lazy_preprocess True \
    --report_to tensorboard \
    --cache_dir "/home/ubuntu/.cache/huggingface/hub"

JosephPai commented 9 months ago

@LinB203

LinB203 commented 9 months ago

Yes, the default setting runs on 4 GPUs.

JosephPai commented 9 months ago

I'm also using 4 GPUs. Are you using 80G A100s? Do you have any idea about the OOM error given the script above?

LinB203 commented 9 months ago

Yes, we use 80G GPUs. You can try reducing the batch size while increasing the gradient accumulation steps. By the way, the current architecture takes 8 full frames as input, at 256 tokens per frame, so 2048 tokens in total, which is very costly for the LLM. We are working on improving this: in the next version we will compress tokens, expand the data size, and build a new, more efficient encoder architecture so that community researchers can train on a 40G A100 or even a V100.
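
To make the cost concrete, here is a small sketch based on the numbers above; the flag values in the comments are hypothetical, not the authors' exact settings.

# Visual tokens per video in the current architecture:
echo $(( 8 * 256 ))   # 8 frames x 256 tokens/frame = 2048 tokens fed to the LLM
# Trading batch size for gradient accumulation keeps the effective batch the
# same while holding fewer activations in GPU memory at once, e.g.:
#   --per_device_train_batch_size 16 --gradient_accumulation_steps 1
#   --per_device_train_batch_size 1  --gradient_accumulation_steps 16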

SCZwangxiao commented 9 months ago

> I actually ran into this problem too, but it was solved when I increased the number of GPUs (keeping the batch size per GPU the same). Maybe you can try zero3_offload.json; I think it has something to do with DeepSpeed.

In ZeRO-3, the model is split equally across all GPUs. See the ZeRO paper.
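
As a back-of-the-envelope illustration of why the sharding helps, assuming a 7B-parameter model trained with Adam in mixed precision (roughly 16 bytes of model states per parameter, per the ZeRO paper):

# Model states for ~7B params at ~16 bytes/param, sharded by ZeRO-3 over N GPUs:
echo "$(( 7 * 16 )) GB of model states in total"   # ~112 GB
echo "$(( 7 * 16 / 4 )) GB per GPU on 4 GPUs"      # ~28 GB
echo "$(( 7 * 16 / 8 )) GB per GPU on 8 GPUs"      # ~14 GB
# Activations, the vision tower, and allocator overhead come on top of this.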

JosephPai commented 9 months ago

Finally, I decreased the batch size to 1 and ran it on 8x 40G GPUs, and it works. I kept num_gpus * per_device_train_batch_size * gradient_accumulation_steps the same as in the original script.
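
For concreteness, a quick check of that product; the 8-GPU accumulation value below is illustrative, not necessarily the exact setting used.

# Original script above: 4 GPUs x batch 1 x 32 accumulation steps
echo $(( 4 * 1 * 32 ))   # 128
# 8 x 40G GPUs with batch 1 and, for example, 16 accumulation steps match it:
echo $(( 8 * 1 * 16 ))   # 128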

LinB203 commented 8 months ago

I uploaded zero2_offload.json; you can try --deepspeed ./scripts/zero2_offload.json. Feel free to let me know of any updates.

Ali2500 commented 3 months ago

> I uploaded zero2_offload.json; you can try --deepspeed ./scripts/zero2_offload.json. Feel free to let me know of any updates.

In the paper you mention using a batch size of 128 during finetuning. With a per-GPU batch size of 16 (as in the finetune.sh script), do you then use 8x A100 (80G) GPUs?
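
For reference, the arithmetic behind this question; the GPU counts below are possibilities being asked about, not confirmed settings.

# Effective batch = GPUs x per-device batch x gradient accumulation steps.
echo $(( 8 * 16 * 1 ))   # 8 GPUs x 16 per GPU x 1 step = 128
echo $(( 4 * 16 * 2 ))   # 4 GPUs x 16 per GPU x 2 steps also gives 128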