Closed JosephPai closed 9 months ago
I actually had this problem, but when I increased the number of GPUs (batch size per GPU remained the same), the problem was solved.
Maybe you can try zero3_offload.json. I think it should have something to do with DeepSpeed.
How many GPUs are you using? I'm using 4 GPUs (40G) under your default setting. I would really appreciate it if you could help with this issue. The script is here:
HF_DATASETS_OFFLINE=1 TRANSFORMERS_OFFLINE=1 CUDA_VISIBLE_DEVICES=0,1,2,3 deepspeed llava/train/train_mem.py \
    --deepspeed ./scripts/zero2.json \
    --model_name_or_path lmsys/vicuna-7b-v1.5 \
    --version v1 \
    --data_path VideoLLAVA/train_json/videochatgpt_conv_tune.json \
    --video_folder VideoChatGPT/Activity_Videos \
    --image_folder None \
    --X "Video" \
    --video_tower LanguageBind/LanguageBind_Video_merge \
    --pretrain_mm_mlp_adapter checkpoints/Video-LLaVA-Pretrain-7B/mm_projector.bin \
    --mm_projector_type mlp2x_gelu \
    --mm_vision_select_layer -2 \
    --mm_use_x_start_end False \
    --mm_use_x_patch_token False \
    --image_aspect_ratio pad \
    --group_by_modality_length True \
    --bf16 True \
    --output_dir ./checkpoints/Video-LLaVA-7B \
    --num_train_epochs 1 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 32 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 50000 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --dataloader_num_workers 8 \
    --lazy_preprocess True \
    --report_to tensorboard \
    --cache_dir "/home/ubuntu/.cache/huggingface/hub"
@LinB203
Yes, the default setting runs on 4 GPUs.
I'm also using 4 GPUs. Are you using 80G A100s? Do you have any idea about the OOM error given the script above?
Yes, we use 80G. You can try reducing the per-device batch size while increasing the gradient accumulation steps. By the way, the current architecture feeds 8 full frames to the LLM at 256 tokens per frame, which is 2048 visual tokens in total. This is very costly for the LLM. We are working on improving this. In the next version we will compress tokens, expand the data size, and create a new and more efficient encoder architecture so that community researchers can train on 40G A100 or even V100.
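To make the token count above concrete, here is a minimal sketch of the arithmetic. The frame and per-frame token counts are taken from the comment; the function name is hypothetical and not part of the Video-LLaVA code.

```python
# Visual tokens passed to the LLM for one video sample.
# 8 frames and 256 tokens per frame come from the maintainer's comment;
# visual_tokens() is an illustrative helper, not a Video-LLaVA function.
FRAMES_PER_VIDEO = 8
TOKENS_PER_FRAME = 256


def visual_tokens(frames: int = FRAMES_PER_VIDEO,
                  tokens_per_frame: int = TOKENS_PER_FRAME) -> int:
    """Return the number of visual tokens for one video."""
    return frames * tokens_per_frame


if __name__ == "__main__":
    for frames in (2, 4, 8):
        print(f"{frames} frames -> {visual_tokens(frames)} visual tokens")
```

Every one of these tokens goes through all layers of the 7B model, so activation memory grows with the frame count even at a per-device batch size of 1.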
In ZeRO-3, the model is split evenly across all GPUs. See the ZeRO paper.
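As a rough illustration of why the ZeRO stage and the GPU count matter here, the sketch below follows the ZeRO paper's accounting for mixed-precision Adam: 2 bytes per parameter for 16-bit weights, 2 for 16-bit gradients, and 12 for the fp32 optimizer states. The 7e9 parameter count is an assumption for a 7B model, and the estimate ignores activations, temporary buffers, and the video tower, so real usage is higher.

```python
# Per-GPU memory for model states under ZeRO, following the ZeRO paper's
# accounting for mixed-precision Adam. Activations and buffers are NOT included.
BYTES_PARAMS = 2   # 16-bit weights
BYTES_GRADS = 2    # 16-bit gradients
BYTES_OPTIM = 12   # fp32 master weights + Adam momentum + Adam variance


def model_state_gb(num_params: float, num_gpus: int, stage: int) -> float:
    """Return per-GPU model-state memory in GB (1e9 bytes) for ZeRO stage 0-3."""
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2 or 3")
    params = BYTES_PARAMS * num_params
    grads = BYTES_GRADS * num_params
    optim = BYTES_OPTIM * num_params
    if stage >= 1:   # stage 1 partitions optimizer states
        optim /= num_gpus
    if stage >= 2:   # stage 2 also partitions gradients
        grads /= num_gpus
    if stage >= 3:   # stage 3 also partitions the weights
        params /= num_gpus
    return (params + grads + optim) / 1e9


if __name__ == "__main__":
    NUM_PARAMS = 7e9  # assumption: roughly 7B trainable parameters
    for gpus in (4, 8):
        for stage in (2, 3):
            gb = model_state_gb(NUM_PARAMS, gpus, stage)
            print(f"{gpus} GPUs, ZeRO-{stage}: {gb:.2f} GB per GPU")
```

Under these assumptions, ZeRO-2 on 4 GPUs already needs about 38.5 GB per GPU for model states alone, which leaves almost nothing on a 40G card. Moving to 8 GPUs or to ZeRO-3 lowers that figure, which is consistent with the reports in this thread.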
Finally, I decreased the batch size to 1 and ran it on 8 40G GPUs, and it works.
I set the gradient accumulation steps so that num_gpus * per_device_train_batch_size * gradient_accumulation_steps stays the same as in the original script.
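The bookkeeping described above can be sketched as follows. The helper names are hypothetical; the 4-GPU values come from the script posted earlier in this thread, and the 8-GPU result is derived here rather than reported by the poster.

```python
# Keep the effective (global) batch size fixed when the GPU count or the
# per-device batch size changes. Helper names are illustrative only.

def effective_batch_size(num_gpus: int, per_device_batch: int, accum_steps: int) -> int:
    """Samples consumed per optimizer step across all GPUs."""
    return num_gpus * per_device_batch * accum_steps


def accumulation_steps_for(target_batch: int, num_gpus: int, per_device_batch: int) -> int:
    """Gradient accumulation steps needed to reach target_batch exactly."""
    per_step = num_gpus * per_device_batch
    if target_batch % per_step != 0:
        raise ValueError("target_batch is not divisible by num_gpus * per_device_batch")
    return target_batch // per_step


if __name__ == "__main__":
    # Values from the posted script: 4 GPUs, batch size 1, 32 accumulation steps.
    target = effective_batch_size(num_gpus=4, per_device_batch=1, accum_steps=32)
    print("effective batch size:", target)
    # Same effective batch size on 8 GPUs with batch size 1.
    print("accumulation steps on 8 GPUs:", accumulation_steps_for(target, 8, 1))
```

The resulting value would be passed as --gradient_accumulation_steps, alongside the matching --per_device_train_batch_size.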
I uploaded zero2_offload.json; you can try --deepspeed ./scripts/zero2_offload.json. Feel free to let me know of any updates.
In the paper you mention using batch size 128 during finetuning. With batch size per GPU of 16 (as in the finetune.sh script), do you then use 8x A100 (80G) GPUs?
Dear author,
Thanks for releasing the amazing code. I'm trying to train the model using A100 (40G).
I loaded the pre-trained mm_projector.bin and ran the finetune.sh script with video data. However, even after I decreased per_device_train_batch_size to 1, I still got a CUDA out-of-memory error. I noticed that the default setting is 16, so I wonder whether something is wrong. Looking forward to hearing back from you.
Thanks!