PKU-YuanGroup / MoE-LLaVA

Mixture-of-Experts for Large Vision-Language Models
https://arxiv.org/abs/2401.15947
Apache License 2.0

[Question] CUDA OOM when finetuning phi2-clipL336 at stage 2 with 8×A100-40G #72

Closed: terry-for-github closed this issue 4 months ago

terry-for-github commented 4 months ago

Question

I tried to train MoE-LLaVA (phi-2 + clip-vit-L-336) following the official tutorial. The first-stage pretraining (scripts/v1/phi2/pretrain.sh) finished successfully, but the second stage (scripts/v1/phi2/finetune.sh) raises a CUDA OOM error. Is this normal? Is 40 GB of VRAM simply not enough, or is something not set properly? I'd really appreciate it if someone could share their own training experience or give some advice.

My 1st stage run command:

    CODE_FOLDER="${HOME_FOLDER}/code"
    DATA_FOLDER="${HOME_FOLDER}/data"
    MODEL_FOLDER="${HOME_FOLDER}/models"
    JSON_FOLDER="${DATA_FOLDER}/MoE-LLaVA-Json"
    IMAGE_FOLDER="${DATA_FOLDER}/MoE-LLaVA-Image"
    cd ${CODE_FOLDER}/MoE-LLaVA

    HF_DATASETS_OFFLINE=1 TRANSFORMERS_OFFLINE=1 deepspeed moellava/train/train_mem.py \
        --deepspeed ./scripts/zero2.json \
        --model_name_or_path ${MODEL_FOLDER}/microsoft_phi-2 \
        --version plain \
        --data_path ${JSON_FOLDER}/llavaimage.json \
        --image_folder ${IMAGE_FOLDER} \
        --image_tower ${MODEL_FOLDER}/openai_clip-vit-large-patch14-336 \
        --image_projector_type mlp2x_gelu \
        --tune_mm_mlp_adapter True \
        --mm_vision_select_layer -2 \
        --mm_use_im_start_end False \
        --mm_use_im_patch_token False \
        --bf16 True \
        --output_dir ./checkpoints/llavaphi-2.7b-pretrain \
        --num_train_epochs 1 \
        --per_device_train_batch_size 32 \
        --per_device_eval_batch_size 4 \
        --gradient_accumulation_steps 1 \
        --evaluation_strategy "no" \
        --save_strategy "steps" \
        --save_steps 24000 \
        --save_total_limit 1 \
        --learning_rate 1e-3 \
        --weight_decay 0. \
        --warmup_ratio 0.03 \
        --lr_scheduler_type "cosine" \
        --logging_steps 1 \
        --tf32 True \
        --model_max_length 2048 \
        --gradient_checkpointing True \
        --dataloader_num_workers 8 \
        --lazy_preprocess True \
        --report_to tensorboard \
        --cache_dir "./cache_dir"
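As a quick sanity check between the two stages (a minimal sketch, using only the paths from the commands in this issue): stage 2 loads the stage-1 projector via --pretrain_mm_mlp_adapter, so that file should exist before launching the finetune.

    # Confirm stage 1 actually produced the projector weights that stage 2 expects.
    ls -lh ./checkpoints/llavaphi-2.7b-pretrain/mm_projector.bin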

My 2nd stage run command:

    CODE_FOLDER="${HOME_FOLDER}/code"
    DATA_FOLDER="${HOME_FOLDER}/data"
    MODEL_FOLDER="${HOME_FOLDER}/models"
    JSON_FOLDER="${DATA_FOLDER}/MoE-LLaVA-Json"
    IMAGE_FOLDER="${DATA_FOLDER}/MoE-LLaVA-Image"
    cd ${CODE_FOLDER}/MoE-LLaVA

    HF_DATASETS_OFFLINE=1 TRANSFORMERS_OFFLINE=1 deepspeed moellava/train/train_mem.py \
        --deepspeed ./scripts/zero2.json \
        --model_name_or_path ${MODEL_FOLDER}/microsoft_phi-2 \
        --version phi \
        --data_path ${JSON_FOLDER}/la_tune_256k.json \
                    ${JSON_FOLDER}/lrv_tune_331k.json ${JSON_FOLDER}/lvis_tune220k.json \
                    ${JSON_FOLDER}/svit_tune_157k.json ${JSON_FOLDER}/nlp_tune.json \
        --image_folder ${IMAGE_FOLDER} \
        --image_tower ${MODEL_FOLDER}/openai_clip-vit-large-patch14-336 \
        --image_projector_type mlp2x_gelu \
        --pretrain_mm_mlp_adapter ./checkpoints/llavaphi-2.7b-pretrain/mm_projector.bin \
        --mm_vision_select_layer -2 \
        --mm_use_im_start_end False \
        --mm_use_im_patch_token False \
        --image_aspect_ratio pad \
        --group_by_modality_length True \
        --bf16 True \
        --output_dir ./checkpoints/llavaphi-2.7b-finetune \
        --num_train_epochs 1 \
        --per_device_train_batch_size 8 \
        --per_device_eval_batch_size 4 \
        --gradient_accumulation_steps 2 \
        --evaluation_strategy "no" \
        --save_strategy "steps" \
        --save_steps 50000 \
        --save_total_limit 1 \
        --learning_rate 2e-5 \
        --weight_decay 0. \
        --warmup_ratio 0.03 \
        --lr_scheduler_type "cosine" \
        --logging_steps 1 \
        --tf32 True \
        --model_max_length 2048 \
        --gradient_checkpointing True \
        --dataloader_num_workers 4 \
        --lazy_preprocess True \
        --report_to tensorboard \
        --cache_dir "./cache_dir"
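One way to see whether the 40 GB cards are simply out of headroom is to watch per-GPU memory while the finetune job runs (a generic nvidia-smi sketch, not specific to this repo):

    # Poll per-GPU memory every 5 s during training; on an A100-40G the usable
    # ceiling PyTorch reports in the error below is about 39.4 GiB per device.
    watch -n 5 nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv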

CUDA OOM error:

    File "/root/miniconda3/envs/moellava/lib/python3.10/site-packages/transformers/trainer.py", line 1539, in train
        return inner_training_loop(
    File "/root/miniconda3/envs/moellava/lib/python3.10/site-packages/transformers/trainer.py", line 1869, in _inner_training_loop
        tr_loss_step = self.training_step(model, inputs)
    File "/root/miniconda3/envs/moellava/lib/python3.10/site-packages/transformers/trainer.py", line 2777, in training_step
        self.accelerator.backward(loss)
    File "/root/miniconda3/envs/moellava/lib/python3.10/site-packages/accelerate/accelerator.py", line 1847, in backward
        self.deepspeed_engine_wrapped.backward(loss, **kwargs)
    File "/root/miniconda3/envs/moellava/lib/python3.10/site-packages/accelerate/utils/deepspeed.py", line 167, in backward
        self.engine.backward(loss, **kwargs)
    File "/root/miniconda3/envs/moellava/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
        ret_val = func(*args, **kwargs)
    File "/root/miniconda3/envs/moellava/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1861, in backward
        self.optimizer.backward(loss, retain_graph=retain_graph)
    File "/root/miniconda3/envs/moellava/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1900, in backward
        self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
    File "/root/miniconda3/envs/moellava/lib/python3.10/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward
        scaled_loss.backward(retain_graph=retain_graph)
    File "/root/miniconda3/envs/moellava/lib/python3.10/site-packages/torch/_tensor.py", line 487, in backward
        torch.autograd.backward(
    File "/root/miniconda3/envs/moellava/lib/python3.10/site-packages/torch/autograd/__init__.py", line 200, in backward
        Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
    File "/root/miniconda3/envs/moellava/lib/python3.10/site-packages/torch/autograd/function.py", line 274, in apply
        return user_fn(self, *args)
    File "/root/miniconda3/envs/moellava/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 157, in backward
        torch.autograd.backward(outputs_with_grad, args_with_grad)
    File "/root/miniconda3/envs/moellava/lib/python3.10/site-packages/torch/autograd/__init__.py", line 200, in backward
        Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
    torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 6.56 GiB (GPU 0; 39.43 GiB total capacity; 33.67 GiB already allocated; 3.18 GiB free; 34.69 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
      1%|█▌ | 83/7884 [09:01<14:07:57, 6.52s/it]
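The allocator hint at the end of the traceback can be tried by exporting PYTORCH_CUDA_ALLOC_CONF before relaunching the same command (a sketch; the 128 MiB value is just a common starting point, not something from this repo). Note that here reserved (34.69 GiB) is only slightly above allocated (33.67 GiB), so fragmentation is probably not the main cause and this alone may not be enough:

    # Allocator tuning suggested by the error message; it only mitigates
    # fragmentation and cannot recover memory the model genuinely needs.
    export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
    # ...then rerun the same stage-2 finetune command as above.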

Thank you!

terry-for-github commented 4 months ago

I solved the problem by reducing the batch_size.

Just change the arguments from --per_device_train_batch_size 8 --per_device_eval_batch_size 4 --gradient_accumulation_steps 2 to --per_device_train_batch_size 4 --per_device_eval_batch_size 4 --gradient_accumulation_steps 4.
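For reference, this swap keeps the effective batch size unchanged (a back-of-the-envelope check, assuming the 8-GPU setup from the issue title):

    # Samples per optimizer step = GPUs × per_device_train_batch_size × gradient_accumulation_steps
    echo "before: $(( 8 * 8 * 2 ))"   # 128
    echo "after:  $(( 8 * 4 * 4 ))"   # 128
    # Same global batch size of 128; only the per-step activation memory is halved.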