Closed: vjaideep08 closed this issue 1 month ago
Hello, the figures we give are the minimum GPU memory requirements; they were measured with relatively short input text and low-resolution images. If you need full fine-tuning, I recommend ZeRO-3 on an 8-card 4090 server. If your hardware is limited, you can consider the QLoRA approach instead.
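For reference, below is a minimal QLoRA-style sketch using Hugging Face PEFT and bitsandbytes. It is a generic illustration of the approach, not this repository's finetune_lora.sh, and the target_modules list is an assumption that should be checked against the actual model layers.

```python
# Hypothetical QLoRA setup sketch -- not the repo's finetune_lora.sh.
import torch
from transformers import AutoModel, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize the frozen base weights to 4-bit
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 (supported on A10G)
)

model = AutoModel.from_pretrained(
    "openbmb/MiniCPM-Llama3-V-2_5",
    trust_remote_code=True,
    quantization_config=bnb_config,
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    # target_modules is an assumption; inspect the model to pick the right projections.
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA adapters are trainable
```

With LoRA/QLoRA only the small adapter weights are trained, so the Adam optimizer state shrinks from tens of GB to a few hundred MB, which is why this route fits on smaller GPUs where full-parameter training does not.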
Is there an existing issue / discussion for this?

Is there an existing answer for this in FAQ?

Current Behavior
Hi, I am trying to fine-tune the openbmb/MiniCPM-Llama3-V-2_5 model on a g5.12xlarge instance.
Instance details: 4 GPUs in total, 24 GB of memory per GPU, bf16 and fp16 precision supported.
I am running full-parameter fine-tuning on my own dataset.
Even on the g5.12xlarge, the finetune_ds.sh script throws a CUDA out-of-memory error. Details of the error are below:
[rank0]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 3.74 GiB. GPU 0 has a total capacity of 21.99 GiB of which 943.06 MiB is free. Including non-PyTorch memory, this process has 21.05 GiB memory in use. Of the allocated memory 17.98 GiB is allocated by PyTorch, and 2.59 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
  0%|          | 0/1000 [00:17<?, ?it/s]
W0918 10:19:30.430000 137310713283648 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 12239 closing signal SIGTERM
W0918 10:19:30.430000 137310713283648 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 12240 closing signal SIGTERM
W0918 10:19:30.430000 137310713283648 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 12241 closing signal SIGTERM
E0918 10:19:30.795000 137310713283648 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 12238) of binary:
The same happened on the other three GPUs as well.
The documentation mentions that 15.8 GB per GPU is required for fine-tuning the model, but that does not match what I see: in my case even about 22.5 GB per GPU is not sufficient.
Please clearly suggest a suitable server instance and the hardware requirements for full fine-tuning.
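For context, here is a rough back-of-envelope estimate of why 4 × 24 GB is tight for full-parameter training of an ~8.55B-parameter model with ZeRO-3 in bf16, assuming Adam keeps fp32 master weights and moments on the GPUs (i.e., no CPU offload). Actual usage is higher still because of activations, the vision tower's image tokens, and framework overhead.

```python
# Back-of-envelope GPU memory estimate (assumptions noted inline).
params = 8.55e9          # from the log: num_elems = 8.55B
gpus = 4

bytes_per_param = (
    2      # bf16 weights
    + 2    # bf16 gradients
    + 12   # Adam in fp32: master weights (4) + exp_avg (4) + exp_avg_sq (4)
)

total_gib = params * bytes_per_param / 1024**3
per_gpu_gib = total_gib / gpus  # ZeRO-3 shards all of this across the ranks

print(f"model + grad + optimizer state: ~{total_gib:.0f} GiB total, "
      f"~{per_gpu_gib:.0f} GiB per GPU before activations")
# ~127 GiB total, ~32 GiB per GPU -> already above 24 GB, which is why
# CPU offload (or LoRA/QLoRA) is needed on this instance.
```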
Details of finetune_ds.sh:

#!/bin/bash

GPUS_PER_NODE=4
NNODES=1
NODE_RANK=0
MASTER_ADDR=localhost
MASTER_PORT=6001

MODEL="openbmb/MiniCPM-Llama3-V-2_5"
DATA=
EVAL_DATA=
LLM_TYPE="llama3" # if use openbmb/MiniCPM-V-2, please set LLM_TYPE=minicpm; if use openbmb/MiniCPM-Llama3-V-2_5, please set LLM_TYPE="llama3"
MODEL_MAX_Length=2048 # if conduct multi-images sft, please set MODEL_MAX_Length=4096

DISTRIBUTED_ARGS="
    --nproc_per_node $GPUS_PER_NODE \
    --nnodes $NNODES \
    --node_rank $NODE_RANK \
    --master_addr $MASTER_ADDR \
    --master_port $MASTER_PORT
"
torchrun $DISTRIBUTED_ARGS finetune.py \
    --model_name_or_path $MODEL \
    --llm_type $LLM_TYPE \
    --data_path $DATA \
    --eval_data_path $EVAL_DATA \
    --remove_unused_columns false \
    --label_names "labels" \
    --prediction_loss_only false \
    --bf16 true \
    --bf16_full_eval true \
    --fp16 false \
    --fp16_full_eval false \
    --do_train \
    --do_eval \
    --tune_vision true \
    --tune_llm true \
    --model_max_length $MODEL_MAX_Length \
    --max_slice_nums 4 \
    --max_steps 10000 \
    --eval_steps 1000 \
    --output_dir output/output_minicpmv_llama3_finetuned_model \
    --logging_dir output/output_minicpmv_llama3_finetuned_model \
    --logging_strategy "steps" \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "steps" \
    --save_strategy "steps" \
    --save_steps 1000 \
    --save_total_limit 10 \
    --learning_rate 1e-6 \
    --weight_decay 0.1 \
    --adam_beta2 0.95 \
    --warmup_ratio 0.01 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --gradient_checkpointing true \
    --deepspeed ds_config_zero3.json \
    --report_to "tensorboard"
I am using ds_config_zero3.json and bf16 precision.
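One common way to fit full-parameter training on 4 × 24 GB is to enable CPU offload of the optimizer state (and optionally the parameters) in the ZeRO-3 config. The sketch below writes an assumed variant of such a config to a hypothetical file name, ds_config_zero3_offload.json; it follows the standard DeepSpeed/Hugging Face examples rather than this repository's exact ds_config_zero3.json, and the "auto" fields are placeholders that the Hugging Face Trainer fills in from its own arguments.

```python
# Hypothetical ZeRO-3 config with CPU offload -- adapted from standard
# DeepSpeed/HF examples, not copied from this repo's ds_config_zero3.json.
import json

ds_config = {
    "bf16": {"enabled": "auto"},
    "zero_optimization": {
        "stage": 3,
        # Move optimizer state (and optionally params) to CPU RAM to free VRAM.
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "offload_param": {"device": "cpu", "pin_memory": True},
        "overlap_comm": True,
        "contiguous_gradients": True,
        "stage3_gather_16bit_weights_on_model_save": True,
    },
    # "auto" lets the Hugging Face Trainer fill these from its own arguments.
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "train_batch_size": "auto",
    "gradient_clipping": "auto",
}

with open("ds_config_zero3_offload.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```

If you try this, pass the new file via --deepspeed in finetune_ds.sh. Offloading the optimizer state usually frees the most memory (offload_param is optional), at the cost of host RAM and slower steps; setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True, as the error message itself suggests, can additionally reduce fragmentation.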
Expected Behavior

No response

Steps To Reproduce

No response

Environment

Anything else?
Complete logs:

bash finetune_ds.sh
W0918 10:18:21.930000 137310713283648 torch/distributed/run.py:779]
W0918 10:18:21.930000 137310713283648 torch/distributed/run.py:779]
W0918 10:18:21.930000 137310713283648 torch/distributed/run.py:779] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0918 10:18:21.930000 137310713283648 torch/distributed/run.py:779]
[2024-09-18 10:18:24,769] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-18 10:18:24,871] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-18 10:18:24,915] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-18 10:18:24,932] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-18 10:18:25,704] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-09-18 10:18:25,704] [INFO] [comm.py:683:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
/home/ubuntu/jayadeep/finetuning/work_17_9_2024/MiniCPM-V/venv_finetuning/lib/python3.11/site-packages/huggingface_hub/file_download.py:1150: FutureWarning: resume_download is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use force_download=True.
  warnings.warn(
[2024-09-18 10:18:25,799] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-09-18 10:18:25,856] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-09-18 10:18:25,894] [INFO] [comm.py:652:init_distributed] cdb=None
/home/ubuntu/jayadeep/finetuning/work_17_9_2024/MiniCPM-V/venv_finetuning/lib/python3.11/site-packages/huggingface_hub/file_download.py:1150: FutureWarning: resume_download is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use force_download=True.
  warnings.warn(
/home/ubuntu/jayadeep/finetuning/work_17_9_2024/MiniCPM-V/venv_finetuning/lib/python3.11/site-packages/huggingface_hub/file_download.py:1150: FutureWarning: resume_download is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use force_download=True.
  warnings.warn(
/home/ubuntu/jayadeep/finetuning/work_17_9_2024/MiniCPM-V/venv_finetuning/lib/python3.11/site-packages/huggingface_hub/file_download.py:1150: FutureWarning: resume_download is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use force_download=True.
  warnings.warn(
[2024-09-18 10:18:26,437] [INFO] [config.py:733:init] Config mesh_device None world_size = 4
[2024-09-18 10:18:26,930] [INFO] [config.py:733:init] Config mesh_device None world_size = 4
[2024-09-18 10:18:27,341] [INFO] [config.py:733:init] Config mesh_device None world_size = 4
[2024-09-18 10:18:27,728] [INFO] [config.py:733:init] Config mesh_device None world_size = 4
[2024-09-18 10:18:34,027] [INFO] [partition_parameters.py:348:exit] finished initializing model - num_params = 743, num_elems = 8.55B
Loading checkpoint shards: 100%|██████████| 7/7 [00:25<00:00, 3.71s/it]
Loading checkpoint shards: 100%|██████████| 7/7 [00:25<00:00, 3.71s/it]
Loading checkpoint shards: 100%|██████████| 7/7 [00:25<00:00, 3.71s/it]
Loading checkpoint shards: 100%|██████████| 7/7 [00:26<00:00, 3.73s/it]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
max_steps is given, it will override any value given in num_train_epochs
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
{'Total': 8537092336, 'Trainable': 8537092336}
llm_type=llama3
Loading data...
max_steps is given, it will override any value given in num_train_epochs
max_steps is given, it will override any value given in num_train_epochs
max_steps is given, it will override any value given in num_train_epochs
Using /home/ubuntu/.cache/torch_extensions/py311_cu121 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/ubuntu/.cache/torch_extensions/py311_cu121/fused_adam/build.ninja...
/home/ubuntu/jayadeep/finetuning/work_17_9_2024/MiniCPM-V/venv_finetuning/lib/python3.11/site-packages/torch/utils/cpp_extension.py:1965: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation. If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
  warnings.warn(
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
Using /home/ubuntu/.cache/torch_extensions/py311_cu121 as PyTorch extensions root...
ninja: no work to do.
Loading extension module fused_adam...
Time to load fused_adam op: 0.08963775634765625 seconds
Using /home/ubuntu/.cache/torch_extensions/py311_cu121 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/ubuntu/.cache/torch_extensions/py311_cu121/fused_adam/build.ninja...
/home/ubuntu/jayadeep/finetuning/work_17_9_2024/MiniCPM-V/venv_finetuning/lib/python3.11/site-packages/torch/utils/cpp_extension.py:1965: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation. If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
  warnings.warn(
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
Using /home/ubuntu/.cache/torch_extensions/py311_cu121 as PyTorch extensions root...
ninja: no work to do.
Loading extension module fused_adam...
Time to load fused_adam op: 0.09027910232543945 seconds
Loading extension module fused_adam...
Time to load fused_adam op: 0.10106420516967773 seconds
Loading extension module fused_adam...
Time to load fused_adam op: 0.20101070404052734 seconds
Parameter Offload: Total persistent parameters: 706800 in 346 params
  0%|          | 0/1000 [00:00<?, ?it/s]
/home/ubuntu/jayadeep/finetuning/work_17_9_2024/MiniCPM-V/venv_finetuning/lib/python3.11/site-packages/torch/utils/checkpoint.py:1399: FutureWarning: torch.cpu.amp.autocast(args...) is deprecated. Please use torch.amp.autocast('cpu', args...) instead.
  with device_autocast_ctx, torch.cpu.amp.autocast(cpu_autocast_kwargs), recompute_context:  # type: ignore[attr-defined]
/home/ubuntu/jayadeep/finetuning/work_17_9_2024/MiniCPM-V/venv_finetuning/lib/python3.11/site-packages/torch/utils/checkpoint.py:1399: FutureWarning: torch.cpu.amp.autocast(args...) is deprecated. Please use torch.amp.autocast('cpu', args...) instead.
  with device_autocast_ctx, torch.cpu.amp.autocast(cpu_autocast_kwargs), recompute_context:  # type: ignore[attr-defined]
/home/ubuntu/jayadeep/finetuning/work_17_9_2024/MiniCPM-V/venv_finetuning/lib/python3.11/site-packages/torch/utils/checkpoint.py:1399: FutureWarning: torch.cpu.amp.autocast(args...) is deprecated. Please use torch.amp.autocast('cpu', args...) instead.
  with device_autocast_ctx, torch.cpu.amp.autocast(cpu_autocast_kwargs), recompute_context:  # type: ignore[attr-defined]
/home/ubuntu/jayadeep/finetuning/work_17_9_2024/MiniCPM-V/venv_finetuning/lib/python3.11/site-packages/torch/utils/checkpoint.py:1399: FutureWarning: torch.cpu.amp.autocast(args...) is deprecated. Please use torch.amp.autocast('cpu', args...) instead.
  with device_autocast_ctx, torch.cpu.amp.autocast(cpu_autocast_kwargs), recompute_context:  # type: ignore[attr-defined]
[rank3]: Traceback (most recent call last):
[rank3]:   File "/home/ubuntu/jayadeep/finetuning/work_17_9_2024/MiniCPM-V/finetune/finetune.py", line 299, in <module>
[rank3]:   File "/home/ubuntu/jayadeep/finetuning/work_17_9_2024/MiniCPM-V/finetune/finetune.py", line 289, in train
[rank3]:   File "/home/ubuntu/jayadeep/finetuning/work_17_9_2024/MiniCPM-V/venv_finetuning/lib/python3.11/site-packages/transformers/trainer.py", line 1859, in train
[rank3]:     return inner_training_loop(
[rank3]:   File "/home/ubuntu/jayadeep/finetuning/work_17_9_2024/MiniCPM-V/venv_finetuning/lib/python3.11/site-packages/transformers/trainer.py", line 2203, in _inner_training_loop
[rank3]:     tr_loss_step = self.training_step(model, inputs)
[rank3]:   File "/home/ubuntu/jayadeep/finetuning/work_17_9_2024/MiniCPM-V/finetune/trainer.py", line 211, in training_step
[rank3]:   File "/home/ubuntu/jayadeep/finetuning/work_17_9_2024/MiniCPM-V/venv_finetuning/lib/python3.11/site-packages/accelerate/accelerator.py", line 2117, in backward
[rank3]:     self.deepspeed_engine_wrapped.backward(loss, **kwargs)
[rank3]:   File "/home/ubuntu/jayadeep/finetuning/work_17_9_2024/MiniCPM-V/venv_finetuning/lib/python3.11/site-packages/accelerate/utils/deepspeed.py", line 175, in backward
[rank3]:   File "/home/ubuntu/jayadeep/finetuning/work_17_9_2024/MiniCPM-V/venv_finetuning/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 2213, in step
[rank3]:   File "/home/ubuntu/jayadeep/finetuning/work_17_9_2024/MiniCPM-V/venv_finetuning/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 2119, in _take_model_step
[rank3]:   File "/home/ubuntu/jayadeep/finetuning/work_17_9_2024/MiniCPM-V/venv_finetuning/lib/python3.11/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
[rank3]:     ret_val = func(*args, **kwargs)
[rank3]:   File "/home/ubuntu/jayadeep/finetuning/work_17_9_2024/MiniCPM-V/venv_finetuning/lib/python3.11/site-packages/deepspeed/runtime/zero/stage3.py", line 2077, in step
[rank3]:     self._prepare_sub_group(sub_group_id, timer_names)
[rank3]:   File "/home/ubuntu/jayadeep/finetuning/work_17_9_2024/MiniCPM-V/venv_finetuning/lib/python3.11/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
[rank3]:     ret_val = func(*args, **kwargs)
[rank3]:   File "/home/ubuntu/jayadeep/finetuning/work_17_9_2024/MiniCPM-V/venv_finetuning/lib/python3.11/site-packages/deepspeed/runtime/zero/stage3.py", line 1907, in _prepare_sub_group
[rank3]:   File "/home/ubuntu/jayadeep/finetuning/work_17_9_2024/MiniCPM-V/venv_finetuning/lib/python3.11/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
[rank3]:     ret_val = func(*args, **kwargs)
[rank3]:   File "/home/ubuntu/jayadeep/finetuning/work_17_9_2024/MiniCPM-V/venv_finetuning/lib/python3.11/site-packages/deepspeed/runtime/zero/stage3.py", line 1883, in _prepare_fp32_grad_for_sub_group
[rank3]:     single_grad_partition = self.flatten(self.averaged_gradients[sub_group_id]).to(
[rank3]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 3.74 GiB. GPU 3 has a total capacity of 21.99 GiB of which 943.06 MiB is free. Including non-PyTorch memory, this process has 21.05 GiB memory in use. Of the allocated memory 17.98 GiB is allocated by PyTorch, and 2.59 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)