QwenLM / Qwen2.5-Coder

Qwen2.5-Coder is the code version of Qwen2.5, the large language model series developed by Qwen team, Alibaba Cloud.

torch.OutOfMemoryError: CUDA out of memory #174

Open old-kai opened 5 days ago

old-kai commented 5 days ago

I tried to fine-tune Qwen2.5-Coder but got torch.OutOfMemoryError: CUDA out of memory. I then tried the training script of Qwen2 and it worked. Is there something wrong? The command and the outputs are as follows.

Command

python -m torch.distributed.run \
--nproc_per_node=${} \
--master_addr=${MASTER_ADDR} \
--master_port=$MASTER_PORT \
--nnodes=$WORLD_SIZE \
--node_rank=$RANK \
train.py \
--model_name_or_path /Qwen2.5-Coder-32B-Instruct \
--data_path XX \
--bf16 True \
--output_dir XX \
--num_train_epochs 4 \
--model_max_length 10000 \
--per_device_train_batch_size 1 \
--gradient_accumulation_steps 8 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 0.25 \
--save_total_limit 3 \
--learning_rate 5e-6 \
--warmup_steps 10 \
--lr_scheduler_type cosine \
--logging_steps 10 \
--report_to "none" \
--gradient_checkpointing True \
--lazy_preprocess True \
--deepspeed ./ds.json
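
The contents of ./ds.json are not shown in this report, so the following is a reference sketch only, not the configuration actually used. For a 32B full fine-tune, a ZeRO-3 configuration with optimizer and parameter offload along these lines is a common way to reduce per-GPU memory; all values are illustrative and the output filename is a placeholder.

# Sketch: write an illustrative ZeRO-3 offload config (placeholder filename);
# the "auto" values are resolved by the Hugging Face Trainer's DeepSpeed integration.
cat > ds_zero3_offload.json <<'EOF'
{
  "bf16": { "enabled": "auto" },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": { "device": "cpu", "pin_memory": true },
    "offload_param": { "device": "cpu", "pin_memory": true },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto"
}
EOF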

Outputs

2024-11-20 14:37:16 [rank3]:   File "/Qwen2.5-Coder/finetuning/sft/train.py", line 190, in <module>
2024-11-20 14:37:16 [rank3]:     train()
2024-11-20 14:37:16 [rank3]:   File "/Qwen2.5-Coder/finetuning/sft/train.py", line 185, in train
2024-11-20 14:37:16 [rank3]:     trainer.train()
2024-11-20 14:37:16 [rank3]:   File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 2123, in train
2024-11-20 14:37:16 [rank3]:     return inner_training_loop(
2024-11-20 14:37:16 [rank3]:   File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 2481, in _inner_training_loop
2024-11-20 14:37:16 [rank3]:     tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
2024-11-20 14:37:16 [rank3]:   File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 3579, in training_step
2024-11-20 14:37:16 [rank3]:     loss = self.compute_loss(model, inputs, num_items_in_batch=num_items_in_batch)
2024-11-20 14:37:16 [rank3]:   File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 3633, in compute_loss
2024-11-20 14:37:16 [rank3]:     outputs = model(**inputs)
2024-11-20 14:37:16 [rank3]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
2024-11-20 14:37:16 [rank3]:     return self._call_impl(*args, **kwargs)
2024-11-20 14:37:16 [rank3]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
2024-11-20 14:37:16 [rank3]:     return forward_call(*args, **kwargs)
2024-11-20 14:37:16 [rank3]:   File "/usr/local/lib/python3.10/dist-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
2024-11-20 14:37:16 [rank3]:     ret_val = func(*args, **kwargs)
2024-11-20 14:37:16 [rank3]:   File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/engine.py", line 1846, in forward
2024-11-20 14:37:16 [rank3]:     loss = self.module(*inputs, **kwargs)
2024-11-20 14:37:16 [rank3]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
2024-11-20 14:37:16 [rank3]:     return self._call_impl(*args, **kwargs)
2024-11-20 14:37:16 [rank3]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1603, in _call_impl
2024-11-20 14:37:16 [rank3]:     result = forward_call(*args, **kwargs)
2024-11-20 14:37:16 [rank3]:   File "/usr/local/lib/python3.10/dist-packages/transformers/models/qwen2/modeling_qwen2.py", line 1164, in forward
2024-11-20 14:37:16 [rank3]:     outputs = self.model(
2024-11-20 14:37:16 [rank3]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
2024-11-20 14:37:16 [rank3]:     return self._call_impl(*args, **kwargs)
2024-11-20 14:37:16 [rank3]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1603, in _call_impl
2024-11-20 14:37:16 [rank3]:     result = forward_call(*args, **kwargs)
2024-11-20 14:37:16 [rank3]:   File "/usr/local/lib/python3.10/dist-packages/transformers/models/qwen2/modeling_qwen2.py", line 895, in forward
2024-11-20 14:37:16 [rank3]:     layer_outputs = decoder_layer(
2024-11-20 14:37:16 [rank3]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
2024-11-20 14:37:16 [rank3]:     return self._call_impl(*args, **kwargs)
2024-11-20 14:37:16 [rank3]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1603, in _call_impl
2024-11-20 14:37:16 [rank3]:     result = forward_call(*args, **kwargs)
2024-11-20 14:37:16 [rank3]:   File "/usr/local/lib/python3.10/dist-packages/transformers/models/qwen2/modeling_qwen2.py", line 637, in forward
2024-11-20 14:37:16 [rank3]:     hidden_states = self.post_attention_layernorm(hidden_states)
2024-11-20 14:37:16 [rank3]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
2024-11-20 14:37:16 [rank3]:     return self._call_impl(*args, **kwargs)
2024-11-20 14:37:16 [rank3]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1603, in _call_impl
2024-11-20 14:37:16 [rank3]:     result = forward_call(*args, **kwargs)
2024-11-20 14:37:16 [rank3]:   File "/usr/local/lib/python3.10/dist-packages/transformers/models/qwen2/modeling_qwen2.py", line 78, in forward
2024-11-20 14:37:16 [rank3]:     hidden_states = hidden_states.to(torch.float32)
2024-11-20 14:37:16 [rank3]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 132.00 MiB. GPU 3 has a total capacity of 79.10 GiB of which 98.00 MiB is free. Process 387746 has 78.89 GiB memory in use. Of the allocated memory 70.12 GiB is allocated by PyTorch, and 765.81 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2024-11-20 14:37:16 DEBUG:filelock:Attempting to acquire lock 140341412642880 on /root/.triton/autotune/Fp16Matmul_2d_kernel.pickle.lock
2024-11-20 14:37:16 DEBUG:filelock:Lock 140341412642880 acquired on /root/.triton/autotune/Fp16Matmul_2d_kernel.pickle.lock
2024-11-20 14:37:16 DEBUG:filelock:Attempting to release lock 140341412642880 on /root/.triton/autotune/Fp16Matmul_2d_kernel.pickle.lock
2024-11-20 14:37:16 DEBUG:filelock:Lock 140341412642880 released on /root/.triton/autotune/Fp16Matmul_2d_kernel.pickle.lock
2024-11-20 14:37:16 DEBUG:filelock:Attempting to acquire lock 140341412642880 on /root/.triton/autotune/Fp16Matmul_4d_kernel.pickle.lock
2024-11-20 14:37:16 DEBUG:filelock:Lock 140341412642880 acquired on /root/.triton/autotune/Fp16Matmul_4d_kernel.pickle.lock
2024-11-20 14:37:16 DEBUG:filelock:Attempting to release lock 140341412642880 on /root/.triton/autotune/Fp16Matmul_4d_kernel.pickle.lock
2024-11-20 14:37:16 DEBUG:filelock:Lock 140341412642880 released on /root/.triton/autotune/Fp16Matmul_4d_kernel.pickle.lock
2024-11-20 14:37:28 W1120 14:37:28.532000 140666529523520 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 93 closing signal SIGTERM
2024-11-20 14:37:28 W1120 14:37:28.533000 140666529523520 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 94 closing signal SIGTERM
2024-11-20 14:37:28 W1120 14:37:28.533000 140666529523520 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 95 closing signal SIGTERM
2024-11-20 14:37:28 W1120 14:37:28.533000 140666529523520 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 97 closing signal SIGTERM
2024-11-20 14:37:28 W1120 14:37:28.534000 140666529523520 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 98 closing signal SIGTERM
2024-11-20 14:37:28 W1120 14:37:28.534000 140666529523520 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 99 closing signal SIGTERM
2024-11-20 14:37:28 W1120 14:37:28.535000 140666529523520 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 100 closing signal SIGTERM
2024-11-20 14:37:58 W1120 14:37:58.535000 140666529523520 torch/distributed/elastic/multiprocessing/api.py:875] Unable to shutdown process 93 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
2024-11-20 14:38:09 W1120 14:38:09.109000 140666529523520 torch/distributed/elastic/multiprocessing/api.py:875] Unable to shutdown process 94 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
2024-11-20 14:38:09 W1120 14:38:09.138000 140666529523520 torch/distributed/elastic/multiprocessing/api.py:875] Unable to shutdown process 97 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
2024-11-20 14:38:09 W1120 14:38:09.165000 140666529523520 torch/distributed/elastic/multiprocessing/api.py:875] Unable to shutdown process 99 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
2024-11-20 14:38:09 W1120 14:38:09.183000 140666529523520 torch/distributed/elastic/multiprocessing/api.py:875] Unable to shutdown process 100 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
2024-11-20 14:38:16 E1120 14:38:16.510000 140666529523520 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 3 (pid: 96) of binary: /usr/bin/python
2024-11-20 14:38:16 Traceback (most recent call last):
2024-11-20 14:38:16   File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
2024-11-20 14:38:16     return _run_code(code, main_globals, None,
2024-11-20 14:38:16   File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
2024-11-20 14:38:16     exec(code, run_globals)
2024-11-20 14:38:16   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 905, in <module>
2024-11-20 14:38:16     main()
2024-11-20 14:38:16   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
2024-11-20 14:38:16     return f(*args, **kwargs)
2024-11-20 14:38:16   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 901, in main
2024-11-20 14:38:16     run(args)
2024-11-20 14:38:16   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 892, in run
2024-11-20 14:38:16     elastic_launch(
2024-11-20 14:38:16   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 133, in __call__
2024-11-20 14:38:16     return launch_agent(self._config, self._entrypoint, list(args))
2024-11-20 14:38:16   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
2024-11-20 14:38:16     raise ChildFailedError(
2024-11-20 14:38:16 torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
2024-11-20 14:38:16 ============================================================
2024-11-20 14:38:16 /Qwen2.5-Coder/finetuning/sft/train.py FAILED
2024-11-20 14:38:16 ------------------------------------------------------------
2024-11-20 14:38:16 Failures:
2024-11-20 14:38:16   <NO_OTHER_FAILURES>
2024-11-20 14:38:16 ------------------------------------------------------------
2024-11-20 14:38:16 Root Cause (first observed failure):
2024-11-20 14:38:16 [0]:
2024-11-20 14:38:16   time      : 2024-11-20_14:37:28
2024-11-20 14:38:16   host      : 584-5971-20241120143511-master-0
2024-11-20 14:38:16   rank      : 3 (local_rank: 3)
2024-11-20 14:38:16   exitcode  : 1 (pid: 96)
2024-11-20 14:38:16   error_file: <N/A>
2024-11-20 14:38:16   traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
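
The OOM message above points at one low-risk setting to try first: the expandable_segments allocator option. It only helps when a meaningful amount of memory is reserved but unallocated (about 766 MiB here), so it may well not be enough on its own for a 32B model, but exporting it before re-running the same launch command is a cheap check:

# Allocator hint taken from the error message itself; it only reduces
# fragmentation of reserved memory and cannot free memory genuinely in use.
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True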
cyente commented 4 days ago

Is there any difference between the two scripts, such as model_max_length?
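
One quick way to check is to diff the two launch scripts directly, which makes differences in --model_max_length, batch size, or the DeepSpeed config easy to spot. A minimal sketch, assuming the two scripts are saved locally under placeholder names:

# Placeholder filenames; substitute the actual Qwen2 and Qwen2.5-Coder launch scripts.
diff --side-by-side finetune_qwen2.sh finetune_qwen2.5_coder.sh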