old-kai opened 5 days ago
I am trying to fine-tune Qwen2.5-Coder but get a `torch.OutOfMemoryError: CUDA out of memory` error. When I run the Qwen2 training script instead, it works. Is there something wrong? The command and the outputs are as follows.
Command
```shell
python -m torch.distributed.run \
    --nproc_per_node=${} \
    --master_addr=${MASTER_ADDR} \
    --master_port=$MASTER_PORT \
    --nnodes=$WORLD_SIZE \
    --node_rank=$RANK \
    train.py \
    --model_name_or_path /Qwen2.5-Coder-32B-Instruct \
    --data_path XX \
    --bf16 True \
    --output_dir XX \
    --num_train_epochs 4 \
    --model_max_length 10000 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 0.25 \
    --save_total_limit 3 \
    --learning_rate 5e-6 \
    --warmup_steps 10 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --report_to "none" \
    --gradient_checkpointing True \
    --lazy_preprocess True \
    --deepspeed ./ds.json
```
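For context, a rough back-of-the-envelope sketch of where the memory could be going with these flags. Every number below is an assumption, not something read from `ds.json` (which is not shown) or the actual model config: it assumes 8 ranks sharing ZeRO-3 partitions (8 worker processes do appear in the log below) and the published Qwen2.5-32B shape of roughly 64 layers with hidden size 5120.

```python
# Rough, hypothetical estimate of per-GPU memory for this run; every number
# below is an assumption, not something read from ds.json or the model config.
GIB = 1024 ** 3

params      = 32e9     # ~32B parameters (Qwen2.5-Coder-32B-Instruct)
n_gpus      = 8        # assumed ranks sharing ZeRO partitions
layers      = 64       # assumed Qwen2.5-32B depth
hidden      = 5120     # assumed Qwen2.5-32B hidden size
seq_len     = 10_000   # --model_max_length
micro_batch = 1        # --per_device_train_batch_size

# If ds.json uses ZeRO-3, bf16 weights + grads + fp32 Adam states
# (~16 bytes/param) are sharded across the ranks.
states_per_rank_gib = params * 16 / n_gpus / GIB

# With gradient checkpointing, roughly one bf16 activation tensor per layer
# (plus temporaries) is kept; this part scales linearly with seq_len.
activations_gib = layers * micro_batch * seq_len * hidden * 2 * 2 / GIB

print(f"sharded model/optimizer states per rank: ~{states_per_rank_gib:.0f} GiB")
print(f"checkpointed activations at seq_len={seq_len}: ~{activations_gib:.0f} GiB")
```

Under those assumptions the two terms alone land near the ~70 GiB that the OOM message below reports as allocated by PyTorch, leaving very little headroom on an 80 GiB card for attention/MLP temporaries at this sequence length.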
Outputs
```
2024-11-20 14:37:16 [rank3]: File "/Qwen2.5-Coder/finetuning/sft/train.py", line 190, in <module>
2024-11-20 14:37:16 [rank3]: train()
2024-11-20 14:37:16 [rank3]: File "/Qwen2.5-Coder/finetuning/sft/train.py", line 185, in train
2024-11-20 14:37:16 [rank3]: trainer.train()
2024-11-20 14:37:16 [rank3]: File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 2123, in train
2024-11-20 14:37:16 [rank3]: return inner_training_loop(
2024-11-20 14:37:16 [rank3]: File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 2481, in _inner_training_loop
2024-11-20 14:37:16 [rank3]: tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
2024-11-20 14:37:16 [rank3]: File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 3579, in training_step
2024-11-20 14:37:16 [rank3]: loss = self.compute_loss(model, inputs, num_items_in_batch=num_items_in_batch)
2024-11-20 14:37:16 [rank3]: File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 3633, in compute_loss
2024-11-20 14:37:16 [rank3]: outputs = model(**inputs)
2024-11-20 14:37:16 [rank3]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
2024-11-20 14:37:16 [rank3]: return self._call_impl(*args, **kwargs)
2024-11-20 14:37:16 [rank3]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
2024-11-20 14:37:16 [rank3]: return forward_call(*args, **kwargs)
2024-11-20 14:37:16 [rank3]: File "/usr/local/lib/python3.10/dist-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
2024-11-20 14:37:16 [rank3]: ret_val = func(*args, **kwargs)
2024-11-20 14:37:16 [rank3]: File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/engine.py", line 1846, in forward
2024-11-20 14:37:16 [rank3]: loss = self.module(*inputs, **kwargs)
2024-11-20 14:37:16 [rank3]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
2024-11-20 14:37:16 [rank3]: return self._call_impl(*args, **kwargs)
2024-11-20 14:37:16 [rank3]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1603, in _call_impl
2024-11-20 14:37:16 [rank3]: result = forward_call(*args, **kwargs)
2024-11-20 14:37:16 [rank3]: File "/usr/local/lib/python3.10/dist-packages/transformers/models/qwen2/modeling_qwen2.py", line 1164, in forward
2024-11-20 14:37:16 [rank3]: outputs = self.model(
2024-11-20 14:37:16 [rank3]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
2024-11-20 14:37:16 [rank3]: return self._call_impl(*args, **kwargs)
2024-11-20 14:37:16 [rank3]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1603, in _call_impl
2024-11-20 14:37:16 [rank3]: result = forward_call(*args, **kwargs)
2024-11-20 14:37:16 [rank3]: File "/usr/local/lib/python3.10/dist-packages/transformers/models/qwen2/modeling_qwen2.py", line 895, in forward
2024-11-20 14:37:16 [rank3]: layer_outputs = decoder_layer(
2024-11-20 14:37:16 [rank3]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
2024-11-20 14:37:16 [rank3]: return self._call_impl(*args, **kwargs)
2024-11-20 14:37:16 [rank3]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1603, in _call_impl
2024-11-20 14:37:16 [rank3]: result = forward_call(*args, **kwargs)
2024-11-20 14:37:16 [rank3]: File "/usr/local/lib/python3.10/dist-packages/transformers/models/qwen2/modeling_qwen2.py", line 637, in forward
2024-11-20 14:37:16 [rank3]: hidden_states = self.post_attention_layernorm(hidden_states)
2024-11-20 14:37:16 [rank3]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
2024-11-20 14:37:16 [rank3]: return self._call_impl(*args, **kwargs)
2024-11-20 14:37:16 [rank3]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1603, in _call_impl
2024-11-20 14:37:16 [rank3]: result = forward_call(*args, **kwargs)
2024-11-20 14:37:16 [rank3]: File "/usr/local/lib/python3.10/dist-packages/transformers/models/qwen2/modeling_qwen2.py", line 78, in forward
2024-11-20 14:37:16 [rank3]: hidden_states = hidden_states.to(torch.float32)
2024-11-20 14:37:16 [rank3]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 132.00 MiB. GPU 3 has a total capacity of 79.10 GiB of which 98.00 MiB is free. Process 387746 has 78.89 GiB memory in use. Of the allocated memory 70.12 GiB is allocated by PyTorch, and 765.81 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2024-11-20 14:37:16 DEBUG:filelock:Attempting to acquire lock 140341412642880 on /root/.triton/autotune/Fp16Matmul_2d_kernel.pickle.lock
2024-11-20 14:37:16 DEBUG:filelock:Lock 140341412642880 acquired on /root/.triton/autotune/Fp16Matmul_2d_kernel.pickle.lock
2024-11-20 14:37:16 DEBUG:filelock:Attempting to release lock 140341412642880 on /root/.triton/autotune/Fp16Matmul_2d_kernel.pickle.lock
2024-11-20 14:37:16 DEBUG:filelock:Lock 140341412642880 released on /root/.triton/autotune/Fp16Matmul_2d_kernel.pickle.lock
2024-11-20 14:37:16 DEBUG:filelock:Attempting to acquire lock 140341412642880 on /root/.triton/autotune/Fp16Matmul_4d_kernel.pickle.lock
2024-11-20 14:37:16 DEBUG:filelock:Lock 140341412642880 acquired on /root/.triton/autotune/Fp16Matmul_4d_kernel.pickle.lock
2024-11-20 14:37:16 DEBUG:filelock:Attempting to release lock 140341412642880 on /root/.triton/autotune/Fp16Matmul_4d_kernel.pickle.lock
2024-11-20 14:37:16 DEBUG:filelock:Lock 140341412642880 released on /root/.triton/autotune/Fp16Matmul_4d_kernel.pickle.lock
2024-11-20 14:37:28 W1120 14:37:28.532000 140666529523520 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 93 closing signal SIGTERM
2024-11-20 14:37:28 W1120 14:37:28.533000 140666529523520 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 94 closing signal SIGTERM
2024-11-20 14:37:28 W1120 14:37:28.533000 140666529523520 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 95 closing signal SIGTERM
2024-11-20 14:37:28 W1120 14:37:28.533000 140666529523520 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 97 closing signal SIGTERM
2024-11-20 14:37:28 W1120 14:37:28.534000 140666529523520 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 98 closing signal SIGTERM
2024-11-20 14:37:28 W1120 14:37:28.534000 140666529523520 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 99 closing signal SIGTERM
2024-11-20 14:37:28 W1120 14:37:28.535000 140666529523520 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 100 closing signal SIGTERM
2024-11-20 14:37:58 W1120 14:37:58.535000 140666529523520 torch/distributed/elastic/multiprocessing/api.py:875] Unable to shutdown process 93 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
2024-11-20 14:38:09 W1120 14:38:09.109000 140666529523520 torch/distributed/elastic/multiprocessing/api.py:875] Unable to shutdown process 94 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
2024-11-20 14:38:09 W1120 14:38:09.138000 140666529523520 torch/distributed/elastic/multiprocessing/api.py:875] Unable to shutdown process 97 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
2024-11-20 14:38:09 W1120 14:38:09.165000 140666529523520 torch/distributed/elastic/multiprocessing/api.py:875] Unable to shutdown process 99 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
2024-11-20 14:38:09 W1120 14:38:09.183000 140666529523520 torch/distributed/elastic/multiprocessing/api.py:875] Unable to shutdown process 100 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
2024-11-20 14:38:16 E1120 14:38:16.510000 140666529523520 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 3 (pid: 96) of binary: /usr/bin/python
2024-11-20 14:38:16 Traceback (most recent call last):
2024-11-20 14:38:16 File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
2024-11-20 14:38:16 return _run_code(code, main_globals, None,
2024-11-20 14:38:16 File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
2024-11-20 14:38:16 exec(code, run_globals)
2024-11-20 14:38:16 File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 905, in <module>
2024-11-20 14:38:16 main()
2024-11-20 14:38:16 File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
2024-11-20 14:38:16 return f(*args, **kwargs)
2024-11-20 14:38:16 File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 901, in main
2024-11-20 14:38:16 run(args)
2024-11-20 14:38:16 File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 892, in run
2024-11-20 14:38:16 elastic_launch(
2024-11-20 14:38:16 File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 133, in __call__
2024-11-20 14:38:16 return launch_agent(self._config, self._entrypoint, list(args))
2024-11-20 14:38:16 File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
2024-11-20 14:38:16 raise ChildFailedError(
2024-11-20 14:38:16 torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
2024-11-20 14:38:16 ============================================================
2024-11-20 14:38:16 /Qwen2.5-Coder/finetuning/sft/train.py FAILED
2024-11-20 14:38:16 ------------------------------------------------------------
2024-11-20 14:38:16 Failures:
2024-11-20 14:38:16 <NO_OTHER_FAILURES>
2024-11-20 14:38:16 ------------------------------------------------------------
2024-11-20 14:38:16 Root Cause (first observed failure):
2024-11-20 14:38:16 [0]:
2024-11-20 14:38:16 time : 2024-11-20_14:37:28
2024-11-20 14:38:16 host : 584-5971-20241120143511-master-0
2024-11-20 14:38:16 rank : 3 (local_rank: 3)
2024-11-20 14:38:16 exitcode : 1 (pid: 96)
2024-11-20 14:38:16 error_file: <N/A>
2024-11-20 14:38:16 traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
```
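The allocator hint in the trace above is worth trying first. A hedged sketch (not a confirmed fix for this script) of setting it from inside the training entry point before any CUDA allocation happens; the more common alternative is simply prepending `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True` to the launch command:

```python
# Hypothetical mitigation sketch; only takes effect if it runs before the
# first CUDA allocation (e.g. at the very top of train.py).
import os

# Straight from the allocator's own suggestion in the OOM message above;
# it reduces fragmentation but does not create extra memory.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")
```

Since only ~98 MiB was free when a 132 MiB block was requested, the bigger levers are likely `--model_max_length` and the ZeRO stage / offload settings in `ds.json`, which are not shown here.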
Is there any difference between the two scripts, such as model_max_length?
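For what it's worth, in most of these SFT scripts model_max_length only caps (and often pads to) the tokenized sequence length, so a larger value directly raises per-step activation memory. An illustrative sketch, not the repo's actual preprocessing code:

```python
# Illustrative only: how model_max_length typically bounds sequence length
# during SFT preprocessing; function and variable names are hypothetical.
from transformers import AutoTokenizer

model_max_length = 10_000  # value passed on the command line above

tokenizer = AutoTokenizer.from_pretrained(
    "/Qwen2.5-Coder-32B-Instruct",
    model_max_length=model_max_length,
)

def tokenize_example(text: str):
    # A longer model_max_length means longer (less truncated) sequences,
    # and many SFT scripts also pad every example up to this length,
    # so activation memory per step grows with it.
    return tokenizer(
        text,
        max_length=model_max_length,
        truncation=True,
        padding="max_length",
        return_tensors="pt",
    )
```

If the Qwen2 script that works uses a smaller default here (or a smaller effective batch), that alone could explain the difference.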