Hi @Hippo88902,
Can you try with 2 or 4 GPUs, just to isolate whether the issue occurs in all multi-GPU setups or only when using 10 GPUs?
Thanks for your reply! The error still happened when I used 2 or 4 GPUs.
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:14<00:00, 3.69s/it]
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:15<00:00, 3.80s/it]
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:15<00:00, 3.80s/it]
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:14<00:00, 3.64s/it]
[rank2]: Traceback (most recent call last):
[rank2]: File "/nfs/RS2416RP/Workspace/ming/hippo/llm2vec/experiments/run_supervised.py", line 472, in <module>
[rank2]: main()
[rank2]: File "/nfs/RS2416RP/Workspace/ming/hippo/llm2vec/experiments/run_supervised.py", line 431, in main
[rank2]: model = LLM2Vec.from_pretrained(
[rank2]: File "/home/ming/.anaconda3/envs/llm2vec/lib/python3.10/site-packages/llm2vec/llm2vec.py", line 107, in from_pretrained
[rank2]: model = PeftModel.from_pretrained(
[rank2]: File "/home/ming/.anaconda3/envs/llm2vec/lib/python3.10/site-packages/peft/peft_model.py", line 430, in from_pretrained
[rank2]: model.load_adapter(model_id, adapter_name, is_trainable=is_trainable, **kwargs)
[rank2]: File "/home/ming/.anaconda3/envs/llm2vec/lib/python3.10/site-packages/peft/peft_model.py", line 984, in load_adapter
[rank2]: adapters_weights = load_peft_weights(model_id, device=torch_device, **hf_hub_download_kwargs)
[rank2]: File "/home/ming/.anaconda3/envs/llm2vec/lib/python3.10/site-packages/peft/utils/save_and_load.py", line 444, in load_peft_weights
[rank2]: adapters_weights = safe_load_file(filename, device=device)
[rank2]: File "/home/ming/.anaconda3/envs/llm2vec/lib/python3.10/site-packages/safetensors/torch.py", line 313, in load_file
[rank2]: result[k] = f.get_tensor(k)
[rank2]: RuntimeError: CUDA error: CUDA-capable device(s) is/are busy or unavailable
[rank2]: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
[rank2]: For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
[rank2]: Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
[rank3]: Traceback (most recent call last):
[rank3]: File "/nfs/RS2416RP/Workspace/ming/hippo/llm2vec/experiments/run_supervised.py", line 472, in <module>
[rank3]: main()
[rank3]: File "/nfs/RS2416RP/Workspace/ming/hippo/llm2vec/experiments/run_supervised.py", line 431, in main
[rank3]: model = LLM2Vec.from_pretrained(
[rank3]: File "/home/ming/.anaconda3/envs/llm2vec/lib/python3.10/site-packages/llm2vec/llm2vec.py", line 107, in from_pretrained
[rank3]: model = PeftModel.from_pretrained(
[rank3]: File "/home/ming/.anaconda3/envs/llm2vec/lib/python3.10/site-packages/peft/peft_model.py", line 430, in from_pretrained
[rank3]: model.load_adapter(model_id, adapter_name, is_trainable=is_trainable, **kwargs)
[rank3]: File "/home/ming/.anaconda3/envs/llm2vec/lib/python3.10/site-packages/peft/peft_model.py", line 984, in load_adapter
[rank3]: adapters_weights = load_peft_weights(model_id, device=torch_device, **hf_hub_download_kwargs)
[rank3]: File "/home/ming/.anaconda3/envs/llm2vec/lib/python3.10/site-packages/peft/utils/save_and_load.py", line 444, in load_peft_weights
[rank3]: adapters_weights = safe_load_file(filename, device=device)
[rank3]: File "/home/ming/.anaconda3/envs/llm2vec/lib/python3.10/site-packages/safetensors/torch.py", line 313, in load_file
[rank3]: result[k] = f.get_tensor(k)
[rank3]: RuntimeError: CUDA error: CUDA-capable device(s) is/are busy or unavailable
[rank3]: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
[rank3]: For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
[rank3]: Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
[rank1]: Traceback (most recent call last):
[rank1]: File "/nfs/RS2416RP/Workspace/ming/hippo/llm2vec/experiments/run_supervised.py", line 472, in <module>
[rank1]: main()
[rank1]: File "/nfs/RS2416RP/Workspace/ming/hippo/llm2vec/experiments/run_supervised.py", line 431, in main
[rank1]: model = LLM2Vec.from_pretrained(
[rank1]: File "/home/ming/.anaconda3/envs/llm2vec/lib/python3.10/site-packages/llm2vec/llm2vec.py", line 107, in from_pretrained
[rank1]: model = PeftModel.from_pretrained(
[rank1]: File "/home/ming/.anaconda3/envs/llm2vec/lib/python3.10/site-packages/peft/peft_model.py", line 430, in from_pretrained
[rank1]: model.load_adapter(model_id, adapter_name, is_trainable=is_trainable, **kwargs)
[rank1]: File "/home/ming/.anaconda3/envs/llm2vec/lib/python3.10/site-packages/peft/peft_model.py", line 984, in load_adapter
[rank1]: adapters_weights = load_peft_weights(model_id, device=torch_device, **hf_hub_download_kwargs)
[rank1]: File "/home/ming/.anaconda3/envs/llm2vec/lib/python3.10/site-packages/peft/utils/save_and_load.py", line 444, in load_peft_weights
[rank1]: adapters_weights = safe_load_file(filename, device=device)
[rank1]: File "/home/ming/.anaconda3/envs/llm2vec/lib/python3.10/site-packages/safetensors/torch.py", line 313, in load_file
[rank1]: result[k] = f.get_tensor(k)
[rank1]: RuntimeError: CUDA error: CUDA-capable device(s) is/are busy or unavailable
[rank1]: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
[rank1]: For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
[rank1]: Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
W0528 02:39:15.124000 139924873729856 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1221066 closing signal SIGTERM
E0528 02:39:16.543000 139924873729856 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 1 (pid: 1221067) of binary: /home/ming/.anaconda3/envs/llm2vec/bin/python
Traceback (most recent call last):
File "/home/ming/.anaconda3/envs/llm2vec/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/home/ming/.anaconda3/envs/llm2vec/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
return f(*args, **kwargs)
File "/home/ming/.anaconda3/envs/llm2vec/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main
run(args)
File "/home/ming/.anaconda3/envs/llm2vec/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run
elastic_launch(
File "/home/ming/.anaconda3/envs/llm2vec/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/ming/.anaconda3/envs/llm2vec/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
experiments/run_supervised.py FAILED
------------------------------------------------------------
Failures:
[1]:
time : 2024-05-28_02:39:15
host : gpu10
rank : 2 (local_rank: 2)
exitcode : 1 (pid: 1221068)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
time : 2024-05-28_02:39:15
host : gpu10
rank : 3 (local_rank: 3)
exitcode : 1 (pid: 1221069)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-05-28_02:39:15
host : gpu10
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 1221067)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
Can you try running one of the Hugging Face example scripts, run_clm.py, on multiple GPUs?
That will help determine whether the issue is with our specific script or with any multi-GPU training script.
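It may also help to rule out a device-visibility or compute-mode problem with a tiny per-rank allocation test. Here is a minimal sketch (the file name is arbitrary and the exclusive-compute-mode guess is only an assumption on my side, not something confirmed by your logs); launch it the same way as the training script so each process gets its own LOCAL_RANK:

# sanity_check.py -- hypothetical helper, not part of the llm2vec repo
import os
import torch

local_rank = int(os.environ.get("LOCAL_RANK", 0))
torch.cuda.set_device(local_rank)

# A tiny allocation on this rank's device. If a GPU is in Exclusive-Process
# compute mode or is already claimed by another process, this should fail with
# the same "busy or unavailable" error as above.
x = torch.ones(1, device=f"cuda:{local_rank}")
print(f"local_rank={local_rank}, device={torch.cuda.current_device()}, "
      f"name={torch.cuda.get_device_name(local_rank)}, ok={x.item() == 1.0}")

If even this fails, the problem is in the driver/CUDA setup rather than in our script.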
Hi, I have tried running the example script with a similar configuration several times. Here is my command.
torchrun --nproc_per_node=4 examples/pytorch/language-modeling/run_clm.py \
    --model_name_or_path "meta-llama/Llama-2-7b-hf" \
    --num_train_epochs 3 \
    --dataset_name wikitext \
    --dataset_config_name wikitext-103-raw-v1 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --do_train \
    --do_eval \
    --output_dir output/Meta-Llama-2-7b-hf-test \
    --overwrite_output_dir \
    --block_size 2048 \
    --torch_dtype bfloat16
However, I kept getting a CUDA out of memory error.
[rank2]: Traceback (most recent call last):
[rank2]: File "/nfs/RS2416RP/Workspace/ming/hippo/transformers/examples/pytorch/language-modeling/run_clm.py", line 656, in <module>
[rank2]: main()
[rank2]: File "/nfs/RS2416RP/Workspace/ming/hippo/transformers/examples/pytorch/language-modeling/run_clm.py", line 604, in main
[rank2]: train_result = trainer.train(resume_from_checkpoint=checkpoint)
[rank2]: File "/home/ming/.anaconda3/envs/llm2vec/lib/python3.10/site-packages/transformers/trainer.py", line 1912, in train
[rank2]: return inner_training_loop(
[rank2]: File "/home/ming/.anaconda3/envs/llm2vec/lib/python3.10/site-packages/transformers/trainer.py", line 2069, in _inner_training_loop
[rank2]: model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer)
[rank2]: File "/home/ming/.anaconda3/envs/llm2vec/lib/python3.10/site-packages/accelerate/accelerator.py", line 1292, in prepare
[rank2]: result = tuple(
[rank2]: File "/home/ming/.anaconda3/envs/llm2vec/lib/python3.10/site-packages/accelerate/accelerator.py", line 1293, in <genexpr>
[rank2]: self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
[rank2]: File "/home/ming/.anaconda3/envs/llm2vec/lib/python3.10/site-packages/accelerate/accelerator.py", line 1169, in _prepare_one
[rank2]: return self.prepare_model(obj, device_placement=device_placement)
[rank2]: File "/home/ming/.anaconda3/envs/llm2vec/lib/python3.10/site-packages/accelerate/accelerator.py", line 1428, in prepare_model
[rank2]: model = torch.nn.parallel.DistributedDataParallel(
[rank2]: File "/home/ming/.anaconda3/envs/llm2vec/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 812, in __init__
[rank2]: self._ddp_init_helper(
[rank2]: File "/home/ming/.anaconda3/envs/llm2vec/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1152, in _ddp_init_helper
[rank2]: self.reducer = dist.Reducer(
[rank2]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 86.00 MiB. GPU has a total capacity of 23.65 GiB of which 28.69 MiB is free. Including non-PyTorch memory, this process has 23.62 GiB memory in use. Of the allocated memory 23.11 GiB is allocated by PyTorch, and 1.48 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[rank3]: Traceback (most recent call last):
[rank3]: File "/nfs/RS2416RP/Workspace/ming/hippo/transformers/examples/pytorch/language-modeling/run_clm.py", line 656, in <module>
[rank3]: main()
[rank3]: File "/nfs/RS2416RP/Workspace/ming/hippo/transformers/examples/pytorch/language-modeling/run_clm.py", line 604, in main
[rank3]: train_result = trainer.train(resume_from_checkpoint=checkpoint)
[rank3]: File "/home/ming/.anaconda3/envs/llm2vec/lib/python3.10/site-packages/transformers/trainer.py", line 1912, in train
[rank3]: return inner_training_loop(
[rank3]: File "/home/ming/.anaconda3/envs/llm2vec/lib/python3.10/site-packages/transformers/trainer.py", line 2069, in _inner_training_loop
[rank3]: model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer)
[rank3]: File "/home/ming/.anaconda3/envs/llm2vec/lib/python3.10/site-packages/accelerate/accelerator.py", line 1292, in prepare
[rank3]: result = tuple(
[rank3]: File "/home/ming/.anaconda3/envs/llm2vec/lib/python3.10/site-packages/accelerate/accelerator.py", line 1293, in <genexpr>
[rank3]: self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
[rank3]: File "/home/ming/.anaconda3/envs/llm2vec/lib/python3.10/site-packages/accelerate/accelerator.py", line 1169, in _prepare_one
[rank3]: return self.prepare_model(obj, device_placement=device_placement)
[rank3]: File "/home/ming/.anaconda3/envs/llm2vec/lib/python3.10/site-packages/accelerate/accelerator.py", line 1428, in prepare_model
[rank3]: model = torch.nn.parallel.DistributedDataParallel(
[rank3]: File "/home/ming/.anaconda3/envs/llm2vec/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 812, in __init__
[rank3]: self._ddp_init_helper(
[rank3]: File "/home/ming/.anaconda3/envs/llm2vec/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1152, in _ddp_init_helper
[rank3]: self.reducer = dist.Reducer(
[rank3]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 86.00 MiB. GPU has a total capacity of 23.65 GiB of which 28.69 MiB is free. Including non-PyTorch memory, this process has 23.62 GiB memory in use. Of the allocated memory 23.11 GiB is allocated by PyTorch, and 1.48 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[rank1]: Traceback (most recent call last):
[rank1]: File "/nfs/RS2416RP/Workspace/ming/hippo/transformers/examples/pytorch/language-modeling/run_clm.py", line 656, in <module>
[rank1]: main()
[rank1]: File "/nfs/RS2416RP/Workspace/ming/hippo/transformers/examples/pytorch/language-modeling/run_clm.py", line 604, in main
[rank1]: train_result = trainer.train(resume_from_checkpoint=checkpoint)
[rank1]: File "/home/ming/.anaconda3/envs/llm2vec/lib/python3.10/site-packages/transformers/trainer.py", line 1912, in train
[rank1]: return inner_training_loop(
[rank1]: File "/home/ming/.anaconda3/envs/llm2vec/lib/python3.10/site-packages/transformers/trainer.py", line 2069, in _inner_training_loop
[rank1]: model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer)
[rank1]: File "/home/ming/.anaconda3/envs/llm2vec/lib/python3.10/site-packages/accelerate/accelerator.py", line 1292, in prepare
[rank1]: result = tuple(
[rank1]: File "/home/ming/.anaconda3/envs/llm2vec/lib/python3.10/site-packages/accelerate/accelerator.py", line 1293, in <genexpr>
[rank1]: self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
[rank1]: File "/home/ming/.anaconda3/envs/llm2vec/lib/python3.10/site-packages/accelerate/accelerator.py", line 1169, in _prepare_one
[rank1]: return self.prepare_model(obj, device_placement=device_placement)
[rank1]: File "/home/ming/.anaconda3/envs/llm2vec/lib/python3.10/site-packages/accelerate/accelerator.py", line 1428, in prepare_model
[rank1]: model = torch.nn.parallel.DistributedDataParallel(
[rank1]: File "/home/ming/.anaconda3/envs/llm2vec/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 812, in __init__
[rank1]: self._ddp_init_helper(
[rank1]: File "/home/ming/.anaconda3/envs/llm2vec/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1152, in _ddp_init_helper
[rank1]: self.reducer = dist.Reducer(
[rank1]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 86.00 MiB. GPU has a total capacity of 23.65 GiB of which 28.69 MiB is free. Including non-PyTorch memory, this process has 23.62 GiB memory in use. Of the allocated memory 23.11 GiB is allocated by PyTorch, and 1.48 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[rank0]: Traceback (most recent call last):
[rank0]: File "/nfs/RS2416RP/Workspace/ming/hippo/transformers/examples/pytorch/language-modeling/run_clm.py", line 656, in <module>
[rank0]: main()
[rank0]: File "/nfs/RS2416RP/Workspace/ming/hippo/transformers/examples/pytorch/language-modeling/run_clm.py", line 604, in main
[rank0]: train_result = trainer.train(resume_from_checkpoint=checkpoint)
[rank0]: File "/home/ming/.anaconda3/envs/llm2vec/lib/python3.10/site-packages/transformers/trainer.py", line 1912, in train
[rank0]: return inner_training_loop(
[rank0]: File "/home/ming/.anaconda3/envs/llm2vec/lib/python3.10/site-packages/transformers/trainer.py", line 2069, in _inner_training_loop
[rank0]: model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer)
[rank0]: File "/home/ming/.anaconda3/envs/llm2vec/lib/python3.10/site-packages/accelerate/accelerator.py", line 1292, in prepare
[rank0]: result = tuple(
[rank0]: File "/home/ming/.anaconda3/envs/llm2vec/lib/python3.10/site-packages/accelerate/accelerator.py", line 1293, in <genexpr>
[rank0]: self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
[rank0]: File "/home/ming/.anaconda3/envs/llm2vec/lib/python3.10/site-packages/accelerate/accelerator.py", line 1169, in _prepare_one
[rank0]: return self.prepare_model(obj, device_placement=device_placement)
[rank0]: File "/home/ming/.anaconda3/envs/llm2vec/lib/python3.10/site-packages/accelerate/accelerator.py", line 1428, in prepare_model
[rank0]: model = torch.nn.parallel.DistributedDataParallel(
[rank0]: File "/home/ming/.anaconda3/envs/llm2vec/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 812, in __init__
[rank0]: self._ddp_init_helper(
[rank0]: File "/home/ming/.anaconda3/envs/llm2vec/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1152, in _ddp_init_helper
[rank0]: self.reducer = dist.Reducer(
[rank0]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 86.00 MiB. GPU
E0529 02:12:56.615000 139714251892544 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 2087109) of binary: /home/ming/.anaconda3/envs/llm2vec/bin/python
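If my rough arithmetic is right, this OOM may simply be expected for full fine-tuning of a 7B model with plain DDP on 24 GB cards (a back-of-the-envelope estimate of my own, not something taken from the logs):

# Rough estimate only: plain DDP keeps a full copy of the weights and the
# gradients on every rank, before optimizer state and activations.
params = 7e9                                  # Llama-2-7B
bf16_bytes = 2
weights_gib = params * bf16_bytes / 1024**3   # ~13.0 GiB of weights per GPU
grads_gib = weights_gib                       # bf16 gradients, roughly the same again
print(f"{weights_gib + grads_gib:.1f} GiB")   # ~26.1 GiB, already above the 23.65 GiB capacity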
By the way, I added these two lines to my run_clm.py code:
os.environ['NCCL_P2P_DISABLE']="1"
os.environ['NCCL_IB_DISABLE']="1"
Otherwise, I would encounter this error.
NotImplementedError: Using RTX 4000 series doesn't support faster communication broadband via P2P or IB. Please set NCCL_P2P_DISABLE="1" and NCCL_IB_DISABLE="1" or use accelerate launch` which will do this automatically.
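For reference, this is roughly where the two lines sit (a minimal sketch; putting them at the very top of the file was my choice, and the rest of run_clm.py is unchanged). NCCL reads these variables when the process group is created, so they have to be set before distributed initialization:

import os

os.environ['NCCL_P2P_DISABLE'] = "1"   # RTX 40-series consumer cards do not support P2P
os.environ['NCCL_IB_DISABLE'] = "1"    # ... or InfiniBand

# ... the original imports and main() of run_clm.py follow unchanged ...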
Thanks for your help.
Unfortunately, I am not able to reproduce the issue. Both scripts run fine on my end. I have tried 8xA100, 8xV100, 8xH100, and 8xA6000 setups. I don't have access to RTX 4000-series GPUs, so I cannot exactly replicate your hardware setup.
Closing as it cannot be reproduced.
Thanks for sharing this project! I want to run the supervised training task, but I am getting the following error.
I used 10 GPUs to run the supervised training task.
Below are the environment settings for my server:
PyTorch version: 2.3.0+cu121
CUDA version: 12.4
CUDA home: /usr/local/cuda
Available GPUs: 10 x RTX 4090
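In case it helps with reproduction, the same information can be dumped from Python (a generic snippet, not specific to llm2vec; note that torch.version.cuda reports the cu121 build, while the 12.4 above is the driver version reported by nvidia-smi):

import torch

print("PyTorch      :", torch.__version__)          # 2.3.0+cu121
print("CUDA (build) :", torch.version.cuda)          # 12.1
print("GPU count    :", torch.cuda.device_count())   # 10
if torch.cuda.is_available():
    print("GPU model    :", torch.cuda.get_device_name(0))  # NVIDIA GeForce RTX 4090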
It worked fine when I used a single GPU.
Thanks.