McGill-NLP / llm2vec

Code for 'LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders'
https://mcgill-nlp.github.io/llm2vec/
MIT License

RuntimeError: CUDA error: invalid device ordinal, Compile with TORCH_USE_CUDA_DSA to enable device-side assertions #84

Closed Hippo88902 closed 1 month ago

Hippo88902 commented 1 month ago

Thanks for sharing this project! I want to perform the supervised training task, but I am getting the following error.

torchrun --nproc_per_node=10 experiments/run_supervised.py train_configs/supervised/MetaLlama3.json
[rank4]:   File "/home/ming/.anaconda3/envs/llm2vec/lib/python3.10/site-packages/transformers/hfargparser.py", line 398, in parsejsonfile
[rank4]:     outputs = self.parsedict(data, allow_extra_keys=allow_extra_keys)
[rank4]:   File "/home/ming/.anaconda3/envs/llm2vec/lib/python3.10/site-packages/transformers/hf_argparser.py", line 373, in parse_dict
[rank4]:     obj = dtype(**inputs)
[rank4]:   File "<string>", line 124, in __init
[rank4]:   File "/home/ming/.anaconda3/envs/llm2vec/lib/python3.10/site-packages/transformers/training_args.py", line 1551, in __post_init
[rank4]:     and (self.device.type != "cuda")
[rank4]:   File "/home/ming/.anaconda3/envs/llm2vec/lib/python3.10/site-packages/transformers/training_args.py", line 2027, in device
[rank4]:     return self.setupdevices
[rank4]:   File "/home/ming/.anaconda3/envs/llm2vec/lib/python3.10/site-packages/transformers/utils/generic.py", line 63, in __get
[rank4]:     cached = self.fget(obj)
[rank4]:   File "/home/ming/.anaconda3/envs/llm2vec/lib/python3.10/site-packages/transformers/training_args.py", line 1963, in setupdevices
[rank4]:     self.distributedstate = PartialState(
[rank4]:   File "/home/ming/.anaconda3/envs/llm2vec/lib/python3.10/site-packages/accelerate/state.py", line 281, in _init
[rank4]:     self.set_device()
[rank4]:   File "/home/ming/.anaconda3/envs/llm2vec/lib/python3.10/site-packages/accelerate/state.py", line 793, in set_device
[rank4]:     torch.cuda.set_device(self.device)
[rank4]:   File "/home/ming/.anaconda3/envs/llm2vec/lib/python3.10/site-packages/torch/cuda/__init.py", line 399, in set_device
[rank4]:     torch._C._cuda_setDevice(device)
[rank4]: RuntimeError: CUDA error: invalid device ordinal
[rank4]: Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

I used 10 GPUs to run the supervised training task.

Below are the environment settings for my server:

PyTorch version : 2.3.0+cu121
CUDA version : 12.4
CUDA home : /usr/local/cuda
Available GPUs : 10 x RTX 4090

System:

PRETTY_NAME="Ubuntu 22.04.4 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.4 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy

It worked fine when I used a single GPU.

torchrun --nproc_per_node=1 experiments/run_supervised.py train_configs/supervised/MetaLlama3.json
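In case it helps with debugging, here is a small sanity script (my own helper, not part of the repo) that can be run under torchrun to show what each worker sees:

# check_devices.py (hypothetical helper, just for diagnostics)
import os
import torch

local_rank = os.environ.get("LOCAL_RANK", "?")              # set by torchrun for each worker
visible = os.environ.get("CUDA_VISIBLE_DEVICES", "<unset>") # which devices this process may use
print(f"local_rank={local_rank} CUDA_VISIBLE_DEVICES={visible} device_count={torch.cuda.device_count()}")

Run with, e.g., torchrun --nproc_per_node=10 check_devices.py.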


Thanks.

vaibhavad commented 1 month ago

Hi @Hippo88902,

Can you try with 2 or 4 GPUs, just to isolate whether the issue occurs in all multi-GPU setups or only when using 10 GPUs?
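For example, the same command with a smaller process count (assuming the same config file as above):

torchrun --nproc_per_node=2 experiments/run_supervised.py train_configs/supervised/MetaLlama3.json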

Hippo88902 commented 1 month ago

Thanks for your reply! The error still happened when I used 2 or 4 GPUs.

Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:14<00:00,  3.69s/it]
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:15<00:00,  3.80s/it]
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:15<00:00,  3.80s/it]
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:14<00:00,  3.64s/it]
[rank2]: Traceback (most recent call last):
[rank2]:   File "/nfs/RS2416RP/Workspace/ming/hippo/llm2vec/experiments/run_supervised.py", line 472, in <module>
[rank2]:     main()
[rank2]:   File "/nfs/RS2416RP/Workspace/ming/hippo/llm2vec/experiments/run_supervised.py", line 431, in main
[rank2]:     model = LLM2Vec.from_pretrained(
[rank2]:   File "/home/ming/.anaconda3/envs/llm2vec/lib/python3.10/site-packages/llm2vec/llm2vec.py", line 107, in from_pretrained
[rank2]:     model = PeftModel.from_pretrained(
[rank2]:   File "/home/ming/.anaconda3/envs/llm2vec/lib/python3.10/site-packages/peft/peft_model.py", line 430, in from_pretrained
[rank2]:     model.load_adapter(model_id, adapter_name, is_trainable=is_trainable, **kwargs)
[rank2]:   File "/home/ming/.anaconda3/envs/llm2vec/lib/python3.10/site-packages/peft/peft_model.py", line 984, in load_adapter
[rank2]:     adapters_weights = load_peft_weights(model_id, device=torch_device, **hf_hub_download_kwargs)
[rank2]:   File "/home/ming/.anaconda3/envs/llm2vec/lib/python3.10/site-packages/peft/utils/save_and_load.py", line 444, in load_peft_weights
[rank2]:     adapters_weights = safe_load_file(filename, device=device)
[rank2]:   File "/home/ming/.anaconda3/envs/llm2vec/lib/python3.10/site-packages/safetensors/torch.py", line 313, in load_file
[rank2]:     result[k] = f.get_tensor(k)
[rank2]: RuntimeError: CUDA error: CUDA-capable device(s) is/are busy or unavailable
[rank2]: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
[rank2]: For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
[rank2]: Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

[rank3]: Traceback (most recent call last):
[rank3]:   File "/nfs/RS2416RP/Workspace/ming/hippo/llm2vec/experiments/run_supervised.py", line 472, in <module>
[rank3]:     main()
[rank3]:   File "/nfs/RS2416RP/Workspace/ming/hippo/llm2vec/experiments/run_supervised.py", line 431, in main
[rank3]:     model = LLM2Vec.from_pretrained(
[rank3]:   File "/home/ming/.anaconda3/envs/llm2vec/lib/python3.10/site-packages/llm2vec/llm2vec.py", line 107, in from_pretrained
[rank3]:     model = PeftModel.from_pretrained(
[rank3]:   File "/home/ming/.anaconda3/envs/llm2vec/lib/python3.10/site-packages/peft/peft_model.py", line 430, in from_pretrained
[rank3]:     model.load_adapter(model_id, adapter_name, is_trainable=is_trainable, **kwargs)
[rank3]:   File "/home/ming/.anaconda3/envs/llm2vec/lib/python3.10/site-packages/peft/peft_model.py", line 984, in load_adapter
[rank3]:     adapters_weights = load_peft_weights(model_id, device=torch_device, **hf_hub_download_kwargs)
[rank3]:   File "/home/ming/.anaconda3/envs/llm2vec/lib/python3.10/site-packages/peft/utils/save_and_load.py", line 444, in load_peft_weights
[rank3]:     adapters_weights = safe_load_file(filename, device=device)
[rank3]:   File "/home/ming/.anaconda3/envs/llm2vec/lib/python3.10/site-packages/safetensors/torch.py", line 313, in load_file
[rank3]:     result[k] = f.get_tensor(k)
[rank3]: RuntimeError: CUDA error: CUDA-capable device(s) is/are busy or unavailable
[rank3]: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
[rank3]: For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
[rank3]: Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

[rank1]: Traceback (most recent call last):
[rank1]:   File "/nfs/RS2416RP/Workspace/ming/hippo/llm2vec/experiments/run_supervised.py", line 472, in <module>
[rank1]:     main()
[rank1]:   File "/nfs/RS2416RP/Workspace/ming/hippo/llm2vec/experiments/run_supervised.py", line 431, in main
[rank1]:     model = LLM2Vec.from_pretrained(
[rank1]:   File "/home/ming/.anaconda3/envs/llm2vec/lib/python3.10/site-packages/llm2vec/llm2vec.py", line 107, in from_pretrained
[rank1]:     model = PeftModel.from_pretrained(
[rank1]:   File "/home/ming/.anaconda3/envs/llm2vec/lib/python3.10/site-packages/peft/peft_model.py", line 430, in from_pretrained
[rank1]:     model.load_adapter(model_id, adapter_name, is_trainable=is_trainable, **kwargs)
[rank1]:   File "/home/ming/.anaconda3/envs/llm2vec/lib/python3.10/site-packages/peft/peft_model.py", line 984, in load_adapter
[rank1]:     adapters_weights = load_peft_weights(model_id, device=torch_device, **hf_hub_download_kwargs)
[rank1]:   File "/home/ming/.anaconda3/envs/llm2vec/lib/python3.10/site-packages/peft/utils/save_and_load.py", line 444, in load_peft_weights
[rank1]:     adapters_weights = safe_load_file(filename, device=device)
[rank1]:   File "/home/ming/.anaconda3/envs/llm2vec/lib/python3.10/site-packages/safetensors/torch.py", line 313, in load_file
[rank1]:     result[k] = f.get_tensor(k)
[rank1]: RuntimeError: CUDA error: CUDA-capable device(s) is/are busy or unavailable
[rank1]: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
[rank1]: For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
[rank1]: Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

W0528 02:39:15.124000 139924873729856 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1221066 closing signal SIGTERM
E0528 02:39:16.543000 139924873729856 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 1 (pid: 1221067) of binary: /home/ming/.anaconda3/envs/llm2vec/bin/python
Traceback (most recent call last):
  File "/home/ming/.anaconda3/envs/llm2vec/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/ming/.anaconda3/envs/llm2vec/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/home/ming/.anaconda3/envs/llm2vec/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main
    run(args)
  File "/home/ming/.anaconda3/envs/llm2vec/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/home/ming/.anaconda3/envs/llm2vec/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/ming/.anaconda3/envs/llm2vec/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
experiments/run_supervised.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-05-28_02:39:15
  host      : gpu10
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 1221068)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2024-05-28_02:39:15
  host      : gpu10
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 1221069)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-05-28_02:39:15
  host      : gpu10
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 1221067)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
vaibhavad commented 1 month ago

Can you try running one of the Hugging Face example scripts, run_clm.py, on multiple GPUs?

It will help to determine whether the issue is with our specific script or with any multi-GPU training script.
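For reference, a minimal multi-GPU invocation along the lines of the Hugging Face examples (the model and dataset here are only placeholders, adjust as needed):

torchrun --nproc_per_node=4 examples/pytorch/language-modeling/run_clm.py \
    --model_name_or_path gpt2 \
    --dataset_name wikitext \
    --dataset_config_name wikitext-2-raw-v1 \
    --do_train \
    --output_dir /tmp/test-clm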

Hippo88902 commented 1 month ago

Hi, I have tried running the example script with a similar configuration many times. Here is my command.

torchrun --nproc_per_node=4 examples/pytorch/language-modeling/run_clm.py \
     --model_name_or_path "meta-llama/Llama-2-7b-hf" \
     --num_train_epochs 3 \
     --dataset_name wikitext \
     --dataset_config_name wikitext-103-raw-v1 \
     --per_device_train_batch_size 1 \
     --per_device_eval_batch_size 1 \
     --do_train \
     --do_eval \
     --output_dir output/Meta-Llama-2-7b-hf-test \
     --overwrite_output_dir \
     --block_size 2048 \
     --torch_dtype bfloat16

However, I kept getting a CUDA out of memory error.

[rank2]: Traceback (most recent call last):
[rank2]:   File "/nfs/RS2416RP/Workspace/ming/hippo/transformers/examples/pytorch/language-modeling/run_clm.py", line 656, in <module>
[rank2]:     main()
[rank2]:   File "/nfs/RS2416RP/Workspace/ming/hippo/transformers/examples/pytorch/language-modeling/run_clm.py", line 604, in main
[rank2]:     train_result = trainer.train(resume_from_checkpoint=checkpoint)
[rank2]:   File "/home/ming/.anaconda3/envs/llm2vec/lib/python3.10/site-packages/transformers/trainer.py", line 1912, in train
[rank2]:     return inner_training_loop(
[rank2]:   File "/home/ming/.anaconda3/envs/llm2vec/lib/python3.10/site-packages/transformers/trainer.py", line 2069, in _inner_training_loop
[rank2]:     model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer)
[rank2]:   File "/home/ming/.anaconda3/envs/llm2vec/lib/python3.10/site-packages/accelerate/accelerator.py", line 1292, in prepare
[rank2]:     result = tuple(
[rank2]:   File "/home/ming/.anaconda3/envs/llm2vec/lib/python3.10/site-packages/accelerate/accelerator.py", line 1293, in <genexpr>
[rank2]:     self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
[rank2]:   File "/home/ming/.anaconda3/envs/llm2vec/lib/python3.10/site-packages/accelerate/accelerator.py", line 1169, in _prepare_one
[rank2]:     return self.prepare_model(obj, device_placement=device_placement)
[rank2]:   File "/home/ming/.anaconda3/envs/llm2vec/lib/python3.10/site-packages/accelerate/accelerator.py", line 1428, in prepare_model
[rank2]:     model = torch.nn.parallel.DistributedDataParallel(
[rank2]:   File "/home/ming/.anaconda3/envs/llm2vec/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 812, in __init__
[rank2]:     self._ddp_init_helper(
[rank2]:   File "/home/ming/.anaconda3/envs/llm2vec/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1152, in _ddp_init_helper
[rank2]:     self.reducer = dist.Reducer(
[rank2]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 86.00 MiB. GPU  has a total capacity of 23.65 GiB of which 28.69 MiB is free. Including non-PyTorch memory, this process has 23.62 GiB memory in use. Of the allocated memory 23.11 GiB is allocated by PyTorch, and 1.48 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[rank3]: Traceback (most recent call last):
[rank3]:   File "/nfs/RS2416RP/Workspace/ming/hippo/transformers/examples/pytorch/language-modeling/run_clm.py", line 656, in <module>
[rank3]:     main()
[rank3]:   File "/nfs/RS2416RP/Workspace/ming/hippo/transformers/examples/pytorch/language-modeling/run_clm.py", line 604, in main
[rank3]:     train_result = trainer.train(resume_from_checkpoint=checkpoint)
[rank3]:   File "/home/ming/.anaconda3/envs/llm2vec/lib/python3.10/site-packages/transformers/trainer.py", line 1912, in train
[rank3]:     return inner_training_loop(
[rank3]:   File "/home/ming/.anaconda3/envs/llm2vec/lib/python3.10/site-packages/transformers/trainer.py", line 2069, in _inner_training_loop
[rank3]:     model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer)
[rank3]:   File "/home/ming/.anaconda3/envs/llm2vec/lib/python3.10/site-packages/accelerate/accelerator.py", line 1292, in prepare
[rank3]:     result = tuple(
[rank3]:   File "/home/ming/.anaconda3/envs/llm2vec/lib/python3.10/site-packages/accelerate/accelerator.py", line 1293, in <genexpr>
[rank3]:     self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
[rank3]:   File "/home/ming/.anaconda3/envs/llm2vec/lib/python3.10/site-packages/accelerate/accelerator.py", line 1169, in _prepare_one
[rank3]:     return self.prepare_model(obj, device_placement=device_placement)
[rank3]:   File "/home/ming/.anaconda3/envs/llm2vec/lib/python3.10/site-packages/accelerate/accelerator.py", line 1428, in prepare_model
[rank3]:     model = torch.nn.parallel.DistributedDataParallel(
[rank3]:   File "/home/ming/.anaconda3/envs/llm2vec/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 812, in __init__
[rank3]:     self._ddp_init_helper(
[rank3]:   File "/home/ming/.anaconda3/envs/llm2vec/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1152, in _ddp_init_helper
[rank3]:     self.reducer = dist.Reducer(
[rank3]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 86.00 MiB. GPU  has a total capacity of 23.65 GiB of which 28.69 MiB is free. Including non-PyTorch memory, this process has 23.62 GiB memory in use. Of the allocated memory 23.11 GiB is allocated by PyTorch, and 1.48 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[rank1]: Traceback (most recent call last):
[rank1]:   File "/nfs/RS2416RP/Workspace/ming/hippo/transformers/examples/pytorch/language-modeling/run_clm.py", line 656, in <module>
[rank1]:     main()
[rank1]:   File "/nfs/RS2416RP/Workspace/ming/hippo/transformers/examples/pytorch/language-modeling/run_clm.py", line 604, in main
[rank1]:     train_result = trainer.train(resume_from_checkpoint=checkpoint)
[rank1]:   File "/home/ming/.anaconda3/envs/llm2vec/lib/python3.10/site-packages/transformers/trainer.py", line 1912, in train
[rank1]:     return inner_training_loop(
[rank1]:   File "/home/ming/.anaconda3/envs/llm2vec/lib/python3.10/site-packages/transformers/trainer.py", line 2069, in _inner_training_loop
[rank1]:     model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer)
[rank1]:   File "/home/ming/.anaconda3/envs/llm2vec/lib/python3.10/site-packages/accelerate/accelerator.py", line 1292, in prepare
[rank1]:     result = tuple(
[rank1]:   File "/home/ming/.anaconda3/envs/llm2vec/lib/python3.10/site-packages/accelerate/accelerator.py", line 1293, in <genexpr>
[rank1]:     self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
[rank1]:   File "/home/ming/.anaconda3/envs/llm2vec/lib/python3.10/site-packages/accelerate/accelerator.py", line 1169, in _prepare_one
[rank1]:     return self.prepare_model(obj, device_placement=device_placement)
[rank1]:   File "/home/ming/.anaconda3/envs/llm2vec/lib/python3.10/site-packages/accelerate/accelerator.py", line 1428, in prepare_model
[rank1]:     model = torch.nn.parallel.DistributedDataParallel(
[rank1]:   File "/home/ming/.anaconda3/envs/llm2vec/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 812, in __init__
[rank1]:     self._ddp_init_helper(
[rank1]:   File "/home/ming/.anaconda3/envs/llm2vec/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1152, in _ddp_init_helper
[rank1]:     self.reducer = dist.Reducer(
[rank1]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 86.00 MiB. GPU  has a total capacity of 23.65 GiB of which 28.69 MiB is free. Including non-PyTorch memory, this process has 23.62 GiB memory in use. Of the allocated memory 23.11 GiB is allocated by PyTorch, and 1.48 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[rank0]: Traceback (most recent call last):
[rank0]:   File "/nfs/RS2416RP/Workspace/ming/hippo/transformers/examples/pytorch/language-modeling/run_clm.py", line 656, in <module>
[rank0]:     main()
[rank0]:   File "/nfs/RS2416RP/Workspace/ming/hippo/transformers/examples/pytorch/language-modeling/run_clm.py", line 604, in main
[rank0]:     train_result = trainer.train(resume_from_checkpoint=checkpoint)
[rank0]:   File "/home/ming/.anaconda3/envs/llm2vec/lib/python3.10/site-packages/transformers/trainer.py", line 1912, in train
[rank0]:     return inner_training_loop(
[rank0]:   File "/home/ming/.anaconda3/envs/llm2vec/lib/python3.10/site-packages/transformers/trainer.py", line 2069, in _inner_training_loop
[rank0]:     model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer)
[rank0]:   File "/home/ming/.anaconda3/envs/llm2vec/lib/python3.10/site-packages/accelerate/accelerator.py", line 1292, in prepare
[rank0]:     result = tuple(
[rank0]:   File "/home/ming/.anaconda3/envs/llm2vec/lib/python3.10/site-packages/accelerate/accelerator.py", line 1293, in <genexpr>
[rank0]:     self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
[rank0]:   File "/home/ming/.anaconda3/envs/llm2vec/lib/python3.10/site-packages/accelerate/accelerator.py", line 1169, in _prepare_one
[rank0]:     return self.prepare_model(obj, device_placement=device_placement)
[rank0]:   File "/home/ming/.anaconda3/envs/llm2vec/lib/python3.10/site-packages/accelerate/accelerator.py", line 1428, in prepare_model
[rank0]:     model = torch.nn.parallel.DistributedDataParallel(
[rank0]:   File "/home/ming/.anaconda3/envs/llm2vec/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 812, in __init__
[rank0]:     self._ddp_init_helper(
[rank0]:   File "/home/ming/.anaconda3/envs/llm2vec/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1152, in _ddp_init_helper
[rank0]:     self.reducer = dist.Reducer(
[rank0]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 86.00 MiB. GPU
E0529 02:12:56.615000 139714251892544 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 2087109) of binary: /home/ming/.anaconda3/envs/llm2vec/bin/python

By the way, I added these two lines to my run_clm.py code:

os.environ['NCCL_P2P_DISABLE']="1"
os.environ['NCCL_IB_DISABLE']="1"

Otherwise, I would encounter this error.

NotImplementedError: Using RTX 4000 series doesn't support faster communication broadband via P2P or IB. Please set NCCL_P2P_DISABLE="1" and NCCL_IB_DISABLE="1" or use `accelerate launch` which will do this automatically.
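Setting the same variables in the shell before launching should also work (just a sketch, I have not checked whether it changes the outcome):

export NCCL_P2P_DISABLE=1
export NCCL_IB_DISABLE=1
torchrun --nproc_per_node=4 examples/pytorch/language-modeling/run_clm.py ...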

Thanks for your help.

vaibhavad commented 1 month ago

Unfortunately, I am not able to reproduce the issue. Both scripts run on my end. I have tried 8xA100, 8xV100, 8xH100, and 8xA6000 setups. I don't have access to RTX 4000 series GPUs, so I cannot exactly replicate your hardware setup.

vaibhavad commented 1 month ago

Closing as it cannot be reproduced.