NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

"torch.distributed.DistBackendError: NCCL error" when using multiple nodes for training a larger LLM #1235

Open StwayneXG opened 5 months ago

StwayneXG commented 5 months ago

I trained a Llama2-3B model using OpenRLHF and it trained fine. But when I switched to the 7B version of the model, I had to move to multiple nodes and ran into this error. After contacting the support team for the cluster, they said it is most likely an NCCL issue. Any help would be appreciated.

GitHub Repo for OpenRLHF: https://github.com/OpenLLMAI/OpenRLHF/

/lustre06/project/6000242/irtaza11/RLHF/openrlhf2/lib/python3.11/site-packages/torch/autograd/__init__.py:266: UserWarning: c10d::broadcast_: an autograd kernel was not registered to the Autograd key(s) but we are trying to backprop through it. This may lead to silently incorrect behavior. This behavior is deprecated and will be removed in a future version of PyTorch. If your operator is differentiable, please ensure you have registered an autograd kernel to the correct Autograd key (e.g. DispatchKey::Autograd, DispatchKey::CompositeImplicitAutograd). If your operator is not differentiable, or to squash this warning and use the previous behavior, please register torch::CppFunction::makeFallthrough() to DispatchKey::Autograd. (Triggered internally at /home/coulombc/wheels_builder/tmp.29658/python-3.11/torch/torch/csrc/autograd/autograd_not_implemented_fallback.cpp:63.)
  Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
Traceback (most recent call last):
  File "/lustre06/project/6000242/irtaza11/RLHF/OpenRLHF/examples/scripts/../train_rm.py", line 184, in <module>
    train(args)
  File "/lustre06/project/6000242/irtaza11/RLHF/OpenRLHF/examples/scripts/../train_rm.py", line 112, in train
    trainer.fit(args)
  File "/home/irtaza11/projects/def-zmjiang/irtaza11/RLHF/OpenRLHF/openrlhf/trainer/rm_trainer.py", line 137, in fit
    self.strategy.backward(loss, self.model, self.optimizer)
  File "/home/irtaza11/projects/def-zmjiang/irtaza11/RLHF/OpenRLHF/openrlhf/utils/deepspeed.py", line 97, in backward
    model.backward(loss)
  File "/lustre06/project/6000242/irtaza11/RLHF/openrlhf2/lib/python3.11/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^
  File "/lustre06/project/6000242/irtaza11/RLHF/openrlhf2/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 1964, in backward
    self.optimizer.backward(loss, retain_graph=retain_graph)
  File "/lustre06/project/6000242/irtaza11/RLHF/openrlhf2/lib/python3.11/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^
  File "/lustre06/project/6000242/irtaza11/RLHF/openrlhf2/lib/python3.11/site-packages/deepspeed/runtime/zero/stage3.py", line 2152, in backward
    self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
  File "/lustre06/project/6000242/irtaza11/RLHF/openrlhf2/lib/python3.11/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward
    scaled_loss.backward(retain_graph=retain_graph)
  File "/lustre06/project/6000242/irtaza11/RLHF/openrlhf2/lib/python3.11/site-packages/torch/_tensor.py", line 522, in backward
    torch.autograd.backward(
  File "/lustre06/project/6000242/irtaza11/RLHF/openrlhf2/lib/python3.11/site-packages/torch/autograd/__init__.py", line 266, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/lustre06/project/6000242/irtaza11/RLHF/openrlhf2/lib/python3.11/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^
  File "/lustre06/project/6000242/irtaza11/RLHF/openrlhf2/lib/python3.11/site-packages/deepspeed/runtime/zero/stage3.py", line 1129, in reduce_partition_and_remove_grads
    self.reduce_ready_partitions_and_remove_grads(param)
  File "/lustre06/project/6000242/irtaza11/RLHF/openrlhf2/lib/python3.11/site-packages/deepspeed/runtime/zero/stage3.py", line 1422, in reduce_ready_partitions_and_remove_grads
    self.reduce_independent_p_g_buckets_and_remove_grads(param)
  File "/lustre06/project/6000242/irtaza11/RLHF/openrlhf2/lib/python3.11/site-packages/deepspeed/runtime/zero/stage3.py", line 1163, in reduce_independent_p_g_buckets_and_remove_grads
    self.__reduce_and_partition_ipg_grads()
  File "/lustre06/project/6000242/irtaza11/RLHF/openrlhf2/lib/python3.11/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^
  File "/lustre06/project/6000242/irtaza11/RLHF/openrlhf2/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/lustre06/project/6000242/irtaza11/RLHF/openrlhf2/lib/python3.11/site-packages/deepspeed/runtime/zero/stage3.py", line 1213, in __reduce_and_partition_ipg_grads
    grad_partitions = self.__avg_scatter_grads(self.params_in_ipg_bucket)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lustre06/project/6000242/irtaza11/RLHF/openrlhf2/lib/python3.11/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^
  File "/lustre06/project/6000242/irtaza11/RLHF/openrlhf2/lib/python3.11/site-packages/deepspeed/runtime/zero/stage3.py", line 1282, in __avg_scatter_grads
    grad_partitions_for_rank = reduce_scatter_coalesced(full_grads_for_rank, self.dp_process_group)
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lustre06/project/6000242/irtaza11/RLHF/openrlhf2/lib/python3.11/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^
  File "/lustre06/project/6000242/irtaza11/RLHF/openrlhf2/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/lustre06/project/6000242/irtaza11/RLHF/openrlhf2/lib/python3.11/site-packages/deepspeed/runtime/comm/coalesced_collectives.py", line 120, in reduce_scatter_coalesced
    _torch_reduce_scatter_fn(tensor_partition_flat_buffer,
  File "/lustre06/project/6000242/irtaza11/RLHF/openrlhf2/lib/python3.11/site-packages/deepspeed/runtime/comm/coalesced_collectives.py", line 23, in _torch_reduce_scatter_fn
    return instrument_w_nvtx(dist.reduce_scatter_fn)(output_tensor, input_tensor, group=group, async_op=False)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lustre06/project/6000242/irtaza11/RLHF/openrlhf2/lib/python3.11/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^
  File "/lustre06/project/6000242/irtaza11/RLHF/openrlhf2/lib/python3.11/site-packages/deepspeed/comm/comm.py", line 257, in reduce_scatter_fn
    return reduce_scatter_tensor(output_tensor,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lustre06/project/6000242/irtaza11/RLHF/openrlhf2/lib/python3.11/site-packages/deepspeed/comm/comm.py", line 117, in log_wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/lustre06/project/6000242/irtaza11/RLHF/openrlhf2/lib/python3.11/site-packages/deepspeed/comm/comm.py", line 289, in reduce_scatter_tensor
    return cdb.reduce_scatter_tensor(output_tensor=output_tensor,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lustre06/project/6000242/irtaza11/RLHF/openrlhf2/lib/python3.11/site-packages/deepspeed/comm/torch.py", line 255, in reduce_scatter_tensor
    return self.reduce_scatter_function(output_tensor,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lustre06/project/6000242/irtaza11/RLHF/openrlhf2/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/lustre06/project/6000242/irtaza11/RLHF/openrlhf2/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 3119, in reduce_scatter_tensor
    work = group._reduce_scatter_base(output, input, opts)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.distributed.DistBackendError: NCCL error in: /home/coulombc/wheels_builder/tmp.29658/python-3.11/torch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1691, internal error - please report this issue to the NCCL developers, NCCL version 2.18.3
ncclInternalError: Internal check failed.
Last error:
Socket recv failed while polling for opId=0x1516cc189080

(A second rank printed the same warning and an identical traceback, ending with: Socket recv failed while polling for opId=0x14bf24188590)
sjeaugey commented 5 months ago

NCCL 2.18.3 is mistakenly reporting all network errors as "internal error". You should set NCCL_DEBUG=WARN (or NCCL_DEBUG=INFO, although that's more verbose); that will make NCCL print an error message explaining why the NCCL operation failed. It's likely a networking setup issue if you have never run on multiple nodes before.
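One way to act on this advice is to export NCCL_DEBUG=WARN in the job script and first run a tiny standalone collective across both nodes, so NCCL's own warning reports the network problem without waiting for the full training job to fail. The sketch below is illustrative only (not from this thread): the script name nccl_check.py and the torchrun launch line are placeholders, and it assumes the usual torchrun-provided environment variables (RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT).

# nccl_check.py -- minimal multi-node NCCL sanity check (hypothetical helper, not part of OpenRLHF)
# Example launch (flags abbreviated):
#   NCCL_DEBUG=WARN torchrun --nnodes=2 --nproc_per_node=4 ... nccl_check.py
import os
import torch
import torch.distributed as dist

def main():
    # NCCL_DEBUG is normally exported in the job script; setting it here also
    # works because NCCL reads it when the communicator is first created.
    os.environ.setdefault("NCCL_DEBUG", "WARN")

    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")  # rendezvous via the env:// variables set by torchrun

    # reduce_scatter_tensor mirrors the collective that failed in the traceback above.
    world_size = dist.get_world_size()
    inp = torch.ones(world_size * 1024, device="cuda")
    out = torch.empty(1024, device="cuda")
    dist.reduce_scatter_tensor(out, inp)  # each element of out should equal world_size
    torch.cuda.synchronize()

    if dist.get_rank() == 0:
        print(f"reduce_scatter over {world_size} ranks OK, out[0] = {out[0].item()}")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()

If this small test fails across nodes with NCCL_DEBUG=WARN set, the warning NCCL prints (interface selection, firewall/port, IB vs. TCP, etc.) usually points at the networking issue directly.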

StwayneXG commented 5 months ago

Here is a rerun of the script with NCCL_DEBUG=INFO.

[2024-03-25 14:44:07,376] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-03-25 14:46:39,512] [WARNING] [runner.py:202:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
Detected CUDA_VISIBLE_DEVICES=0,1,2,3: setting --include=localhost:0,1,2,3
[2024-03-25 14:46:39,512] [INFO] [runner.py:571:main] cmd = /lustre06/project/6000242/irtaza11/RLHF/openrlhf2/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgM119 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None ../train_rm.py --save_path ./ckpt/7b_openllama2_rm --save_steps -1 --logging_steps 1 --eval_steps -1 --train_batch_size 128 --micro_train_batch_size 1 --pretrain /home/irtaza11/projects/def-zmjiang/irtaza11/RLHF/models/self_trained/codellama/CodeLlama-7b-hf/ --bf16 --max_epochs 3 --max_len 2048 --zero_stage 3 --learning_rate 9e-6 --dataset /home/irtaza11/projects/def-zmjiang/irtaza11/RLHF/LIBRO/saved_to_disk/defects4j-rlmf --dataset_probs 1.00 --max_samples 500000 --flash_attn --gradient_checkpointing
[2024-03-25 14:46:41,156] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-03-25 14:46:43,791] [INFO] [launch.py:138:main] 0 EBVERSIONNCCL=2.18.3
[2024-03-25 14:46:43,791] [INFO] [launch.py:138:main] 0 EBROOTNCCL=/cvmfs/soft.computecanada.ca/easybuild/software/2023/x86-64-v3/CUDA/gcccore/cuda12.2/nccl/2.18.3
[2024-03-25 14:46:43,791] [INFO] [launch.py:138:main] 0 EBDEVELNCCL=/cvmfs/soft.computecanada.ca/easybuild/software/2023/x86-64-v3/CUDA/gcccore/cuda12.2/nccl/2.18.3/easybuild/x86-64-v3-CUDA-gcccore-cuda12.2-nccl-2.18.3-easybuild-devel
[2024-03-25 14:46:43,791] [INFO] [launch.py:138:main] 0 NCCL_DEBUG=INFO
[2024-03-25 14:46:43,791] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3]}
[2024-03-25 14:46:43,791] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=4, node_rank=0
[2024-03-25 14:46:43,791] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3]})
[2024-03-25 14:46:43,791] [INFO] [launch.py:163:main] dist_world_size=4
[2024-03-25 14:46:43,791] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3
[2024-03-25 14:47:51,680] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-03-25 14:47:51,681] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-03-25 14:47:51,682] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-03-25 14:47:51,682] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
/lustre06/project/6000242/irtaza11/RLHF/openrlhf2/lib/python3.11/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
  warnings.warn(
/lustre06/project/6000242/irtaza11/RLHF/openrlhf2/lib/python3.11/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
  warnings.warn(
/lustre06/project/6000242/irtaza11/RLHF/openrlhf2/lib/python3.11/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
  warnings.warn(
/lustre06/project/6000242/irtaza11/RLHF/openrlhf2/lib/python3.11/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
  warnings.warn(
[2024-03-25 14:48:28,501] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-03-25 14:48:28,511] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2024-03-25 14:48:29,459] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-03-25 14:48:29,462] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-03-25 14:48:29,469] [INFO] [comm.py:637:init_distributed] cdb=None
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
INFO 03-25 14:48:29 model.py:187] Monkey patch for Flash Attention, see https://github.com/huggingface/transformers/issues/28052
INFO 03-25 14:48:29 model.py:187] Monkey patch for Flash Attention, see https://github.com/huggingface/transformers/issues/28052
INFO 03-25 14:48:29 model.py:187] Monkey patch for Flash Attention, see https://github.com/huggingface/transformers/issues/28052
INFO 03-25 14:48:29 model.py:187] Monkey patch for Flash Attention, see https://github.com/huggingface/transformers/issues/28052
INFO 03-25 14:48:29 model.py:187] Monkey patch for Flash Attention, see https://github.com/huggingface/transformers/issues/28052
INFO 03-25 14:48:29 model.py:187] Monkey patch for Flash Attention, see https://github.com/huggingface/transformers/issues/28052
INFO 03-25 14:48:29 model.py:187] Monkey patch for Flash Attention, see https://github.com/huggingface/transformers/issues/28052
INFO 03-25 14:48:29 model.py:187] Monkey patch for Flash Attention, see https://github.com/huggingface/transformers/issues/28052
ng30903:3063025:3063025 [0] NCCL INFO Bootstrap : Using ib0:10.82.89.103<0>
ng30903:3063025:3063025 [0] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
ng30903:3063025:3063025 [0] NCCL INFO NET/Plugin : No plugin found, using internal implementation
ng30903:3063026:3063026 [1] NCCL INFO cudaDriverVersion 12040
ng30903:3063028:3063028 [3] NCCL INFO cudaDriverVersion 12040
ng30903:3063027:3063027 [2] NCCL INFO cudaDriverVersion 12040
ng30903:3063025:3063025 [0] NCCL INFO cudaDriverVersion 12040
NCCL version 2.18.3+cuda12.2
ng30903:3063028:3063028 [3] NCCL INFO Bootstrap : Using ib0:10.82.89.103<0>
ng30903:3063027:3063027 [2] NCCL INFO Bootstrap : Using ib0:10.82.89.103<0>
ng30903:3063026:3063026 [1] NCCL INFO Bootstrap : Using ib0:10.82.89.103<0>
ng30903:3063026:3063026 [1] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
ng30903:3063026:3063026 [1] NCCL INFO NET/Plugin : No plugin found, using internal implementation
ng30903:3063027:3063027 [2] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
ng30903:3063027:3063027 [2] NCCL INFO NET/Plugin : No plugin found, using internal implementation
ng30903:3063028:3063028 [3] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
ng30903:3063028:3063028 [3] NCCL INFO NET/Plugin : No plugin found, using internal implementation
ng30903:3063025:3064896 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [RO]; OOB ib0:10.82.89.103<0>
ng30903:3063025:3064896 [0] NCCL INFO Using network IB
ng30903:3063026:3064900 [1] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [RO]; OOB ib0:10.82.89.103<0>
ng30903:3063026:3064900 [1] NCCL INFO Using network IB
ng30903:3063028:3064912 [3] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [RO]; OOB ib0:10.82.89.103<0>
ng30903:3063028:3064912 [3] NCCL INFO Using network IB
ng30903:3063027:3064907 [2] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [RO]; OOB ib0:10.82.89.103<0>
ng30903:3063027:3064907 [2] NCCL INFO Using network IB
ng30903:3063026:3064900 [1] NCCL INFO comm 0x557bceb28220 rank 1 nranks 4 cudaDev 1 nvmlDev 1 busId 41000 commId 0x352b6d31ae50e36a - Init START
ng30903:3063027:3064907 [2] NCCL INFO comm 0x55be98f26660 rank 2 nranks 4 cudaDev 2 nvmlDev 2 busId 81000 commId 0x352b6d31ae50e36a - Init START
ng30903:3063028:3064912 [3] NCCL INFO comm 0x55f461570670 rank 3 nranks 4 cudaDev 3 nvmlDev 3 busId c1000 commId 0x352b6d31ae50e36a - Init START
ng30903:3063025:3064896 [0] NCCL INFO comm 0x55a4f74c68f0 rank 0 nranks 4 cudaDev 0 nvmlDev 0 busId 1000 commId 0x352b6d31ae50e36a - Init START
ng30903:3063025:3064896 [0] NCCL INFO NVLS multicast support is not available on dev 0
ng30903:3063026:3064900 [1] NCCL INFO NVLS multicast support is not available on dev 1
ng30903:3063027:3064907 [2] NCCL INFO NVLS multicast support is not available on dev 2
ng30903:3063028:3064912 [3] NCCL INFO NVLS multicast support is not available on dev 3
ng30903:3063028:3064912 [3] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] -1/-1/-1->3->2 [2] -1/-1/-1->3->2 [3] -1/-1/-1->3->2 [4] -1/-1/-1->3->1 [5] -1/-1/-1->3->1 [6] -1/-1/-1->3->1 [7] -1/-1/-1->3->1 [8] 2/-1/-1->3->0 [9] 2/-1/-1->3->0 [10] 2/-1/-1->3->0 [11] 2/-1/-1->3->0 [12] -1/-1/-1->3->2 [13] -1/-1/-1->3->2 [14] -1/-1/-1->3->2 [15] -1/-1/-1->3->2 [16] -1/-1/-1->3->1 [17] -1/-1/-1->3->1 [18] -1/-1/-1->3->1 [19] -1/-1/-1->3->1 [20] 2/-1/-1->3->0 [21] 2/-1/-1->3->0 [22] 2/-1/-1->3->0 [23] 2/-1/-1->3->0
ng30903:3063028:3064912 [3] NCCL INFO P2P Chunksize set to 524288
ng30903:3063025:3064896 [0] NCCL INFO Channel 00/24 :    0   1   2   3
ng30903:3063025:3064896 [0] NCCL INFO Channel 01/24 :    0   1   3   2
ng30903:3063025:3064896 [0] NCCL INFO Channel 02/24 :    0   2   3   1
ng30903:3063025:3064896 [0] NCCL INFO Channel 03/24 :    0   2   1   3
ng30903:3063025:3064896 [0] NCCL INFO Channel 04/24 :    0   3   1   2
ng30903:3063025:3064896 [0] NCCL INFO Channel 05/24 :    0   3   2   1
ng30903:3063025:3064896 [0] NCCL INFO Channel 06/24 :    0   1   2   3
ng30903:3063025:3064896 [0] NCCL INFO Channel 07/24 :    0   1   3   2
ng30903:3063025:3064896 [0] NCCL INFO Channel 08/24 :    0   2   3   1
ng30903:3063025:3064896 [0] NCCL INFO Channel 09/24 :    0   2   1   3
ng30903:3063025:3064896 [0] NCCL INFO Channel 10/24 :    0   3   1   2
ng30903:3063025:3064896 [0] NCCL INFO Channel 11/24 :    0   3   2   1
ng30903:3063025:3064896 [0] NCCL INFO Channel 12/24 :    0   1   2   3
ng30903:3063025:3064896 [0] NCCL INFO Channel 13/24 :    0   1   3   2
ng30903:3063025:3064896 [0] NCCL INFO Channel 14/24 :    0   2   3   1
ng30903:3063025:3064896 [0] NCCL INFO Channel 15/24 :    0   2   1   3
ng30903:3063025:3064896 [0] NCCL INFO Channel 16/24 :    0   3   1   2
ng30903:3063025:3064896 [0] NCCL INFO Channel 17/24 :    0   3   2   1
ng30903:3063025:3064896 [0] NCCL INFO Channel 18/24 :    0   1   2   3
ng30903:3063025:3064896 [0] NCCL INFO Channel 19/24 :    0   1   3   2
ng30903:3063025:3064896 [0] NCCL INFO Channel 20/24 :    0   2   3   1
ng30903:3063025:3064896 [0] NCCL INFO Channel 21/24 :    0   2   1   3
ng30903:3063025:3064896 [0] NCCL INFO Channel 22/24 :    0   3   1   2
ng30903:3063025:3064896 [0] NCCL INFO Channel 23/24 :    0   3   2   1
ng30903:3063025:3064896 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 2/-1/-1->0->-1 [5] 2/-1/-1->0->-1 [6] 2/-1/-1->0->-1 [7] 2/-1/-1->0->-1 [8] 3/-1/-1->0->1 [9] 3/-1/-1->0->1 [10] 3/-1/-1->0->1 [11] 3/-1/-1->0->1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1 [16] 2/-1/-1->0->-1 [17] 2/-1/-1->0->-1 [18] 2/-1/-1->0->-1 [19] 2/-1/-1->0->-1 [20] 3/-1/-1->0->1 [21] 3/-1/-1->0->1 [22] 3/-1/-1->0->1 [23] 3/-1/-1->0->1
ng30903:3063025:3064896 [0] NCCL INFO P2P Chunksize set to 524288
ng30903:3063026:3064900 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0 [2] 2/-1/-1->1->0 [3] 2/-1/-1->1->0 [4] 3/-1/-1->1->2 [5] 3/-1/-1->1->2 [6] 3/-1/-1->1->2 [7] 3/-1/-1->1->2 [8] 0/-1/-1->1->-1 [9] 0/-1/-1->1->-1 [10] 0/-1/-1->1->-1 [11] 0/-1/-1->1->-1 [12] 2/-1/-1->1->0 [13] 2/-1/-1->1->0 [14] 2/-1/-1->1->0 [15] 2/-1/-1->1->0 [16] 3/-1/-1->1->2 [17] 3/-1/-1->1->2 [18] 3/-1/-1->1->2 [19] 3/-1/-1->1->2 [20] 0/-1/-1->1->-1 [21] 0/-1/-1->1->-1 [22] 0/-1/-1->1->-1 [23] 0/-1/-1->1->-1
ng30903:3063026:3064900 [1] NCCL INFO P2P Chunksize set to 524288
ng30903:3063027:3064907 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1 [2] 3/-1/-1->2->1 [3] 3/-1/-1->2->1 [4] 1/-1/-1->2->0 [5] 1/-1/-1->2->0 [6] 1/-1/-1->2->0 [7] 1/-1/-1->2->0 [8] -1/-1/-1->2->3 [9] -1/-1/-1->2->3 [10] -1/-1/-1->2->3 [11] -1/-1/-1->2->3 [12] 3/-1/-1->2->1 [13] 3/-1/-1->2->1 [14] 3/-1/-1->2->1 [15] 3/-1/-1->2->1 [16] 1/-1/-1->2->0 [17] 1/-1/-1->2->0 [18] 1/-1/-1->2->0 [19] 1/-1/-1->2->0 [20] -1/-1/-1->2->3 [21] -1/-1/-1->2->3 [22] -1/-1/-1->2->3 [23] -1/-1/-1->2->3
ng30903:3063027:3064907 [2] NCCL INFO P2P Chunksize set to 524288
ng30903:3063025:3064896 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/IPC/read
ng30903:3063025:3064896 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] via P2P/IPC/read
ng30903:3063028:3064912 [3] NCCL INFO Channel 00/0 : 3[3] -> 0[0] via P2P/IPC/read
ng30903:3063026:3064900 [1] NCCL INFO Channel 00/0 : 1[1] -> 2[2] via P2P/IPC/read
ng30903:3063025:3064896 [0] NCCL INFO Channel 06/0 : 0[0] -> 1[1] via P2P/IPC/read
ng30903:3063027:3064907 [2] NCCL INFO Channel 00/0 : 2[2] -> 3[3] via P2P/IPC/read
ng30903:3063027:3064907 [2] NCCL INFO Channel 02/0 : 2[2] -> 3[3] via P2P/IPC/read
ng30903:3063028:3064912 [3] NCCL INFO Channel 03/0 : 3[3] -> 0[0] via P2P/IPC/read
ng30903:3063027:3064907 [2] NCCL INFO Channel 06/0 : 2[2] -> 3[3] via P2P/IPC/read
ng30903:3063026:3064900 [1] NCCL INFO Channel 04/0 : 1[1] -> 2[2] via P2P/IPC/read
ng30903:3063025:3064896 [0] NCCL INFO Channel 07/0 : 0[0] -> 1[1] via P2P/IPC/read
ng30903:3063025:3064896 [0] NCCL INFO Channel 12/0 : 0[0] -> 1[1] via P2P/IPC/read
ng30903:3063025:3064896 [0] NCCL INFO Channel 13/0 : 0[0] -> 1[1] via P2P/IPC/read
ng30903:3063028:3064912 [3] NCCL INFO Channel 06/0 : 3[3] -> 0[0] via P2P/IPC/read
ng30903:3063027:3064907 [2] NCCL INFO Channel 08/0 : 2[2] -> 3[3] via P2P/IPC/read
ng30903:3063026:3064900 [1] NCCL INFO Channel 06/0 : 1[1] -> 2[2] via P2P/IPC/read
ng30903:3063028:3064912 [3] NCCL INFO Channel 09/0 : 3[3] -> 0[0] via P2P/IPC/read
ng30903:3063025:3064896 [0] NCCL INFO Channel 18/0 : 0[0] -> 1[1] via P2P/IPC/read
ng30903:3063027:3064907 [2] NCCL INFO Channel 12/0 : 2[2] -> 3[3] via P2P/IPC/read
ng30903:3063026:3064900 [1] NCCL INFO Channel 10/0 : 1[1] -> 2[2] via P2P/IPC/read
ng30903:3063026:3064900 [1] NCCL INFO Channel 12/0 : 1[1] -> 2[2] via P2P/IPC/read
ng30903:3063028:3064912 [3] NCCL INFO Channel 12/0 : 3[3] -> 0[0] via P2P/IPC/read
ng30903:3063026:3064900 [1] NCCL INFO Channel 16/0 : 1[1] -> 2[2] via P2P/IPC/read
ng30903:3063028:3064912 [3] NCCL INFO Channel 15/0 : 3[3] -> 0[0] via P2P/IPC/read
ng30903:3063025:3064896 [0] NCCL INFO Channel 19/0 : 0[0] -> 1[1] via P2P/IPC/read
ng30903:3063027:3064907 [2] NCCL INFO Channel 14/0 : 2[2] -> 3[3] via P2P/IPC/read
ng30903:3063028:3064912 [3] NCCL INFO Channel 18/0 : 3[3] -> 0[0] via P2P/IPC/read
ng30903:3063026:3064900 [1] NCCL INFO Channel 18/0 : 1[1] -> 2[2] via P2P/IPC/read
ng30903:3063027:3064907 [2] NCCL INFO Channel 18/0 : 2[2] -> 3[3] via P2P/IPC/read
ng30903:3063026:3064900 [1] NCCL INFO Channel 22/0 : 1[1] -> 2[2] via P2P/IPC/read
ng30903:3063028:3064912 [3] NCCL INFO Channel 21/0 : 3[3] -> 0[0] via P2P/IPC/read
ng30903:3063027:3064907 [2] NCCL INFO Channel 20/0 : 2[2] -> 3[3] via P2P/IPC/read
ng30903:3063026:3064900 [1] NCCL INFO Channel 01/0 : 1[1] -> 3[3] via P2P/IPC/read
ng30903:3063026:3064900 [1] NCCL INFO Channel 03/0 : 1[1] -> 3[3] via P2P/IPC/read
ng30903:3063026:3064900 [1] NCCL INFO Channel 07/0 : 1[1] -> 3[3] via P2P/IPC/read
ng30903:3063025:3064896 [0] NCCL INFO Channel 02/0 : 0[0] -> 2[2] via P2P/IPC/read
ng30903:3063028:3064912 [3] NCCL INFO Channel 02/0 : 3[3] -> 1[1] via P2P/IPC/read
ng30903:3063027:3064907 [2] NCCL INFO Channel 01/0 : 2[2] -> 0[0] via P2P/IPC/read
ng30903:3063026:3064900 [1] NCCL INFO Channel 09/0 : 1[1] -> 3[3] via P2P/IPC/read
ng30903:3063025:3064896 [0] NCCL INFO Channel 03/0 : 0[0] -> 2[2] via P2P/IPC/read
ng30903:3063028:3064912 [3] NCCL INFO Channel 04/0 : 3[3] -> 1[1] via P2P/IPC/read
ng30903:3063025:3064896 [0] NCCL INFO Channel 08/0 : 0[0] -> 2[2] via P2P/IPC/read
ng30903:3063025:3064896 [0] NCCL INFO Channel 09/0 : 0[0] -> 2[2] via P2P/IPC/read
ng30903:3063025:3064896 [0] NCCL INFO Channel 14/0 : 0[0] -> 2[2] via P2P/IPC/read
ng30903:3063027:3064907 [2] NCCL INFO Channel 04/0 : 2[2] -> 0[0] via P2P/IPC/read
ng30903:3063026:3064900 [1] NCCL INFO Channel 13/0 : 1[1] -> 3[3] via P2P/IPC/read
ng30903:3063028:3064912 [3] NCCL INFO Channel 08/0 : 3[3] -> 1[1] via P2P/IPC/read
ng30903:3063027:3064907 [2] NCCL INFO Channel 07/0 : 2[2] -> 0[0] via P2P/IPC/read
ng30903:3063026:3064900 [1] NCCL INFO Channel 15/0 : 1[1] -> 3[3] via P2P/IPC/read
ng30903:3063025:3064896 [0] NCCL INFO Channel 15/0 : 0[0] -> 2[2] via P2P/IPC/read
ng30903:3063027:3064907 [2] NCCL INFO Channel 10/0 : 2[2] -> 0[0] via P2P/IPC/read
ng30903:3063028:3064912 [3] NCCL INFO Channel 10/0 : 3[3] -> 1[1] via P2P/IPC/read
ng30903:3063026:3064900 [1] NCCL INFO Channel 19/0 : 1[1] -> 3[3] via P2P/IPC/read
ng30903:3063026:3064900 [1] NCCL INFO Channel 21/0 : 1[1] -> 3[3] via P2P/IPC/read
ng30903:3063027:3064907 [2] NCCL INFO Channel 13/0 : 2[2] -> 0[0] via P2P/IPC/read
ng30903:3063025:3064896 [0] NCCL INFO Channel 20/0 : 0[0] -> 2[2] via P2P/IPC/read
ng30903:3063028:3064912 [3] NCCL INFO Channel 14/0 : 3[3] -> 1[1] via P2P/IPC/read
ng30903:3063028:3064912 [3] NCCL INFO Channel 16/0 : 3[3] -> 1[1] via P2P/IPC/read
ng30903:3063025:3064896 [0] NCCL INFO Channel 21/0 : 0[0] -> 2[2] via P2P/IPC/read
ng30903:3063027:3064907 [2] NCCL INFO Channel 16/0 : 2[2] -> 0[0] via P2P/IPC/read
ng30903:3063028:3064912 [3] NCCL INFO Channel 20/0 : 3[3] -> 1[1] via P2P/IPC/read
ng30903:3063028:3064912 [3] NCCL INFO Channel 22/0 : 3[3] -> 1[1] via P2P/IPC/read
ng30903:3063027:3064907 [2] NCCL INFO Channel 19/0 : 2[2] -> 0[0] via P2P/IPC/read
ng30903:3063027:3064907 [2] NCCL INFO Channel 22/0 : 2[2] -> 0[0] via P2P/IPC/read
ng30903:3063028:3064912 [3] NCCL INFO Channel 01/0 : 3[3] -> 2[2] via P2P/IPC/read
ng30903:3063028:3064912 [3] NCCL INFO Channel 05/0 : 3[3] -> 2[2] via P2P/IPC/read
ng30903:3063028:3064912 [3] NCCL INFO Channel 07/0 : 3[3] -> 2[2] via P2P/IPC/read
ng30903:3063026:3064900 [1] NCCL INFO Channel 02/0 : 1[1] -> 0[0] via P2P/IPC/read
ng30903:3063028:3064912 [3] NCCL INFO Channel 11/0 : 3[3] -> 2[2] via P2P/IPC/read
ng30903:3063028:3064912 [3] NCCL INFO Channel 13/0 : 3[3] -> 2[2] via P2P/IPC/read
ng30903:3063027:3064907 [2] NCCL INFO Channel 03/0 : 2[2] -> 1[1] via P2P/IPC/read
ng30903:3063027:3064907 [2] NCCL INFO Channel 05/0 : 2[2] -> 1[1] via P2P/IPC/read
ng30903:3063026:3064900 [1] NCCL INFO Channel 05/0 : 1[1] -> 0[0] via P2P/IPC/read
ng30903:3063027:3064907 [2] NCCL INFO Channel 09/0 : 2[2] -> 1[1] via P2P/IPC/read
ng30903:3063028:3064912 [3] NCCL INFO Channel 17/0 : 3[3] -> 2[2] via P2P/IPC/read
ng30903:3063028:3064912 [3] NCCL INFO Channel 19/0 : 3[3] -> 2[2] via P2P/IPC/read
ng30903:3063028:3064912 [3] NCCL INFO Channel 23/0 : 3[3] -> 2[2] via P2P/IPC/read
ng30903:3063026:3064900 [1] NCCL INFO Channel 08/0 : 1[1] -> 0[0] via P2P/IPC/read
ng30903:3063025:3064896 [0] NCCL INFO Channel 04/0 : 0[0] -> 3[3] via P2P/IPC/read
ng30903:3063027:3064907 [2] NCCL INFO Channel 11/0 : 2[2] -> 1[1] via P2P/IPC/read
ng30903:3063025:3064896 [0] NCCL INFO Channel 05/0 : 0[0] -> 3[3] via P2P/IPC/read
ng30903:3063026:3064900 [1] NCCL INFO Channel 11/0 : 1[1] -> 0[0] via P2P/IPC/read
ng30903:3063027:3064907 [2] NCCL INFO Channel 15/0 : 2[2] -> 1[1] via P2P/IPC/read
ng30903:3063026:3064900 [1] NCCL INFO Channel 14/0 : 1[1] -> 0[0] via P2P/IPC/read
ng30903:3063026:3064900 [1] NCCL INFO Channel 17/0 : 1[1] -> 0[0] via P2P/IPC/read
ng30903:3063025:3064896 [0] NCCL INFO Channel 10/0 : 0[0] -> 3[3] via P2P/IPC/read
ng30903:3063027:3064907 [2] NCCL INFO Channel 17/0 : 2[2] -> 1[1] via P2P/IPC/read
ng30903:3063027:3064907 [2] NCCL INFO Channel 21/0 : 2[2] -> 1[1] via P2P/IPC/read
ng30903:3063025:3064896 [0] NCCL INFO Channel 11/0 : 0[0] -> 3[3] via P2P/IPC/read
ng30903:3063026:3064900 [1] NCCL INFO Channel 20/0 : 1[1] -> 0[0] via P2P/IPC/read
ng30903:3063026:3064900 [1] NCCL INFO Channel 23/0 : 1[1] -> 0[0] via P2P/IPC/read
ng30903:3063027:3064907 [2] NCCL INFO Channel 23/0 : 2[2] -> 1[1] via P2P/IPC/read
ng30903:3063025:3064896 [0] NCCL INFO Channel 16/0 : 0[0] -> 3[3] via P2P/IPC/read
ng30903:3063025:3064896 [0] NCCL INFO Channel 17/0 : 0[0] -> 3[3] via P2P/IPC/read
ng30903:3063027:3064907 [2] NCCL INFO Connected all rings
ng30903:3063025:3064896 [0] NCCL INFO Channel 22/0 : 0[0] -> 3[3] via P2P/IPC/read
ng30903:3063025:3064896 [0] NCCL INFO Channel 23/0 : 0[0] -> 3[3] via P2P/IPC/read
ng30903:3063025:3064896 [0] NCCL INFO Connected all rings
ng30903:3063026:3064900 [1] NCCL INFO Connected all rings
ng30903:3063028:3064912 [3] NCCL INFO Connected all rings
ng30903:3063025:3064896 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[1] via P2P/IPC/read
ng30903:3063025:3064896 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[1] via P2P/IPC/read
ng30903:3063025:3064896 [0] NCCL INFO Channel 08/0 : 0[0] -> 1[1] via P2P/IPC/read
ng30903:3063025:3064896 [0] NCCL INFO Channel 09/0 : 0[0] -> 1[1] via P2P/IPC/read
ng30903:3063025:3064896 [0] NCCL INFO Channel 10/0 : 0[0] -> 1[1] via P2P/IPC/read
ng30903:3063025:3064896 [0] NCCL INFO Channel 11/0 : 0[0] -> 1[1] via P2P/IPC/read
ng30903:3063027:3064907 [2] NCCL INFO Channel 01/0 : 2[2] -> 3[3] via P2P/IPC/read
ng30903:3063025:3064896 [0] NCCL INFO Channel 14/0 : 0[0] -> 1[1] via P2P/IPC/read
ng30903:3063027:3064907 [2] NCCL INFO Channel 03/0 : 2[2] -> 3[3] via P2P/IPC/read
ng30903:3063027:3064907 [2] NCCL INFO Channel 09/0 : 2[2] -> 3[3] via P2P/IPC/read
ng30903:3063025:3064896 [0] NCCL INFO Channel 15/0 : 0[0] -> 1[1] via P2P/IPC/read
ng30903:3063027:3064907 [2] NCCL INFO Channel 10/0 : 2[2] -> 3[3] via P2P/IPC/read
ng30903:3063027:3064907 [2] NCCL INFO Channel 11/0 : 2[2] -> 3[3] via P2P/IPC/read
ng30903:3063025:3064896 [0] NCCL INFO Channel 20/0 : 0[0] -> 1[1] via P2P/IPC/read
ng30903:3063027:3064907 [2] NCCL INFO Channel 13/0 : 2[2] -> 3[3] via P2P/IPC/read
ng30903:3063026:3064900 [1] NCCL INFO Channel 01/0 : 1[1] -> 2[2] via P2P/IPC/read
ng30903:3063027:3064907 [2] NCCL INFO Channel 15/0 : 2[2] -> 3[3] via P2P/IPC/read
ng30903:3063026:3064900 [1] NCCL INFO Channel 02/0 : 1[1] -> 2[2] via P2P/IPC/read
ng30903:3063025:3064896 [0] NCCL INFO Channel 21/0 : 0[0] -> 1[1] via P2P/IPC/read
ng30903:3063027:3064907 [2] NCCL INFO Channel 21/0 : 2[2] -> 3[3] via P2P/IPC/read
ng30903:3063026:3064900 [1] NCCL INFO Channel 03/0 : 1[1] -> 2[2] via P2P/IPC/read
ng30903:3063028:3064912 [3] NCCL INFO Channel 08/0 : 3[3] -> 0[0] via P2P/IPC/read
ng30903:3063028:3064912 [3] NCCL INFO Channel 10/0 : 3[3] -> 0[0] via P2P/IPC/read
ng30903:3063025:3064896 [0] NCCL INFO Channel 22/0 : 0[0] -> 1[1] via P2P/IPC/read
ng30903:3063027:3064907 [2] NCCL INFO Channel 22/0 : 2[2] -> 3[3] via P2P/IPC/read
ng30903:3063026:3064900 [1] NCCL INFO Channel 05/0 : 1[1] -> 2[2] via P2P/IPC/read
ng30903:3063025:3064896 [0] NCCL INFO Channel 23/0 : 0[0] -> 1[1] via P2P/IPC/read
ng30903:3063028:3064912 [3] NCCL INFO Channel 11/0 : 3[3] -> 0[0] via P2P/IPC/read
ng30903:3063027:3064907 [2] NCCL INFO Channel 23/0 : 2[2] -> 3[3] via P2P/IPC/read
ng30903:3063028:3064912 [3] NCCL INFO Channel 20/0 : 3[3] -> 0[0] via P2P/IPC/read
ng30903:3063026:3064900 [1] NCCL INFO Channel 07/0 : 1[1] -> 2[2] via P2P/IPC/read
ng30903:3063026:3064900 [1] NCCL INFO Channel 13/0 : 1[1] -> 2[2] via P2P/IPC/read
ng30903:3063028:3064912 [3] NCCL INFO Channel 22/0 : 3[3] -> 0[0] via P2P/IPC/read
ng30903:3063026:3064900 [1] NCCL INFO Channel 14/0 : 1[1] -> 2[2] via P2P/IPC/read
ng30903:3063026:3064900 [1] NCCL INFO Channel 15/0 : 1[1] -> 2[2] via P2P/IPC/read
ng30903:3063028:3064912 [3] NCCL INFO Channel 23/0 : 3[3] -> 0[0] via P2P/IPC/read
ng30903:3063026:3064900 [1] NCCL INFO Channel 17/0 : 1[1] -> 2[2] via P2P/IPC/read
ng30903:3063026:3064900 [1] NCCL INFO Channel 19/0 : 1[1] -> 2[2] via P2P/IPC/read
ng30903:3063028:3064912 [3] NCCL INFO Channel 05/0 : 3[3] -> 1[1] via P2P/IPC/read
ng30903:3063028:3064912 [3] NCCL INFO Channel 06/0 : 3[3] -> 1[1] via P2P/IPC/read
ng30903:3063025:3064896 [0] NCCL INFO Channel 04/0 : 0[0] -> 2[2] via P2P/IPC/read
ng30903:3063025:3064896 [0] NCCL INFO Channel 05/0 : 0[0] -> 2[2] via P2P/IPC/read
ng30903:3063025:3064896 [0] NCCL INFO Channel 06/0 : 0[0] -> 2[2] via P2P/IPC/read
ng30903:3063026:3064900 [1] NCCL INFO Channel 04/0 : 1[1] -> 3[3] via P2P/IPC/read
ng30903:3063028:3064912 [3] NCCL INFO Channel 07/0 : 3[3] -> 1[1] via P2P/IPC/read
ng30903:3063025:3064896 [0] NCCL INFO Channel 07/0 : 0[0] -> 2[2] via P2P/IPC/read
ng30903:3063025:3064896 [0] NCCL INFO Channel 16/0 : 0[0] -> 2[2] via P2P/IPC/read
ng30903:3063028:3064912 [3] NCCL INFO Channel 17/0 : 3[3] -> 1[1] via P2P/IPC/read
ng30903:3063026:3064900 [1] NCCL INFO Channel 05/0 : 1[1] -> 3[3] via P2P/IPC/read
ng30903:3063027:3064907 [2] NCCL INFO Channel 05/0 : 2[2] -> 0[0] via P2P/IPC/read
ng30903:3063028:3064912 [3] NCCL INFO Channel 18/0 : 3[3] -> 1[1] via P2P/IPC/read
ng30903:3063028:3064912 [3] NCCL INFO Channel 19/0 : 3[3] -> 1[1] via P2P/IPC/read
ng30903:3063026:3064900 [1] NCCL INFO Channel 06/0 : 1[1] -> 3[3] via P2P/IPC/read
ng30903:3063025:3064896 [0] NCCL INFO Channel 17/0 : 0[0] -> 2[2] via P2P/IPC/read
ng30903:3063027:3064907 [2] NCCL INFO Channel 06/0 : 2[2] -> 0[0] via P2P/IPC/read
ng30903:3063025:3064896 [0] NCCL INFO Channel 18/0 : 0[0] -> 2[2] via P2P/IPC/read
ng30903:3063026:3064900 [1] NCCL INFO Channel 16/0 : 1[1] -> 3[3] via P2P/IPC/read
ng30903:3063027:3064907 [2] NCCL INFO Channel 17/0 : 2[2] -> 0[0] via P2P/IPC/read
ng30903:3063025:3064896 [0] NCCL INFO Channel 19/0 : 0[0] -> 2[2] via P2P/IPC/read
ng30903:3063026:3064900 [1] NCCL INFO Channel 17/0 : 1[1] -> 3[3] via P2P/IPC/read
ng30903:3063027:3064907 [2] NCCL INFO Channel 18/0 : 2[2] -> 0[0] via P2P/IPC/read
ng30903:3063026:3064900 [1] NCCL INFO Channel 18/0 : 1[1] -> 3[3] via P2P/IPC/read
ng30903:3063028:3064912 [3] NCCL INFO Channel 00/0 : 3[3] -> 2[2] via P2P/IPC/read
ng30903:3063028:3064912 [3] NCCL INFO Channel 02/0 : 3[3] -> 2[2] via P2P/IPC/read
ng30903:3063028:3064912 [3] NCCL INFO Channel 03/0 : 3[3] -> 2[2] via P2P/IPC/read
ng30903:3063028:3064912 [3] NCCL INFO Channel 08/0 : 3[3] -> 2[2] via P2P/IPC/read
ng30903:3063028:3064912 [3] NCCL INFO Channel 09/0 : 3[3] -> 2[2] via P2P/IPC/read
ng30903:3063028:3064912 [3] NCCL INFO Channel 10/0 : 3[3] -> 2[2] via P2P/IPC/read
ng30903:3063028:3064912 [3] NCCL INFO Channel 12/0 : 3[3] -> 2[2] via P2P/IPC/read
ng30903:3063027:3064907 [2] NCCL INFO Channel 00/0 : 2[2] -> 1[1] via P2P/IPC/read
ng30903:3063027:3064907 [2] NCCL INFO Channel 01/0 : 2[2] -> 1[1] via P2P/IPC/read
ng30903:3063025:3064896 [0] NCCL INFO Channel 08/0 : 0[0] -> 3[3] via P2P/IPC/read
ng30903:3063028:3064912 [3] NCCL INFO Channel 14/0 : 3[3] -> 2[2] via P2P/IPC/read
ng30903:3063025:3064896 [0] NCCL INFO Channel 09/0 : 0[0] -> 3[3] via P2P/IPC/read
ng30903:3063025:3064896 [0] NCCL INFO Channel 20/0 : 0[0] -> 3[3] via P2P/IPC/read
ng30903:3063027:3064907 [2] NCCL INFO Channel 02/0 : 2[2] -> 1[1] via P2P/IPC/read
ng30903:3063025:3064896 [0] NCCL INFO Channel 21/0 : 0[0] -> 3[3] via P2P/IPC/read
ng30903:3063028:3064912 [3] NCCL INFO Channel 15/0 : 3[3] -> 2[2] via P2P/IPC/read
ng30903:3063027:3064907 [2] NCCL INFO Channel 04/0 : 2[2] -> 1[1] via P2P/IPC/read
ng30903:3063028:3064912 [3] NCCL INFO Channel 20/0 : 3[3] -> 2[2] via P2P/IPC/read
ng30903:3063028:3064912 [3] NCCL INFO Channel 21/0 : 3[3] -> 2[2] via P2P/IPC/read
ng30903:3063027:3064907 [2] NCCL INFO Channel 06/0 : 2[2] -> 1[1] via P2P/IPC/read
ng30903:3063026:3064900 [1] NCCL INFO Channel 00/0 : 1[1] -> 0[0] via P2P/IPC/read
ng30903:3063028:3064912 [3] NCCL INFO Channel 22/0 : 3[3] -> 2[2] via P2P/IPC/read
ng30903:3063027:3064907 [2] NCCL INFO Channel 07/0 : 2[2] -> 1[1] via P2P/IPC/read
ng30903:3063026:3064900 [1] NCCL INFO Channel 01/0 : 1[1] -> 0[0] via P2P/IPC/read
ng30903:3063027:3064907 [2] NCCL INFO Channel 12/0 : 2[2] -> 1[1] via P2P/IPC/read
ng30903:3063027:3064907 [2] NCCL INFO Channel 13/0 : 2[2] -> 1[1] via P2P/IPC/read
ng30903:3063026:3064900 [1] NCCL INFO Channel 03/0 : 1[1] -> 0[0] via P2P/IPC/read
ng30903:3063027:3064907 [2] NCCL INFO Channel 14/0 : 2[2] -> 1[1] via P2P/IPC/read
ng30903:3063026:3064900 [1] NCCL INFO Channel 09/0 : 1[1] -> 0[0] via P2P/IPC/read
ng30903:3063027:3064907 [2] NCCL INFO Channel 16/0 : 2[2] -> 1[1] via P2P/IPC/read
ng30903:3063027:3064907 [2] NCCL INFO Channel 18/0 : 2[2] -> 1[1] via P2P/IPC/read
ng30903:3063026:3064900 [1] NCCL INFO Channel 10/0 : 1[1] -> 0[0] via P2P/IPC/read
ng30903:3063026:3064900 [1] NCCL INFO Channel 12/0 : 1[1] -> 0[0] via P2P/IPC/read
ng30903:3063026:3064900 [1] NCCL INFO Channel 13/0 : 1[1] -> 0[0] via P2P/IPC/read
ng30903:3063027:3064907 [2] NCCL INFO Channel 19/0 : 2[2] -> 1[1] via P2P/IPC/read
ng30903:3063026:3064900 [1] NCCL INFO Channel 15/0 : 1[1] -> 0[0] via P2P/IPC/read
ng30903:3063026:3064900 [1] NCCL INFO Channel 21/0 : 1[1] -> 0[0] via P2P/IPC/read
ng30903:3063026:3064900 [1] NCCL INFO Channel 22/0 : 1[1] -> 0[0] via P2P/IPC/read
ng30903:3063028:3064912 [3] NCCL INFO Connected all trees
ng30903:3063028:3064912 [3] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
ng30903:3063028:3064912 [3] NCCL INFO 24 coll channels, 0 nvls channels, 32 p2p channels, 8 p2p channels per peer
ng30903:3063027:3064907 [2] NCCL INFO Connected all trees
ng30903:3063027:3064907 [2] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
ng30903:3063027:3064907 [2] NCCL INFO 24 coll channels, 0 nvls channels, 32 p2p channels, 8 p2p channels per peer
ng30903:3063026:3064900 [1] NCCL INFO Connected all trees
ng30903:3063026:3064900 [1] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
ng30903:3063026:3064900 [1] NCCL INFO 24 coll channels, 0 nvls channels, 32 p2p channels, 8 p2p channels per peer
ng30903:3063025:3064896 [0] NCCL INFO Connected all trees
ng30903:3063025:3064896 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
ng30903:3063025:3064896 [0] NCCL INFO 24 coll channels, 0 nvls channels, 32 p2p channels, 8 p2p channels per peer
ng30903:3063025:3064896 [0] NCCL INFO comm 0x55a4f74c68f0 rank 0 nranks 4 cudaDev 0 nvmlDev 0 busId 1000 commId 0x352b6d31ae50e36a - Init COMPLETE
ng30903:3063027:3064907 [2] NCCL INFO comm 0x55be98f26660 rank 2 nranks 4 cudaDev 2 nvmlDev 2 busId 81000 commId 0x352b6d31ae50e36a - Init COMPLETE
ng30903:3063026:3064900 [1] NCCL INFO comm 0x557bceb28220 rank 1 nranks 4 cudaDev 1 nvmlDev 1 busId 41000 commId 0x352b6d31ae50e36a - Init COMPLETE
ng30903:3063028:3064912 [3] NCCL INFO comm 0x55f461570670 rank 3 nranks 4 cudaDev 3 nvmlDev 3 busId c1000 commId 0x352b6d31ae50e36a - Init COMPLETE
[2024-03-25 14:48:44,972] [INFO] [partition_parameters.py:349:__exit__] finished initializing model - num_params = 291, num_elems = 6.61B

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]
Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]
Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]
Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]
Loading checkpoint shards:  50%|█████     | 1/2 [00:21<00:21, 21.13s/it]
Loading checkpoint shards:  50%|█████     | 1/2 [00:21<00:21, 21.13s/it]
Loading checkpoint shards:  50%|█████     | 1/2 [00:21<00:21, 21.13s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [00:32<00:00, 15.52s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [00:32<00:00, 16.36s/it]
Some weights of LLMForSequenceRegression were not initialized from the model checkpoint at /home/irtaza11/projects/def-zmjiang/irtaza11/RLHF/models/self_trained/codellama/CodeLlama-7b-hf/ and are newly initialized: ['value_head.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Loading checkpoint shards: 100%|██████████| 2/2 [00:32<00:00, 15.53s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [00:32<00:00, 16.37s/it]
Some weights of LLMForSequenceRegression were not initialized from the model checkpoint at /home/irtaza11/projects/def-zmjiang/irtaza11/RLHF/models/self_trained/codellama/CodeLlama-7b-hf/ and are newly initialized: ['value_head.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
INFO 03-25 14:49:17 model.py:155] initialize value_head for ZeRO-3 reward model training.

Loading checkpoint shards: 100%|██████████| 2/2 [00:32<00:00, 15.54s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [00:32<00:00, 16.38s/it]
Some weights of LLMForSequenceRegression were not initialized from the model checkpoint at /home/irtaza11/projects/def-zmjiang/irtaza11/RLHF/models/self_trained/codellama/CodeLlama-7b-hf/ and are newly initialized: ['value_head.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
INFO 03-25 14:49:17 model.py:155] initialize value_head for ZeRO-3 reward model training.
INFO 03-25 14:49:17 model.py:155] initialize value_head for ZeRO-3 reward model training.

Loading checkpoint shards:  50%|█████     | 1/2 [01:04<01:04, 64.65s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [01:26<00:00, 39.59s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [01:26<00:00, 43.35s/it]
Some weights of LLMForSequenceRegression were not initialized from the model checkpoint at /home/irtaza11/projects/def-zmjiang/irtaza11/RLHF/models/self_trained/codellama/CodeLlama-7b-hf/ and are newly initialized: ['value_head.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
INFO 03-25 14:50:11 model.py:155] initialize value_head for ZeRO-3 reward model training.
/lustre06/project/6000242/irtaza11/RLHF/openrlhf2/lib/python3.11/site-packages/deepspeed/ops/adam/fused_adam.py:96: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at /home/coulombc/wheels_builder/tmp.29658/python-3.11/torch/torch/csrc/tensor/python_tensor.cpp:83.)
  self._dummy_overflow_buf = get_accelerator().IntTensor([0])
/lustre06/project/6000242/irtaza11/RLHF/openrlhf2/lib/python3.11/site-packages/deepspeed/ops/adam/fused_adam.py:96: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at /home/coulombc/wheels_builder/tmp.29658/python-3.11/torch/torch/csrc/tensor/python_tensor.cpp:83.)
  self._dummy_overflow_buf = get_accelerator().IntTensor([0])
/lustre06/project/6000242/irtaza11/RLHF/openrlhf2/lib/python3.11/site-packages/deepspeed/ops/adam/fused_adam.py:96: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at /home/coulombc/wheels_builder/tmp.29658/python-3.11/torch/torch/csrc/tensor/python_tensor.cpp:83.)
  self._dummy_overflow_buf = get_accelerator().IntTensor([0])
LLMForSequenceRegression(
  (model): LlamaModel(
    (embed_tokens): Embedding(32016, 4096)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaFlashAttention2(
          (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (v_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (o_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=4096, out_features=11008, bias=False)
          (up_proj): Linear(in_features=4096, out_features=11008, bias=False)
          (down_proj): Linear(in_features=11008, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm()
        (post_attention_layernorm): LlamaRMSNorm()
      )
    )
    (norm): LlamaRMSNorm()
  )
  (value_head): Linear(in_features=4096, out_features=1, bias=False)
)
/lustre06/project/6000242/irtaza11/RLHF/openrlhf2/lib/python3.11/site-packages/deepspeed/ops/adam/fused_adam.py:96: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at /home/coulombc/wheels_builder/tmp.29658/python-3.11/torch/torch/csrc/tensor/python_tensor.cpp:83.)
  self._dummy_overflow_buf = get_accelerator().IntTensor([0])
dataset: /home/irtaza11/projects/def-zmjiang/irtaza11/RLHF/LIBRO/saved_to_disk/defects4j-rlmf
[Dataset({
    features: ['prompt', 'chosen', 'rejected'],
    num_rows: 2895
})]

  0%|          | 0/2895 [00:00<?, ?it/s]
 18%|█▊        | 518/2895 [00:00<00:00, 5159.79it/s]
 43%|████▎     | 1243/2895 [00:00<00:00, 6336.22it/s]
 65%|██████▍   | 1877/2895 [00:00<00:00, 6138.35it/s]
 89%|████████▉ | 2571/2895 [00:00<00:00, 6443.69it/s]
100%|██████████| 2895/2895 [00:00<00:00, 6408.96it/s]

  0%|          | 0/28 [00:00<?, ?it/s]
100%|██████████| 28/28 [00:00<00:00, 35288.62it/s]
[2024-03-25 14:50:13,113] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.13.0, git-hash=unknown, git-branch=unknown
[2024-03-25 14:50:13,113] [INFO] [comm.py:662:init_distributed] Distributed backend already initialized
[2024-03-25 14:50:13,142] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
[2024-03-25 14:50:13,150] [INFO] [logging.py:96:log_dist] [Rank 0] Using client Optimizer as basic optimizer
[2024-03-25 14:50:13,150] [INFO] [logging.py:96:log_dist] [Rank 0] Removing param_group that has no 'params' in the basic Optimizer
[2024-03-25 14:50:13,157] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Basic Optimizer = FusedAdam
[2024-03-25 14:50:13,157] [INFO] [utils.py:56:is_zero_supported_optimizer] Checking ZeRO support for optimizer=FusedAdam type=<class 'deepspeed.ops.adam.fused_adam.FusedAdam'>
[2024-03-25 14:50:13,157] [INFO] [logging.py:96:log_dist] [Rank 0] Creating fp16 ZeRO stage 3 optimizer, MiCS is enabled False, Hierarchical params gather False
[2024-03-25 14:50:13,157] [INFO] [logging.py:96:log_dist] [Rank 0] Creating torch.bfloat16 ZeRO stage 3 optimizer
[2024-03-25 14:50:13,533] [INFO] [utils.py:791:see_memory_usage] Stage 3 initialize beginning
[2024-03-25 14:50:13,534] [INFO] [utils.py:792:see_memory_usage] MA 3.59 GB         Max_MA 4.08 GB         CA 4.28 GB         Max_CA 4 GB 
[2024-03-25 14:50:13,534] [INFO] [utils.py:799:see_memory_usage] CPU Virtual Memory:  used = 27.91 GB, percent = 5.5%
[2024-03-25 14:50:13,535] [INFO] [stage3.py:128:__init__] Reduce bucket size 500,000,000
[2024-03-25 14:50:13,535] [INFO] [stage3.py:129:__init__] Prefetch bucket size 50,000,000
[2024-03-25 14:50:13,847] [INFO] [utils.py:791:see_memory_usage] DeepSpeedZeRoOffload initialize [begin]
[2024-03-25 14:50:13,847] [INFO] [utils.py:792:see_memory_usage] MA 3.59 GB         Max_MA 3.59 GB         CA 4.28 GB         Max_CA 4 GB 
[2024-03-25 14:50:13,847] [INFO] [utils.py:799:see_memory_usage] CPU Virtual Memory:  used = 27.91 GB, percent = 5.5%
Parameter Offload: Total persistent parameters: 270336 in 66 params
[2024-03-25 14:50:14,254] [INFO] [utils.py:791:see_memory_usage] DeepSpeedZeRoOffload initialize [end]
[2024-03-25 14:50:14,255] [INFO] [utils.py:792:see_memory_usage] MA 3.59 GB         Max_MA 3.59 GB         CA 4.28 GB         Max_CA 4 GB 
[2024-03-25 14:50:14,255] [INFO] [utils.py:799:see_memory_usage] CPU Virtual Memory:  used = 27.91 GB, percent = 5.5%
[2024-03-25 14:50:14,529] [INFO] [utils.py:791:see_memory_usage] Before creating fp16 partitions
[2024-03-25 14:50:14,529] [INFO] [utils.py:792:see_memory_usage] MA 3.59 GB         Max_MA 3.59 GB         CA 4.28 GB         Max_CA 4 GB 
[2024-03-25 14:50:14,529] [INFO] [utils.py:799:see_memory_usage] CPU Virtual Memory:  used = 27.91 GB, percent = 5.5%
[2024-03-25 14:50:18,259] [INFO] [utils.py:791:see_memory_usage] After creating fp16 partitions: 3
[2024-03-25 14:50:18,260] [INFO] [utils.py:792:see_memory_usage] MA 3.58 GB         Max_MA 3.59 GB         CA 6.1 GB         Max_CA 6 GB 
[2024-03-25 14:50:18,260] [INFO] [utils.py:799:see_memory_usage] CPU Virtual Memory:  used = 34.07 GB, percent = 6.8%
[2024-03-25 14:50:18,637] [INFO] [utils.py:791:see_memory_usage] Before creating fp32 partitions
[2024-03-25 14:50:18,637] [INFO] [utils.py:792:see_memory_usage] MA 3.58 GB         Max_MA 3.58 GB         CA 6.1 GB         Max_CA 6 GB 
[2024-03-25 14:50:18,638] [INFO] [utils.py:799:see_memory_usage] CPU Virtual Memory:  used = 27.9 GB, percent = 5.5%
[2024-03-25 14:50:18,983] [INFO] [utils.py:791:see_memory_usage] After creating fp32 partitions
[2024-03-25 14:50:18,983] [INFO] [utils.py:792:see_memory_usage] MA 9.73 GB         Max_MA 10.94 GB         CA 14.12 GB         Max_CA 14 GB 
[2024-03-25 14:50:18,983] [INFO] [utils.py:799:see_memory_usage] CPU Virtual Memory:  used = 27.9 GB, percent = 5.5%
[2024-03-25 14:50:19,079] [INFO] [utils.py:791:see_memory_usage] Before initializing optimizer states
[2024-03-25 14:50:19,080] [INFO] [utils.py:792:see_memory_usage] MA 9.73 GB         Max_MA 9.73 GB         CA 14.12 GB         Max_CA 14 GB 
[2024-03-25 14:50:19,080] [INFO] [utils.py:799:see_memory_usage] CPU Virtual Memory:  used = 27.9 GB, percent = 5.5%
[2024-03-25 14:50:19,512] [INFO] [utils.py:791:see_memory_usage] After initializing optimizer states
[2024-03-25 14:50:19,513] [INFO] [utils.py:792:see_memory_usage] MA 22.04 GB         Max_MA 25.78 GB         CA 30.17 GB         Max_CA 30 GB 
[2024-03-25 14:50:19,513] [INFO] [utils.py:799:see_memory_usage] CPU Virtual Memory:  used = 27.9 GB, percent = 5.5%
[2024-03-25 14:50:19,513] [INFO] [stage3.py:482:_setup_for_real_optimizer] optimizer state initialized
[2024-03-25 14:50:19,931] [INFO] [utils.py:791:see_memory_usage] After initializing ZeRO optimizer
[2024-03-25 14:50:19,932] [INFO] [utils.py:792:see_memory_usage] MA 29.13 GB         Max_MA 29.61 GB         CA 36.32 GB         Max_CA 36 GB 
[2024-03-25 14:50:19,932] [INFO] [utils.py:799:see_memory_usage] CPU Virtual Memory:  used = 27.94 GB, percent = 5.5%
[2024-03-25 14:50:19,932] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Final Optimizer = FusedAdam
[2024-03-25 14:50:19,932] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed using client LR scheduler
[2024-03-25 14:50:19,932] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed LR Scheduler = <torch.optim.lr_scheduler.LambdaLR object at 0x1491682fddd0>
[2024-03-25 14:50:19,932] [INFO] [logging.py:96:log_dist] [Rank 0] step=0, skipped=0, lr=[0.0, 0.0], mom=[(0.9, 0.95), (0.9, 0.95)]
[2024-03-25 14:50:19,933] [INFO] [config.py:984:print] DeepSpeedEngine configuration:
[2024-03-25 14:50:19,933] [INFO] [config.py:988:print]   activation_checkpointing_config  {
    "partition_activations": false, 
    "contiguous_memory_optimization": false, 
    "cpu_checkpointing": false, 
    "number_checkpoints": null, 
    "synchronize_checkpoint_boundary": false, 
    "profile": false
}
[2024-03-25 14:50:19,933] [INFO] [config.py:988:print]   aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2024-03-25 14:50:19,933] [INFO] [config.py:988:print]   amp_enabled .................. False
[2024-03-25 14:50:19,933] [INFO] [config.py:988:print]   amp_params ................... False
[2024-03-25 14:50:19,933] [INFO] [config.py:988:print]   autotuning_config ............ {
    "enabled": false, 
    "start_step": null, 
    "end_step": null, 
    "metric_path": null, 
    "arg_mappings": null, 
    "metric": "throughput", 
    "model_info": null, 
    "results_dir": "autotuning_results", 
    "exps_dir": "autotuning_exps", 
    "overwrite": true, 
    "fast": true, 
    "start_profile_step": 3, 
    "end_profile_step": 5, 
    "tuner_type": "gridsearch", 
    "tuner_early_stopping": 5, 
    "tuner_num_trials": 50, 
    "model_info_path": null, 
    "mp_size": 1, 
    "max_train_batch_size": null, 
    "min_train_batch_size": 1, 
    "max_train_micro_batch_size_per_gpu": 1.024000e+03, 
    "min_train_micro_batch_size_per_gpu": 1, 
    "num_tuning_micro_batch_sizes": 3
}
[2024-03-25 14:50:19,933] [INFO] [config.py:988:print]   bfloat16_enabled ............. True
[2024-03-25 14:50:19,933] [INFO] [config.py:988:print]   checkpoint_parallel_write_pipeline  False
[2024-03-25 14:50:19,933] [INFO] [config.py:988:print]   checkpoint_tag_validation_enabled  True
[2024-03-25 14:50:19,933] [INFO] [config.py:988:print]   checkpoint_tag_validation_fail  False
[2024-03-25 14:50:19,933] [INFO] [config.py:988:print]   comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x14919626e110>
[2024-03-25 14:50:19,934] [INFO] [config.py:988:print]   communication_data_type ...... None
[2024-03-25 14:50:19,934] [INFO] [config.py:988:print]   compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
[2024-03-25 14:50:19,934] [INFO] [config.py:988:print]   curriculum_enabled_legacy .... False
[2024-03-25 14:50:19,934] [INFO] [config.py:988:print]   curriculum_params_legacy ..... False
[2024-03-25 14:50:19,934] [INFO] [config.py:988:print]   data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
[2024-03-25 14:50:19,934] [INFO] [config.py:988:print]   data_efficiency_enabled ...... False
[2024-03-25 14:50:19,934] [INFO] [config.py:988:print]   dataloader_drop_last ......... False
[2024-03-25 14:50:19,934] [INFO] [config.py:988:print]   disable_allgather ............ False
[2024-03-25 14:50:19,934] [INFO] [config.py:988:print]   dump_state ................... False
[2024-03-25 14:50:19,934] [INFO] [config.py:988:print]   dynamic_loss_scale_args ...... None
[2024-03-25 14:50:19,934] [INFO] [config.py:988:print]   eigenvalue_enabled ........... False
[2024-03-25 14:50:19,934] [INFO] [config.py:988:print]   eigenvalue_gas_boundary_resolution  1
[2024-03-25 14:50:19,934] [INFO] [config.py:988:print]   eigenvalue_layer_name ........ bert.encoder.layer
[2024-03-25 14:50:19,934] [INFO] [config.py:988:print]   eigenvalue_layer_num ......... 0
[2024-03-25 14:50:19,934] [INFO] [config.py:988:print]   eigenvalue_max_iter .......... 100
[2024-03-25 14:50:19,934] [INFO] [config.py:988:print]   eigenvalue_stability ......... 1e-06
[2024-03-25 14:50:19,934] [INFO] [config.py:988:print]   eigenvalue_tol ............... 0.01
[2024-03-25 14:50:19,934] [INFO] [config.py:988:print]   eigenvalue_verbose ........... False
[2024-03-25 14:50:19,934] [INFO] [config.py:988:print]   elasticity_enabled ........... False
[2024-03-25 14:50:19,934] [INFO] [config.py:988:print]   flops_profiler_config ........ {
    "enabled": false, 
    "recompute_fwd_factor": 0.0, 
    "profile_step": 1, 
    "module_depth": -1, 
    "top_modules": 1, 
    "detailed": true, 
    "output_file": null
}
[2024-03-25 14:50:19,934] [INFO] [config.py:988:print]   fp16_auto_cast ............... None
[2024-03-25 14:50:19,934] [INFO] [config.py:988:print]   fp16_enabled ................. False
[2024-03-25 14:50:19,934] [INFO] [config.py:988:print]   fp16_master_weights_and_gradients  False
[2024-03-25 14:50:19,934] [INFO] [config.py:988:print]   global_rank .................. 0
[2024-03-25 14:50:19,934] [INFO] [config.py:988:print]   grad_accum_dtype ............. fp32
[2024-03-25 14:50:19,934] [INFO] [config.py:988:print]   gradient_accumulation_steps .. 32
[2024-03-25 14:50:19,934] [INFO] [config.py:988:print]   gradient_clipping ............ 1.0
[2024-03-25 14:50:19,934] [INFO] [config.py:988:print]   gradient_predivide_factor .... 1.0
[2024-03-25 14:50:19,934] [INFO] [config.py:988:print]   graph_harvesting ............. False
[2024-03-25 14:50:19,934] [INFO] [config.py:988:print]   hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
[2024-03-25 14:50:19,934] [INFO] [config.py:988:print]   initial_dynamic_scale ........ 1
[2024-03-25 14:50:19,934] [INFO] [config.py:988:print]   load_universal_checkpoint .... False
[2024-03-25 14:50:19,935] [INFO] [config.py:988:print]   loss_scale ................... 1.0
[2024-03-25 14:50:19,935] [INFO] [config.py:988:print]   memory_breakdown ............. False
[2024-03-25 14:50:19,935] [INFO] [config.py:988:print]   mics_hierarchial_params_gather  False
[2024-03-25 14:50:19,935] [INFO] [config.py:988:print]   mics_shard_size .............. -1
[2024-03-25 14:50:19,935] [INFO] [config.py:988:print]   monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False
[2024-03-25 14:50:19,935] [INFO] [config.py:988:print]   nebula_config ................ {
    "enabled": false, 
    "persistent_storage_path": null, 
    "persistent_time_interval": 100, 
    "num_of_version_in_retention": 2, 
    "enable_nebula_load": true, 
    "load_path": null
}
[2024-03-25 14:50:19,935] [INFO] [config.py:988:print]   optimizer_legacy_fusion ...... False
[2024-03-25 14:50:19,935] [INFO] [config.py:988:print]   optimizer_name ............... None
[2024-03-25 14:50:19,935] [INFO] [config.py:988:print]   optimizer_params ............. None
[2024-03-25 14:50:19,935] [INFO] [config.py:988:print]   pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0, 'pipe_partitioned': True, 'grad_partitioned': True}
[2024-03-25 14:50:19,935] [INFO] [config.py:988:print]   pld_enabled .................. False
[2024-03-25 14:50:19,935] [INFO] [config.py:988:print]   pld_params ................... False
[2024-03-25 14:50:19,935] [INFO] [config.py:988:print]   prescale_gradients ........... False
[2024-03-25 14:50:19,935] [INFO] [config.py:988:print]   scheduler_name ............... None
[2024-03-25 14:50:19,935] [INFO] [config.py:988:print]   scheduler_params ............. None
[2024-03-25 14:50:19,935] [INFO] [config.py:988:print]   seq_parallel_communication_data_type  torch.float32
[2024-03-25 14:50:19,935] [INFO] [config.py:988:print]   sparse_attention ............. None
[2024-03-25 14:50:19,935] [INFO] [config.py:988:print]   sparse_gradients_enabled ..... False
[2024-03-25 14:50:19,935] [INFO] [config.py:988:print]   steps_per_print .............. 100
[2024-03-25 14:50:19,935] [INFO] [config.py:988:print]   train_batch_size ............. 128
[2024-03-25 14:50:19,935] [INFO] [config.py:988:print]   train_micro_batch_size_per_gpu  1
[2024-03-25 14:50:19,935] [INFO] [config.py:988:print]   use_data_before_expert_parallel_  False
[2024-03-25 14:50:19,935] [INFO] [config.py:988:print]   use_node_local_storage ....... False
[2024-03-25 14:50:19,935] [INFO] [config.py:988:print]   wall_clock_breakdown ......... False
[2024-03-25 14:50:19,935] [INFO] [config.py:988:print]   weight_quantization_config ... None
[2024-03-25 14:50:19,935] [INFO] [config.py:988:print]   world_size ................... 4
[2024-03-25 14:50:19,935] [INFO] [config.py:988:print]   zero_allow_untested_optimizer  False
[2024-03-25 14:50:19,935] [INFO] [config.py:988:print]   zero_config .................. stage=3 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=500,000,000 use_multi_rank_bucket_allreduce=True allgather_partitions=True allgather_bucket_size=500,000,000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=DeepSpeedZeroOffloadParamConfig(device='none', nvme_path=None, buffer_count=5, buffer_size=100,000,000, max_in_cpu=1,000,000,000, pin_memory=False) offload_optimizer=DeepSpeedZeroOffloadOptimizerConfig(device='none', nvme_path=None, buffer_count=4, pin_memory=True, pipeline=False, pipeline_read=False, pipeline_write=False, fast_init=False, ratio=1.0) sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50,000,000 param_persistence_threshold=100,000 model_persistence_threshold=sys.maxsize max_live_parameters=1,000,000,000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True pipeline_loading_checkpoint=False override_module_apply=True
[2024-03-25 14:50:19,935] [INFO] [config.py:988:print]   zero_enabled ................. True
[2024-03-25 14:50:19,935] [INFO] [config.py:988:print]   zero_force_ds_cpu_optimizer .. True
[2024-03-25 14:50:19,935] [INFO] [config.py:988:print]   zero_optimization_stage ...... 3
[2024-03-25 14:50:19,936] [INFO] [config.py:974:print_user_config]   json = {
    "steps_per_print": 100, 
    "zero_optimization": {
        "stage": 3, 
        "offload_param": {
            "device": "none"
        }, 
        "offload_optimizer": {
            "device": "none", 
            "pin_memory": true
        }, 
        "sub_group_size": "auto", 
        "stage3_max_live_parameters": "auto", 
        "stage3_max_reuse_distance": "auto", 
        "stage3_param_persistence_threshold": "auto", 
        "stage3_prefetch_bucket_size": "auto", 
        "reduce_bucket_size": "auto", 
        "zero_hpz_partition_size": 1, 
        "zero_quantized_weights": false, 
        "zero_quantized_gradients": false
    }, 
    "bf16": {
        "enabled": true
    }, 
    "gradient_clipping": 1.0, 
    "prescale_gradients": false, 
    "wall_clock_breakdown": false, 
    "data_types": {
        "grad_accum_dtype": "fp32"
    }, 
    "train_micro_batch_size_per_gpu": 1, 
    "train_batch_size": 128
}
LogSigmoid Loss

Train epoch:   0%|          | 0/3 [00:00<?, ?it/s]

Train step of epoch 0:   0%|          | 0/723 [00:00<?, ?it/s]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.
ng30903:3063026:3066050 [1] NCCL INFO Using network IB
ng30903:3063027:3066051 [2] NCCL INFO Using network IB
ng30903:3063028:3066049 [3] NCCL INFO Using network IB
ng30903:3063025:3066048 [0] NCCL INFO Using network IB
ng30903:3063025:3066048 [0] NCCL INFO bootstrapSplit: rank 0 nranks 4 color 116666945 key 0 prev 3 next 1 - DONE
ng30903:3063025:3066048 [0] NCCL INFO comm 0x149160c414d0 rank 0 nranks 4 cudaDev 0 nvmlDev 0 busId 1000 commId 0x3c0cd08e83c8e167 - Init START
ng30903:3063026:3066050 [1] NCCL INFO bootstrapSplit: rank 1 nranks 4 color 116666945 key 1 prev 0 next 2 - DONE
ng30903:3063026:3066050 [1] NCCL INFO comm 0x147f64ca0fa0 rank 1 nranks 4 cudaDev 1 nvmlDev 1 busId 41000 commId 0x3c0cd08e83c8e167 - Init START
ng30903:3063027:3066051 [2] NCCL INFO bootstrapSplit: rank 2 nranks 4 color 116666945 key 2 prev 1 next 3 - DONE
ng30903:3063027:3066051 [2] NCCL INFO comm 0x148592513060 rank 2 nranks 4 cudaDev 2 nvmlDev 2 busId 81000 commId 0x3c0cd08e83c8e167 - Init START
ng30903:3063028:3066049 [3] NCCL INFO bootstrapSplit: rank 3 nranks 4 color 116666945 key 3 prev 2 next 0 - DONE
ng30903:3063028:3066049 [3] NCCL INFO comm 0x146392b329a0 rank 3 nranks 4 cudaDev 3 nvmlDev 3 busId c1000 commId 0x3c0cd08e83c8e167 - Init START
ng30903:3063027:3066051 [2] NCCL INFO NVLS multicast support is not available on dev 2
ng30903:3063025:3066048 [0] NCCL INFO NVLS multicast support is not available on dev 0
ng30903:3063028:3066049 [3] NCCL INFO NVLS multicast support is not available on dev 3
ng30903:3063026:3066050 [1] NCCL INFO NVLS multicast support is not available on dev 1
ng30903:3063026:3066050 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0 [2] 2/-1/-1->1->0 [3] 2/-1/-1->1->0 [4] 3/-1/-1->1->2 [5] 3/-1/-1->1->2 [6] 3/-1/-1->1->2 [7] 3/-1/-1->1->2 [8] 0/-1/-1->1->-1 [9] 0/-1/-1->1->-1 [10] 0/-1/-1->1->-1 [11] 0/-1/-1->1->-1 [12] 2/-1/-1->1->0 [13] 2/-1/-1->1->0 [14] 2/-1/-1->1->0 [15] 2/-1/-1->1->0 [16] 3/-1/-1->1->2 [17] 3/-1/-1->1->2 [18] 3/-1/-1->1->2 [19] 3/-1/-1->1->2 [20] 0/-1/-1->1->-1 [21] 0/-1/-1->1->-1 [22] 0/-1/-1->1->-1 [23] 0/-1/-1->1->-1
ng30903:3063026:3066050 [1] NCCL INFO P2P Chunksize set to 524288
ng30903:3063028:3066049 [3] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] -1/-1/-1->3->2 [2] -1/-1/-1->3->2 [3] -1/-1/-1->3->2 [4] -1/-1/-1->3->1 [5] -1/-1/-1->3->1 [6] -1/-1/-1->3->1 [7] -1/-1/-1->3->1 [8] 2/-1/-1->3->0 [9] 2/-1/-1->3->0 [10] 2/-1/-1->3->0 [11] 2/-1/-1->3->0 [12] -1/-1/-1->3->2 [13] -1/-1/-1->3->2 [14] -1/-1/-1->3->2 [15] -1/-1/-1->3->2 [16] -1/-1/-1->3->1 [17] -1/-1/-1->3->1 [18] -1/-1/-1->3->1 [19] -1/-1/-1->3->1 [20] 2/-1/-1->3->0 [21] 2/-1/-1->3->0 [22] 2/-1/-1->3->0 [23] 2/-1/-1->3->0
ng30903:3063028:3066049 [3] NCCL INFO P2P Chunksize set to 524288
ng30903:3063027:3066051 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1 [2] 3/-1/-1->2->1 [3] 3/-1/-1->2->1 [4] 1/-1/-1->2->0 [5] 1/-1/-1->2->0 [6] 1/-1/-1->2->0 [7] 1/-1/-1->2->0 [8] -1/-1/-1->2->3 [9] -1/-1/-1->2->3 [10] -1/-1/-1->2->3 [11] -1/-1/-1->2->3 [12] 3/-1/-1->2->1 [13] 3/-1/-1->2->1 [14] 3/-1/-1->2->1 [15] 3/-1/-1->2->1 [16] 1/-1/-1->2->0 [17] 1/-1/-1->2->0 [18] 1/-1/-1->2->0 [19] 1/-1/-1->2->0 [20] -1/-1/-1->2->3 [21] -1/-1/-1->2->3 [22] -1/-1/-1->2->3 [23] -1/-1/-1->2->3
ng30903:3063027:3066051 [2] NCCL INFO P2P Chunksize set to 524288
ng30903:3063025:3066048 [0] NCCL INFO Channel 00/24 :    0   1   2   3
ng30903:3063025:3066048 [0] NCCL INFO Channel 01/24 :    0   1   3   2
ng30903:3063025:3066048 [0] NCCL INFO Channel 02/24 :    0   2   3   1
ng30903:3063025:3066048 [0] NCCL INFO Channel 03/24 :    0   2   1   3
ng30903:3063025:3066048 [0] NCCL INFO Channel 04/24 :    0   3   1   2
ng30903:3063025:3066048 [0] NCCL INFO Channel 05/24 :    0   3   2   1
ng30903:3063025:3066048 [0] NCCL INFO Channel 06/24 :    0   1   2   3
ng30903:3063025:3066048 [0] NCCL INFO Channel 07/24 :    0   1   3   2
ng30903:3063025:3066048 [0] NCCL INFO Channel 08/24 :    0   2   3   1
ng30903:3063025:3066048 [0] NCCL INFO Channel 09/24 :    0   2   1   3
ng30903:3063025:3066048 [0] NCCL INFO Channel 10/24 :    0   3   1   2
ng30903:3063025:3066048 [0] NCCL INFO Channel 11/24 :    0   3   2   1
ng30903:3063025:3066048 [0] NCCL INFO Channel 12/24 :    0   1   2   3
ng30903:3063025:3066048 [0] NCCL INFO Channel 13/24 :    0   1   3   2
ng30903:3063025:3066048 [0] NCCL INFO Channel 14/24 :    0   2   3   1
ng30903:3063025:3066048 [0] NCCL INFO Channel 15/24 :    0   2   1   3
ng30903:3063025:3066048 [0] NCCL INFO Channel 16/24 :    0   3   1   2
ng30903:3063025:3066048 [0] NCCL INFO Channel 17/24 :    0   3   2   1
ng30903:3063025:3066048 [0] NCCL INFO Channel 18/24 :    0   1   2   3
ng30903:3063025:3066048 [0] NCCL INFO Channel 19/24 :    0   1   3   2
ng30903:3063025:3066048 [0] NCCL INFO Channel 20/24 :    0   2   3   1
ng30903:3063025:3066048 [0] NCCL INFO Channel 21/24 :    0   2   1   3
ng30903:3063025:3066048 [0] NCCL INFO Channel 22/24 :    0   3   1   2
ng30903:3063025:3066048 [0] NCCL INFO Channel 23/24 :    0   3   2   1
ng30903:3063025:3066048 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 2/-1/-1->0->-1 [5] 2/-1/-1->0->-1 [6] 2/-1/-1->0->-1 [7] 2/-1/-1->0->-1 [8] 3/-1/-1->0->1 [9] 3/-1/-1->0->1 [10] 3/-1/-1->0->1 [11] 3/-1/-1->0->1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1 [16] 2/-1/-1->0->-1 [17] 2/-1/-1->0->-1 [18] 2/-1/-1->0->-1 [19] 2/-1/-1->0->-1 [20] 3/-1/-1->0->1 [21] 3/-1/-1->0->1 [22] 3/-1/-1->0->1 [23] 3/-1/-1->0->1
ng30903:3063025:3066048 [0] NCCL INFO P2P Chunksize set to 524288
ng30903:3063026:3066050 [1] NCCL INFO Channel 00/0 : 1[1] -> 2[2] via P2P/IPC/read
ng30903:3063026:3066050 [1] NCCL INFO Channel 04/0 : 1[1] -> 2[2] via P2P/IPC/read

ng30903:3063025:3066154 [0] include/alloc.h:178 NCCL WARN Cuda failure 'out of memory'

ng30903:3063025:3066154 [0] include/alloc.h:185 NCCL WARN Failed to CUDA calloc 6291456 bytes
ng30903:3063025:3066154 [0] NCCL INFO transport/p2p.cc:204 -> 1
ng30903:3063025:3066154 [0] NCCL INFO transport/p2p.cc:605 -> 1
ng30903:3063025:3066154 [0] NCCL INFO proxy.cc:1303 -> 1
ng30903:3063025:3066154 [0] NCCL INFO proxy.cc:1377 -> 1

ng30903:3063025:3066154 [0] proxy.cc:1519 NCCL WARN [Proxy Service 0] Failed to execute operation Setup from rank 0, retcode 1
ng30903:3063027:3066051 [2] NCCL INFO Channel 00/0 : 2[2] -> 3[3] via P2P/IPC/read

ng30903:3063025:3066048 [0] misc/socket.cc:49 NCCL WARN socketProgress: Connection closed by remote peer ng30903-ib.narval.calcul.quebec<51159>
ng30903:3063025:3066048 [0] NCCL INFO misc/socket.cc:749 -> 6

ng30903:3063025:3066048 [0] proxy.cc:1143 NCCL WARN Socket recv failed while polling for opId=0x55a51966cc80
ng30903:3063025:3066048 [0] NCCL INFO transport/p2p.cc:438 -> 3
ng30903:3063025:3066048 [0] NCCL INFO transport.cc:33 -> 3
ng30903:3063025:3066048 [0] NCCL INFO transport.cc:97 -> 3
ng30903:3063025:3066048 [0] NCCL INFO init.cc:1079 -> 3
ng30903:3063025:3066048 [0] NCCL INFO init.cc:1358 -> 3
ng30903:3063025:3066048 [0] NCCL INFO group.cc:65 -> 3 [Async thread]
ng30903:3063025:3066031 [0] NCCL INFO group.cc:406 -> 3
ng30903:3063025:3066031 [0] NCCL INFO group.cc:96 -> 3

ng30903:3063028:3066152 [3] include/alloc.h:178 NCCL WARN Cuda failure 'out of memory'

ng30903:3063028:3066152 [3] include/alloc.h:185 NCCL WARN Failed to CUDA calloc 6291456 bytes
ng30903:3063028:3066152 [3] NCCL INFO transport/p2p.cc:204 -> 1
ng30903:3063028:3066152 [3] NCCL INFO transport/p2p.cc:605 -> 1
ng30903:3063028:3066152 [3] NCCL INFO proxy.cc:1303 -> 1
ng30903:3063028:3066152 [3] NCCL INFO proxy.cc:1377 -> 1

ng30903:3063028:3066152 [3] proxy.cc:1519 NCCL WARN [Proxy Service 3] Failed to execute operation Setup from rank 3, retcode 1
ng30903:3063026:3066050 [1] NCCL INFO Channel 06/0 : 1[1] -> 2[2] via P2P/IPC/read
ng30903:3063027:3066051 [2] NCCL INFO Channel 02/0 : 2[2] -> 3[3] via P2P/IPC/read

ng30903:3063028:3066049 [3] misc/socket.cc:49 NCCL WARN socketProgress: Connection closed by remote peer ng30903-ib.narval.calcul.quebec<54683>
ng30903:3063028:3066049 [3] NCCL INFO misc/socket.cc:749 -> 6

ng30903:3063028:3066049 [3] proxy.cc:1143 NCCL WARN Socket recv failed while polling for opId=0x14638c188590
ng30903:3063028:3066049 [3] NCCL INFO transport/p2p.cc:438 -> 3
ng30903:3063028:3066049 [3] NCCL INFO transport.cc:33 -> 3
ng30903:3063028:3066049 [3] NCCL INFO transport.cc:97 -> 3
ng30903:3063028:3066049 [3] NCCL INFO init.cc:1079 -> 3
ng30903:3063028:3066049 [3] NCCL INFO init.cc:1358 -> 3
ng30903:3063028:3066049 [3] NCCL INFO group.cc:65 -> 3 [Async thread]
ng30903:3063028:3066042 [3] NCCL INFO group.cc:406 -> 3
ng30903:3063028:3066042 [3] NCCL INFO group.cc:96 -> 3
ng30903:3063026:3066050 [1] NCCL INFO Channel 10/0 : 1[1] -> 2[2] via P2P/IPC/read
ng30903:3063027:3066051 [2] NCCL INFO Channel 06/0 : 2[2] -> 3[3] via P2P/IPC/read
ng30903:3063026:3066050 [1] NCCL INFO Channel 12/0 : 1[1] -> 2[2] via P2P/IPC/read
ng30903:3063027:3066051 [2] NCCL INFO Channel 08/0 : 2[2] -> 3[3] via P2P/IPC/read
ng30903:3063026:3066050 [1] NCCL INFO Channel 16/0 : 1[1] -> 2[2] via P2P/IPC/read
ng30903:3063027:3066051 [2] NCCL INFO Channel 12/0 : 2[2] -> 3[3] via P2P/IPC/read
ng30903:3063027:3066051 [2] NCCL INFO Channel 14/0 : 2[2] -> 3[3] via P2P/IPC/read
ng30903:3063027:3066051 [2] NCCL INFO Channel 18/0 : 2[2] -> 3[3] via P2P/IPC/read
ng30903:3063026:3066050 [1] NCCL INFO Channel 18/0 : 1[1] -> 2[2] via P2P/IPC/read
ng30903:3063027:3066051 [2] NCCL INFO Channel 20/0 : 2[2] -> 3[3] via P2P/IPC/read
ng30903:3063026:3066050 [1] NCCL INFO Channel 22/0 : 1[1] -> 2[2] via P2P/IPC/read
AddyLaddy commented 5 months ago

This looks to be the culprit:

ng30903:3063025:3066154 [0] include/alloc.h:178 NCCL WARN Cuda failure 'out of memory'
ng30903:3063025:3066154 [0] include/alloc.h:185 NCCL WARN Failed to CUDA calloc 6291456 bytes
ng30903:3063025:3066154 [0] NCCL INFO transport/p2p.cc:204 -> 1
ng30903:3063025:3066154 [0] NCCL INFO transport/p2p.cc:605 -> 1
ng30903:3063025:3066154 [0] NCCL INFO proxy.cc:1303 -> 1
ng30903:3063025:3066154 [0] NCCL INFO proxy.cc:1377 -> 1
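
The failed allocation is only 6291456 bytes (6 MiB) of P2P transport buffer, so the GPU was already essentially full when NCCL tried to set up its transports: the ZeRO-3 log above reports MA 29.13 GB / CA 36.32 GB right after optimizer initialization, before the first backward pass even runs. A hedged diagnostic sketch (not from this thread; the helper name and call site are assumptions) to confirm how little headroom each rank has left just before the step that triggers NCCL transport setup:

# Hedged diagnostic sketch (not from the thread): print per-rank free GPU memory
# right before the backward pass that initializes the NCCL P2P transports.
import torch
import torch.distributed as dist

def log_gpu_headroom(tag: str) -> None:
    if not (dist.is_available() and dist.is_initialized()):
        return
    device = torch.cuda.current_device()
    free_b, total_b = torch.cuda.mem_get_info(device)   # (free, total) in bytes
    reserved_b = torch.cuda.memory_reserved(device)     # held by the PyTorch caching allocator
    print(
        f"[rank {dist.get_rank()}] {tag}: "
        f"free={free_b / 2**30:.2f} GiB / total={total_b / 2**30:.2f} GiB, "
        f"torch reserved={reserved_b / 2**30:.2f} GiB",
        flush=True,
    )

# e.g. call log_gpu_headroom("before first backward") right before strategy.backward(...)

If free memory is already down to a few hundred MB per GPU at that point, any per-GPU reduction (smaller micro-batch, more offload, smaller buckets) should clear the NCCL calloc failure.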
sjeaugey commented 5 months ago

You should first reduce the amount of memory your application uses so that more is left for NCCL, then see how it goes. After that you could also reduce the amount of memory NCCL itself uses (which may reduce NCCL performance), but I'd advise getting the memory issue out of the picture first.
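
A hedged sketch of both halves of that advice. The environment variables are real NCCL knobs (NCCL_BUFFSIZE, NCCL_MAX_NCHANNELS), but the specific values below, and the explicit DeepSpeed bucket sizes replacing "auto", are illustrative assumptions rather than tuned recommendations:

# Hedged sketch, not from the thread: values are illustrative assumptions.
import os

# (1) Leave more GPU memory for NCCL by shrinking the ZeRO-3 working buffers that
#     "auto" currently sizes (the log above shows reduce bucket 500,000,000 and
#     prefetch bucket 50,000,000). Smaller buckets mean more, smaller collectives
#     but a lower peak memory footprint.
ds_zero3_overrides = {
    "zero_optimization": {
        "stage": 3,
        "reduce_bucket_size": 50_000_000,
        "stage3_prefetch_bucket_size": 5_000_000,
        "stage3_max_live_parameters": 300_000_000,
        "stage3_max_reuse_distance": 300_000_000,
    }
}

# (2) Optionally shrink NCCL's own buffers. These must be set on every rank before
#     the first communicator is created (i.e. before torch.distributed init), and
#     they can reduce collective performance.
os.environ.setdefault("NCCL_BUFFSIZE", str(2 * 1024 * 1024))  # per-channel buffer, default 4 MiB
os.environ.setdefault("NCCL_MAX_NCHANNELS", "8")              # fewer channels -> fewer buffers

Per the comment above, the application-side reductions are the first thing to try; capping NCCL's buffers is a second-line measure because it trades communication bandwidth for memory.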