hellangleZ closed this issue 3 days ago
Same error here.
This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.
This issue was closed because it has been inactive for 7 days since being marked as stale.
Even with 24.05.01, I am still stuck in this state.
Another SFT script, the one referenced in the official documentation, also fails with this error:
https://docs.nvidia.com/nemo-framework/user-guide/latest/modelalignment/sft.html#step-2-sft-training
Traceback (most recent call last):
  File "/opt/NeMo-Aligner/examples/nlp/gpt/train_gpt_sft.py", line 243, in <module>
    main()
  File "/opt/NeMo/nemo/core/config/hydra_runner.py", line 129, in wrapper
    _run_hydra(
  File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 394, in _run_hydra
    _run_app(
  File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 457, in _run_app
    run_and_report(
  File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 223, in run_and_report
    raise ex
  File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 220, in run_and_report
    return func()
  File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 458, in <lambda>
    lambda: hydra.run(
  File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/hydra.py", line 132, in run
    _ = ret.return_value
  File "/usr/local/lib/python3.10/dist-packages/hydra/core/utils.py", line 260, in return_value
    raise self._return_value
  File "/usr/local/lib/python3.10/dist-packages/hydra/core/utils.py", line 186, in run_job
    ret.return_value = task_function(task_cfg)
  File "/opt/NeMo-Aligner/examples/nlp/gpt/train_gpt_sft.py", line 129, in main
    ptl_model, updated_cfg = load_from_nemo(
  File "/opt/NeMo-Aligner/nemo_aligner/utils/utils.py", line 98, in load_from_nemo
    model = cls.restore_from(
  File "/opt/NeMo/nemo/collections/nlp/models/nlp_model.py", line 465, in restore_from
    return super().restore_from(
  File "/opt/NeMo/nemo/core/classes/modelPT.py", line 464, in restore_from
    instance = cls._save_restore_connector.restore_from(
  File "/opt/NeMo-Aligner/nemo_aligner/utils/utils.py", line 51, in restore_from
    return super().restore_from(*args, **kwargs)
  File "/opt/NeMo/nemo/collections/nlp/parts/nlp_overrides.py", line 1172, in restore_from
    checkpoint = checkpoint_io.load_checkpoint(tmp_model_weights_dir, sharded_state_dict=checkpoint)
  File "/opt/NeMo/nemo/utils/callbacks/dist_ckpt_io.py", line 78, in load_checkpoint
    return dist_checkpointing.load(
  File "/opt/megatron-lm/megatron/core/dist_checkpointing/serialization.py", line 133, in load
    validate_sharding_integrity(nested_values(sharded_state_dict))
  File "/opt/megatron-lm/megatron/core/dist_checkpointing/serialization.py", line 425, in validate_sharding_integrity
    torch.distributed.all_gather_object(all_sharding, sharding)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 2310, in all_gather_object
    all_gather(object_size_list, local_size, group=group)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 2724, in all_gather
    work = default_pg.allgather([tensor_list], [tensor])
torch.distributed.DistBackendError: NCCL error in: /opt/pytorch/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1961, internal error - please report this issue to the NCCL developers, NCCL version 2.19.3
ncclInternalError: Internal check failed.
Last error:
Attribute busid of node nic not found