NVIDIA / NeMo

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)
https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html
Apache License 2.0

Multi-GPU LLM training does not work, even with the official Docker image #9709

Closed hellangleZ closed 3 days ago

hellangleZ commented 1 month ago

Even when I use 24.05.01, it is still stuck in this state:

[screenshot attached]

Using another SFT script, referenced in the official documentation, also fails with the error below:

https://docs.nvidia.com/nemo-framework/user-guide/latest/modelalignment/sft.html#step-2-sft-training

Traceback (most recent call last):
  File "/opt/NeMo-Aligner/examples/nlp/gpt/train_gpt_sft.py", line 243, in <module>
    main()
  File "/opt/NeMo/nemo/core/config/hydra_runner.py", line 129, in wrapper
    _run_hydra(
  File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 394, in _run_hydra
    _run_app(
  File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 457, in _run_app
    run_and_report(
  File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 223, in run_and_report
    raise ex
  File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 220, in run_and_report
    return func()
  File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 458, in <lambda>
    lambda: hydra.run(
  File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/hydra.py", line 132, in run
    _ = ret.return_value
  File "/usr/local/lib/python3.10/dist-packages/hydra/core/utils.py", line 260, in return_value
    raise self._return_value
  File "/usr/local/lib/python3.10/dist-packages/hydra/core/utils.py", line 186, in run_job
    ret.return_value = task_function(task_cfg)
  File "/opt/NeMo-Aligner/examples/nlp/gpt/train_gpt_sft.py", line 129, in main
    ptl_model, updated_cfg = load_from_nemo(
  File "/opt/NeMo-Aligner/nemo_aligner/utils/utils.py", line 98, in load_from_nemo
    model = cls.restore_from(
  File "/opt/NeMo/nemo/collections/nlp/models/nlp_model.py", line 465, in restore_from
    return super().restore_from(
  File "/opt/NeMo/nemo/core/classes/modelPT.py", line 464, in restore_from
    instance = cls._save_restore_connector.restore_from(
  File "/opt/NeMo-Aligner/nemo_aligner/utils/utils.py", line 51, in restore_from
    return super().restore_from(*args, **kwargs)
  File "/opt/NeMo/nemo/collections/nlp/parts/nlp_overrides.py", line 1172, in restore_from
    checkpoint = checkpoint_io.load_checkpoint(tmp_model_weights_dir, sharded_state_dict=checkpoint)
  File "/opt/NeMo/nemo/utils/callbacks/dist_ckpt_io.py", line 78, in load_checkpoint
    return dist_checkpointing.load(
  File "/opt/megatron-lm/megatron/core/dist_checkpointing/serialization.py", line 133, in load
    validate_sharding_integrity(nested_values(sharded_state_dict))
  File "/opt/megatron-lm/megatron/core/dist_checkpointing/serialization.py", line 425, in validate_sharding_integrity
    torch.distributed.all_gather_object(all_sharding, sharding)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 2310, in all_gather_object
    all_gather(object_size_list, local_size, group=group)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 2724, in all_gather
    work = default_pg.allgather([tensor_list], [tensor])
torch.distributed.DistBackendError: NCCL error in: /opt/pytorch/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1961, internal error - please report this issue to the NCCL developers, NCCL version 2.19.3
ncclInternalError: Internal check failed.
Last error: Attribute busid of node nic not found
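For context on where this dies: while restoring the checkpoint, Megatron-Core's dist_checkpointing validates sharding by running torch.distributed.all_gather_object across all ranks, and that collective fails inside NCCL's topology detection ("Attribute busid of node nic not found") before NeMo-specific code ever gets to run. The snippet below is not from the issue; it is a minimal sketch that exercises the same collective outside NeMo (the file name nccl_sanity_check.py and the launch command are illustrative assumptions), which can help tell an NCCL/environment problem apart from a framework problem.

    # nccl_sanity_check.py -- hypothetical standalone check, not part of NeMo.
    # Launch with something like: torchrun --nproc_per_node=<num_gpus> nccl_sanity_check.py
    import os

    import torch
    import torch.distributed as dist


    def main():
        # NCCL_DEBUG=INFO makes NCCL log its topology detection, which is the
        # phase where "Attribute busid of node nic not found" is raised.
        os.environ.setdefault("NCCL_DEBUG", "INFO")

        # torchrun sets RANK, WORLD_SIZE and LOCAL_RANK for each process.
        dist.init_process_group(backend="nccl")
        local_rank = int(os.environ["LOCAL_RANK"])
        torch.cuda.set_device(local_rank)

        # Same collective that fails in validate_sharding_integrity: every rank
        # contributes a Python object, and all ranks gather the full list.
        payload = {"rank": dist.get_rank(), "device": torch.cuda.current_device()}
        gathered = [None] * dist.get_world_size()
        dist.all_gather_object(gathered, payload)

        if dist.get_rank() == 0:
            print("all_gather_object succeeded:", gathered)

        dist.destroy_process_group()


    if __name__ == "__main__":
        main()

If this standalone script fails with the same busid error inside the same container, the problem is likely in NCCL's view of the node's network interfaces rather than in NeMo or NeMo-Aligner.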

Mercury7353 commented 1 month ago

Same error here.

github-actions[bot] commented 1 week ago

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions[bot] commented 3 days ago

This issue was closed because it has been inactive for 7 days since being marked as stale.