InternLM / xtuner

An efficient, flexible and full-featured toolkit for fine-tuning LLM (InternLM2, Llama3, Phi3, Qwen, Mistral, ...)
https://xtuner.readthedocs.io/zh-cn/latest/
Apache License 2.0

Single-machine multi-GPU training hangs, and the logs don't reveal the problem #792

Open apachemycat opened 3 months ago

apachemycat commented 3 months ago

06/26 11:07:50 - mmengine - INFO -

System environment:
    sys.platform: linux
    Python: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0]
    CUDA available: True
    MUSA available: False
    numpy_random_seed: 1070453503
    GPU 0,1: NVIDIA L40S
    CUDA_HOME: /usr/local/cuda
    NVCC: Cuda compilation tools, release 12.3, V12.3.103
    GCC: x86_64-linux-gnu-gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
    PyTorch: 2.4.0.dev20240507+cu121
    PyTorch compiling details: PyTorch built with:

Runtime environment:
    cudnn_benchmark: False
    mp_cfg: {'mp_start_method': 'fork', 'opencv_num_threads': 0}
    dist_cfg: {'backend': 'nccl'}
    seed: 1070453503
    deterministic: False
    Distributed launcher: pytorch
    Distributed training: True
    GPU number: 2

I0626 11:47:13.862000 140300452140160 torch/distributed/launcher/api.py:188] Starting elastic_operator with launch configs:
I0626 11:47:13.862000 140300452140160 torch/distributed/launcher/api.py:188] entrypoint : /usr/local/lib/python3.10/dist-packages/xtuner/tools/train.py
I0626 11:47:13.862000 140300452140160 torch/distributed/launcher/api.py:188] min_nodes : 1
I0626 11:47:13.862000 140300452140160 torch/distributed/launcher/api.py:188] max_nodes : 1
I0626 11:47:13.862000 140300452140160 torch/distributed/launcher/api.py:188] nproc_per_node : 2
I0626 11:47:13.862000 140300452140160 torch/distributed/launcher/api.py:188] run_id : none
I0626 11:47:13.862000 140300452140160 torch/distributed/launcher/api.py:188] rdzv_backend : static
I0626 11:47:13.862000 140300452140160 torch/distributed/launcher/api.py:188] rdzv_endpoint : 127.0.0.1:28346
I0626 11:47:13.862000 140300452140160 torch/distributed/launcher/api.py:188] rdzv_configs : {'rank': 0, 'timeout': 900}
I0626 11:47:13.862000 140300452140160 torch/distributed/launcher/api.py:188] max_restarts : 0
I0626 11:47:13.862000 140300452140160 torch/distributed/launcher/api.py:188] monitor_interval : 0.1
I0626 11:47:13.862000 140300452140160 torch/distributed/launcher/api.py:188] log_dir : /tmp/torchelastic_yoyanqm0
I0626 11:47:13.862000 140300452140160 torch/distributed/launcher/api.py:188] metrics_cfg : {}
I0626 11:47:13.862000 140300452140160 torch/distributed/launcher/api.py:188]
I0626 11:47:13.863000 140300452140160 torch/distributed/elastic/agent/server/api.py:869] [default] starting workers for entrypoint: python3
I0626 11:47:13.863000 140300452140160 torch/distributed/elastic/agent/server/api.py:702] [default] Rendezvous'ing worker group
I0626 11:47:13.866000 140300452140160 torch/distributed/elastic/agent/server/api.py:568] [default] Rendezvous complete for workers. Result:
I0626 11:47:13.866000 140300452140160 torch/distributed/elastic/agent/server/api.py:568] restart_count=0
I0626 11:47:13.866000 140300452140160 torch/distributed/elastic/agent/server/api.py:568] master_addr=127.0.0.1
I0626 11:47:13.866000 140300452140160 torch/distributed/elastic/agent/server/api.py:568] master_port=28346
I0626 11:47:13.866000 140300452140160 torch/distributed/elastic/agent/server/api.py:568] group_rank=0
I0626 11:47:13.866000 140300452140160 torch/distributed/elastic/agent/server/api.py:568] group_world_size=1
I0626 11:47:13.866000 140300452140160 torch/distributed/elastic/agent/server/api.py:568] local_ranks=[0, 1]
I0626 11:47:13.866000 140300452140160 torch/distributed/elastic/agent/server/api.py:568] role_ranks=[0, 1]
I0626 11:47:13.866000 140300452140160 torch/distributed/elastic/agent/server/api.py:568] global_ranks=[0, 1]
I0626 11:47:13.866000 140300452140160 torch/distributed/elastic/agent/server/api.py:568] role_world_sizes=[2, 2]
I0626 11:47:13.866000 140300452140160 torch/distributed/elastic/agent/server/api.py:568] global_world_sizes=[2, 2]
I0626 11:47:13.866000 140300452140160 torch/distributed/elastic/agent/server/api.py:568]
I0626 11:47:13.866000 140300452140160 torch/distributed/elastic/agent/server/api.py:710] [default] Starting worker group
I0626 11:47:13.867000 140300452140160 torch/distributed/elastic/agent/server/local_elastic_agent.py:184] Environment variable 'TORCHELASTIC_ENABLE_FILE_TIMER' not found. Do not start FileTimerServer.
I0626 11:47:13.867000 140300452140160 torch/distributed/elastic/agent/server/local_elastic_agent.py:216] Environment variable 'TORCHELASTIC_HEALTH_CHECK_PORT' not found. Do not start health check.
/usr/local/lib/python3.10/dist-packages/transformers/utils/hub.py:124: FutureWarning: Using TRANSFORMERS_CACHE is deprecated and will be removed in v5 of Transformers. Use HF_HOME instead.
warnings.warn(
/usr/local/lib/python3.10/dist-packages/transformers/utils/hub.py:124: FutureWarning: Using TRANSFORMERS_CACHE is deprecated and will be removed in v5 of Transformers. Use HF_HOME instead.
warnings.warn(
/usr/local/lib/python3.10/dist-packages/mmengine/utils/dl_utils/setup_env.py:56: UserWarning: Setting MKL_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
warnings.warn(
/usr/local/lib/python3.10/dist-packages/mmengine/utils/dl_utils/setup_env.py:56: UserWarning: Setting MKL_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
warnings.warn(
demo-ai-xtuner-pod:210:210 [0] NCCL INFO Bootstrap : Using eth0:197.166.199.168<0>
demo-ai-xtuner-pod:210:210 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
demo-ai-xtuner-pod:210:210 [0] NCCL INFO cudaDriverVersion 12040
NCCL version 2.20.5+cuda12.4
demo-ai-xtuner-pod:211:211 [1] NCCL INFO cudaDriverVersion 12040
demo-ai-xtuner-pod:211:211 [1] NCCL INFO Bootstrap : Using eth0:197.166.199.168<0>
demo-ai-xtuner-pod:211:211 [1] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
demo-ai-xtuner-pod:210:227 [0] NCCL INFO Failed to open libibverbs.so[.1]
demo-ai-xtuner-pod:210:227 [0] NCCL INFO NET/Socket : Using [0]eth0:197.166.199.168<0>
demo-ai-xtuner-pod:210:227 [0] NCCL INFO Using non-device net plugin version 0
demo-ai-xtuner-pod:210:227 [0] NCCL INFO Using network Socket
demo-ai-xtuner-pod:211:228 [1] NCCL INFO Failed to open libibverbs.so[.1]
demo-ai-xtuner-pod:211:228 [1] NCCL INFO NET/Socket : Using [0]eth0:197.166.199.168<0>
demo-ai-xtuner-pod:211:228 [1] NCCL INFO Using non-device net plugin version 0
demo-ai-xtuner-pod:211:228 [1] NCCL INFO Using network Socket
demo-ai-xtuner-pod:211:228 [1] NCCL INFO comm 0x55763901fad0 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 1a000 commId 0x97541cd9f2e1e519 - Init START
demo-ai-xtuner-pod:210:227 [0] NCCL INFO comm 0x5604fbe23ae0 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 18000 commId 0x97541cd9f2e1e519 - Init START
demo-ai-xtuner-pod:210:227 [0] NCCL INFO Setting affinity for GPU 0 to ffffffff
demo-ai-xtuner-pod:211:228 [1] NCCL INFO Setting affinity for GPU 1 to ffffffff
demo-ai-xtuner-pod:210:227 [0] NCCL INFO comm 0x5604fbe23ae0 rank 0 nRanks 2 nNodes 1 localRanks 2 localRank 0 MNNVL 0
demo-ai-xtuner-pod:211:228 [1] NCCL INFO comm 0x55763901fad0 rank 1 nRanks 2 nNodes 1 localRanks 2 localRank 1 MNNVL 0
demo-ai-xtuner-pod:210:227 [0] NCCL INFO Channel 00/04 : 0 1
demo-ai-xtuner-pod:210:227 [0] NCCL INFO Channel 01/04 : 0 1
demo-ai-xtuner-pod:210:227 [0] NCCL INFO Channel 02/04 : 0 1
demo-ai-xtuner-pod:211:228 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] 0/-1/-1->1->-1 [2] -1/-1/-1->1->0 [3] 0/-1/-1->1->-1
demo-ai-xtuner-pod:210:227 [0] NCCL INFO Channel 03/04 : 0 1
demo-ai-xtuner-pod:211:228 [1] NCCL INFO P2P Chunksize set to 131072
demo-ai-xtuner-pod:210:227 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] -1/-1/-1->0->1 [2] 1/-1/-1->0->-1 [3] -1/-1/-1->0->1
demo-ai-xtuner-pod:210:227 [0] NCCL INFO P2P Chunksize set to 131072
demo-ai-xtuner-pod:210:227 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/CUMEM
demo-ai-xtuner-pod:211:228 [1] NCCL INFO Channel 00/0 : 1[1] -> 0[0] via P2P/CUMEM
demo-ai-xtuner-pod:210:227 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] via P2P/CUMEM
demo-ai-xtuner-pod:211:228 [1] NCCL INFO Channel 01/0 : 1[1] -> 0[0] via P2P/CUMEM
demo-ai-xtuner-pod:210:227 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[1] via P2P/CUMEM
demo-ai-xtuner-pod:211:228 [1] NCCL INFO Channel 02/0 : 1[1] -> 0[0] via P2P/CUMEM
demo-ai-xtuner-pod:210:227 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[1] via P2P/CUMEM
demo-ai-xtuner-pod:211:228 [1] NCCL INFO Channel 03/0 : 1[1] -> 0[0] via P2P/CUMEM
demo-ai-xtuner-pod:211:228 [1] NCCL INFO Connected all rings
demo-ai-xtuner-pod:211:228 [1] NCCL INFO Connected all trees
demo-ai-xtuner-pod:211:228 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
demo-ai-xtuner-pod:210:227 [0] NCCL INFO Connected all rings
demo-ai-xtuner-pod:211:228 [1] NCCL INFO 4 coll channels, 0 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
demo-ai-xtuner-pod:210:227 [0] NCCL INFO Connected all trees
demo-ai-xtuner-pod:210:227 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
demo-ai-xtuner-pod:210:227 [0] NCCL INFO 4 coll channels, 0 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
demo-ai-xtuner-pod:211:228 [1] NCCL INFO comm 0x55763901fad0 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 1a000 commId 0x97541cd9f2e1e519 - Init COMPLETE
demo-ai-xtuner-pod:210:227 [0] NCCL INFO comm 0x5604fbe23ae0 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 18000 commId 0x97541cd9f2e1e519 - Init COMPLETE
06/26 11:47:18 - mmengine - INFO -

work_dir = '/models/meta-Llama-3-8B-xtuner-trainer/demo-ai-xtuner-pod/train-work'

06/26 11:47:18 - mmengine - DEBUG - Get class `Visualizer` from "visualizer" registry in "mmengine"
06/26 11:47:18 - mmengine - DEBUG - Get class `TensorboardVisBackend` from "vis_backend" registry in "mmengine"
06/26 11:47:18 - mmengine - DEBUG - An `TensorboardVisBackend` instance is built from registry, and its implementation can be found in mmengine.visualization.vis_backend
06/26 11:47:18 - mmengine - DEBUG - An `Visualizer` instance is built from registry, and its implementation can be found in mmengine.visualization.visualizer
06/26 11:47:18 - mmengine - DEBUG - Attribute `_env_initialized` is not defined in <class 'mmengine.visualization.vis_backend.TensorboardVisBackend'> or <class 'mmengine.visualization.vis_backend.TensorboardVisBackend'>._env_initialized is False, `_init_env` will be called and <class 'mmengine.visualization.vis_backend.TensorboardVisBackend'>._env_initialized will be set to True
06/26 11:47:18 - mmengine - DEBUG - Get class `BaseDataPreprocessor` from "model" registry in "mmengine"
06/26 11:47:18 - mmengine - DEBUG - An `BaseDataPreprocessor` instance is built from registry, and its implementation can be found in mmengine.model.base_model.data_preprocessor
quantization_config convert to <class 'transformers.utils.quantization_config.BitsAndBytesConfig'>
06/26 11:47:18 - mmengine - WARNING - Failed to search registry with scope "mmengine" in the "builder" registry tree. As a workaround, the current "builder" registry in "xtuner" is used to build instance. This may cause unexpected failure when running the built modules. Please check whether "mmengine" is a correct scope, or whether the registry is initialized.
`low_cpu_mem_usage` was None, now set to True since model is quantized.
Loading checkpoint shards: 100%|██████████| 4/4 [00:11<00:00, 2.95s/it]
06/26 11:47:30 - mmengine - DEBUG - An `from_pretrained` instance is built from registry, and its implementation can be found in transformers.models.auto.auto_factory
06/26 11:47:30 - mmengine - DEBUG - An `LoraConfig` instance is built from registry, and its implementation can be found in peft.tuners.lora.config
06/26 11:47:32 - mmengine - DEBUG - An `SupervisedFinetune` instance is built from registry, and its implementation can be found in xtuner.model.sft

It hangs at this point, and eventually fails with the timeout error below.

[rank1]:[E626 11:57:18.466822874 ProcessGroupNCCL.cpp:572] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600014 milliseconds before timing out.
[rank1]:[E626 11:57:18.469226529 ProcessGroupNCCL.cpp:1587] [PG 0 (default_pg) Rank 1] Exception (either an error or timeout) detected by watchdog at work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
demo-ai-xtuner-pod:211:231 [1] NCCL INFO [Service thread] Connection closed by localRank 1
demo-ai-xtuner-pod:211:224 [0] NCCL INFO comm 0x55763901fad0 rank 1 nranks 2 cudaDev 1 busId 1a000 - Abort COMPLETE
[rank1]:[E626 11:57:18.675112321 ProcessGroupNCCL.cpp:1632] [PG 0 (default_pg) Rank 1] Timeout at NCCL work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank1]:[E626 11:57:18.675128694 ProcessGroupNCCL.cpp:586] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E626 11:57:18.675133603 ProcessGroupNCCL.cpp:592] [Rank 1] To avoid data inconsistency, we are taking the entire process down.
[rank1]:[E626 11:57:18.675167441 ProcessGroupNCCL.cpp:1448] [PG 0 (default_pg) Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600014 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:574 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fc187779017 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7fc13a08f582 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x247 (0x7fc13a0964b7 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fc13a0982bc in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xdc253 (0x7fc186eb0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: + 0x94ac3 (0x7fc18c8fdac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #6: clone + 0x44 (0x7fc18c98ebf4 in /usr/lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG 0 (default_pg) Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600014 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:574 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fc187779017 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7fc13a08f582 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x247 (0x7fc13a0964b7 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fc13a0982bc in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xdc253 (0x7fc186eb0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: + 0x94ac3 (0x7fc18c8fdac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #6: clone + 0x44 (0x7fc18c98ebf4 in /usr/lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1452 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fc187779017 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: + 0xe549a5 (0x7fc139ce99a5 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: + 0xdc253 (0x7fc186eb0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #3: + 0x94ac3 (0x7fc18c8fdac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #4: clone + 0x44 (0x7fc18c98ebf4 in /usr/lib/x86_64-linux-gnu/libc.so.6)

W0626 11:57:21.928000 140300452140160 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 210 closing signal SIGTERM
E0626 11:57:22.092000 140300452140160 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: -6) local_rank: 1 (pid: 211) of binary: /usr/bin/python3
I0626 11:57:22.096000 140300452140160 torch/distributed/elastic/multiprocessing/errors/__init__.py:360] ('local_rank %s FAILED with no error file. Decorate your entrypoint fn with @record for traceback info. See: https://pytorch.org/docs/stable/elastic/errors.html', 1)
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 900, in main
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 891, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

/usr/local/lib/python3.10/dist-packages/xtuner/tools/train.py FAILED

Failures:

Entering the container, I confirmed that two training worker processes were indeed started:

root@demo-ai-xtuner-pod:/app# ps -efwww
UID PID PPID C STIME TTY TIME CMD
root 1 0 0 12:03 ? 00:00:00 /bin/bash /models/meta-Llama-3-8B-xtuner-trainer/train-model.sh
root 10 1 86 12:03 ? 00:00:19 /usr/bin/python3 /usr/local/bin/xtuner train --work-dir /models/meta-Llama-3-8B-xtuner-trainer/demo-ai-xtuner-pod/train-work config.py
root 143 10 27 12:03 ? 00:00:05 /usr/bin/python3 /usr/local/bin/torchrun --nnodes=1 --node_rank=0 --nproc_per_node=gpu --master_addr=127.0.0.1 --master_port=25860 /usr/local/lib/python3.10/dist-packages/xtuner/tools/train.py --work-dir /models/meta-Llama-3-8B-xtuner-trainer/demo-ai-xtuner-pod/train-work config.py --launcher pytorch
root 209 143 99 12:03 ? 00:00:17 /usr/bin/python3 -u /usr/local/lib/python3.10/dist-packages/xtuner/tools/train.py --work-dir /models/meta-Llama-3-8B-xtuner-trainer/demo-ai-xtuner-pod/train-work config.py --launcher pytorch
root 210 143 99 12:03 ? 00:00:17 /usr/bin/python3 -u /usr/local/lib/python3.10/dist-packages/xtuner/tools/train.py --work-dir /models/meta-Llama-3-8B-xtuner-trainer/demo-ai-xtuner-pod/train-work config.py --launcher pytorch

However, the work directory contains a log file for only one process, 20240626_114717_root@demo-ai-xtuner-pod_device0_rank0.log. There is no rank1 log, and I don't know what to configure so that a rank1.log is produced.
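One way to capture rank 1's output is to let torchrun itself write per-rank log files; this is a sketch, not a confirmed xtuner option. Since the ps output above shows that `xtuner train` ultimately execs torchrun, the same command can be run directly with torchrun's standard `--log_dir`/`--redirects` flags (flag spellings may vary slightly between PyTorch versions, and the log directory below is just an example path):

```bash
# Sketch: reuse the torchrun invocation that xtuner generated, adding per-rank
# stdout/stderr capture. Each worker then gets its own file under --log_dir.
torchrun --nnodes=1 --node_rank=0 --nproc_per_node=gpu \
  --master_addr=127.0.0.1 --master_port=25860 \
  --log_dir=/models/meta-Llama-3-8B-xtuner-trainer/demo-ai-xtuner-pod/train-work/elastic_logs \
  --redirects=3 \
  /usr/local/lib/python3.10/dist-packages/xtuner/tools/train.py \
  --work-dir /models/meta-Llama-3-8B-xtuner-trainer/demo-ai-xtuner-pod/train-work \
  config.py --launcher pytorch
```

`--redirects=3` writes both stdout and stderr of every rank into files under `--log_dir`; using `--tee=3` instead keeps the console output as well. This only captures the workers' console output; mmengine itself still writes its own rank0 log file.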
apachemycat commented 3 months ago

It also hangs when launching with DeepSpeed.

06/27 08:41:09 - mmengine - DEBUG - An `FlexibleRunner` instance is built from registry, and its implementation can be found in mmengine.runner._flexible_runner
06/27 08:41:09 - mmengine - INFO - xtuner_dataset_timeout = 1:00:00

[rank1]:[E ProcessGroupNCCL.cpp:523] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800013 milliseconds before timing out.
demo-ai-xtuner-pod:227:287 [0] NCCL INFO [Service thread] Connection closed by localRank 1
demo-ai-xtuner-pod:228:286 [1] NCCL INFO [Service thread] Connection closed by localRank 1
demo-ai-xtuner-pod:228:280 [0] NCCL INFO comm 0x55e51d238c10 rank 1 nranks 2 cudaDev 1 busId 1a000 - Abort COMPLETE
[rank1]:[E ProcessGroupNCCL.cpp:537] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E ProcessGroupNCCL.cpp:543] To avoid data inconsistency, we are taking the entire process down.
[rank1]:[E ProcessGroupNCCL.cpp:1182] [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800013 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f0e76b81d87 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e6 (0x7f0e2c5d66e6 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x19d (0x7f0e2c5d9c3d in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7f0e2c5da839 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xdc253 (0x7f0e762b0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: + 0x94ac3 (0x7f0e7bd16ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #6: clone + 0x44 (0x7f0e7bda7bf4 in /usr/lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
what(): [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800013 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f0e76b81d87 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e6 (0x7f0e2c5d66e6 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x19d (0x7f0e2c5d9c3d in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7f0e2c5da839 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xdc253 (0x7f0e762b0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: + 0x94ac3 (0x7f0e7bd16ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #6: clone + 0x44 (0x7f0e7bda7bf4 in /usr/lib/x86_64-linux-gnu/libc.so.6)
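This is not a verified fix, but one observation from both runs: the very first collective (OpType=BROADCAST, SeqNum=1) never completes even though NCCL init reports "Init COMPLETE" over the P2P/CUMEM path. A common next step is to rule out broken GPU peer-to-peer access inside the pod using standard NCCL environment variables; whether this resolves this particular hang is an assumption to verify:

```bash
# Diagnostic sketch, not a confirmed solution for this issue.
export NCCL_DEBUG=INFO        # more verbose NCCL logging on every rank
export NCCL_P2P_DISABLE=1     # force NCCL off the P2P/CUMEM path (often broken
                              # in VMs/containers with ACS/IOMMU enabled)
# export NCCL_SHM_DISABLE=1   # optionally also rule out /dev/shm transport issues
xtuner train config.py \
  --work-dir /models/meta-Llama-3-8B-xtuner-trainer/demo-ai-xtuner-pod/train-work
```

If training proceeds with NCCL_P2P_DISABLE=1, the hang is a P2P/IOMMU problem on the host or pod rather than anything in the xtuner config; `nvidia-smi topo -m` and the node's ACS settings are then worth inspecting.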
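It can also help to take xtuner/mmengine out of the picture and check whether a bare torch.distributed broadcast between the two GPUs completes at all. A minimal sketch (the script path and port below are arbitrary):

```bash
cat > /tmp/nccl_broadcast_check.py <<'EOF'
# Minimal NCCL check: the same kind of single-element broadcast that times out
# in the xtuner run, with nothing else involved.
import os
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")          # reads env vars set by torchrun
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

x = torch.tensor([dist.get_rank()], device="cuda")
dist.broadcast(x, src=0)                         # rank 1 should receive 0
torch.cuda.synchronize()
print(f"rank {dist.get_rank()}: broadcast OK, value={x.item()}", flush=True)
dist.destroy_process_group()
EOF

torchrun --nnodes=1 --nproc_per_node=2 \
  --master_addr=127.0.0.1 --master_port=29501 /tmp/nccl_broadcast_check.py
```

If this also hangs and dies with the same watchdog timeout, the problem is the environment (P2P, shared memory, or container networking) rather than the training config; if it passes, the hang is more likely happening during dataset/model setup on one of the ranks.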

bjzhb666 commented 1 month ago

How did you solve this in the end? Thanks.