InternLM / xtuner

An efficient, flexible and full-featured toolkit for fine-tuning LLM (InternLM2, Llama3, Phi3, Qwen, Mistral, ...)
https://xtuner.readthedocs.io/zh-cn/latest/
Apache License 2.0

Single-machine multi-GPU training hangs, and the logs don't reveal the problem #792

Open apachemycat opened 3 months ago

apachemycat commented 3 months ago

06/26 11:07:50 - mmengine - INFO -

System environment:
    sys.platform: linux
    Python: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0]
    CUDA available: True
    MUSA available: False
    numpy_random_seed: 1070453503
    GPU 0,1: NVIDIA L40S
    CUDA_HOME: /usr/local/cuda
    NVCC: Cuda compilation tools, release 12.3, V12.3.103
    GCC: x86_64-linux-gnu-gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
    PyTorch: 2.4.0.dev20240507+cu121
    PyTorch compiling details: PyTorch built with:

Runtime environment:
    cudnn_benchmark: False
    mp_cfg: {'mp_start_method': 'fork', 'opencv_num_threads': 0}
    dist_cfg: {'backend': 'nccl'}
    seed: 1070453503
    deterministic: False
    Distributed launcher: pytorch
    Distributed training: True
    GPU number: 2

I0626 11:47:13.862000 140300452140160 torch/distributed/launcher/api.py:188] Starting elastic_operator with launch configs:
I0626 11:47:13.862000 140300452140160 torch/distributed/launcher/api.py:188] entrypoint : /usr/local/lib/python3.10/dist-packages/xtuner/tools/train.py
I0626 11:47:13.862000 140300452140160 torch/distributed/launcher/api.py:188] min_nodes : 1
I0626 11:47:13.862000 140300452140160 torch/distributed/launcher/api.py:188] max_nodes : 1
I0626 11:47:13.862000 140300452140160 torch/distributed/launcher/api.py:188] nproc_per_node : 2
I0626 11:47:13.862000 140300452140160 torch/distributed/launcher/api.py:188] run_id : none
I0626 11:47:13.862000 140300452140160 torch/distributed/launcher/api.py:188] rdzv_backend : static
I0626 11:47:13.862000 140300452140160 torch/distributed/launcher/api.py:188] rdzv_endpoint : 127.0.0.1:28346
I0626 11:47:13.862000 140300452140160 torch/distributed/launcher/api.py:188] rdzv_configs : {'rank': 0, 'timeout': 900}
I0626 11:47:13.862000 140300452140160 torch/distributed/launcher/api.py:188] max_restarts : 0
I0626 11:47:13.862000 140300452140160 torch/distributed/launcher/api.py:188] monitor_interval : 0.1
I0626 11:47:13.862000 140300452140160 torch/distributed/launcher/api.py:188] log_dir : /tmp/torchelastic_yoyanqm0
I0626 11:47:13.862000 140300452140160 torch/distributed/launcher/api.py:188] metrics_cfg : {}
I0626 11:47:13.862000 140300452140160 torch/distributed/launcher/api.py:188]
I0626 11:47:13.863000 140300452140160 torch/distributed/elastic/agent/server/api.py:869] [default] starting workers for entrypoint: python3
I0626 11:47:13.863000 140300452140160 torch/distributed/elastic/agent/server/api.py:702] [default] Rendezvous'ing worker group
I0626 11:47:13.866000 140300452140160 torch/distributed/elastic/agent/server/api.py:568] [default] Rendezvous complete for workers. Result:
I0626 11:47:13.866000 140300452140160 torch/distributed/elastic/agent/server/api.py:568] restart_count=0
I0626 11:47:13.866000 140300452140160 torch/distributed/elastic/agent/server/api.py:568] master_addr=127.0.0.1
I0626 11:47:13.866000 140300452140160 torch/distributed/elastic/agent/server/api.py:568] master_port=28346
I0626 11:47:13.866000 140300452140160 torch/distributed/elastic/agent/server/api.py:568] group_rank=0
I0626 11:47:13.866000 140300452140160 torch/distributed/elastic/agent/server/api.py:568] group_world_size=1
I0626 11:47:13.866000 140300452140160 torch/distributed/elastic/agent/server/api.py:568] local_ranks=[0, 1]
I0626 11:47:13.866000 140300452140160 torch/distributed/elastic/agent/server/api.py:568] role_ranks=[0, 1]
I0626 11:47:13.866000 140300452140160 torch/distributed/elastic/agent/server/api.py:568] global_ranks=[0, 1]
I0626 11:47:13.866000 140300452140160 torch/distributed/elastic/agent/server/api.py:568] role_world_sizes=[2, 2]
I0626 11:47:13.866000 140300452140160 torch/distributed/elastic/agent/server/api.py:568] global_world_sizes=[2, 2]
I0626 11:47:13.866000 140300452140160 torch/distributed/elastic/agent/server/api.py:568]
I0626 11:47:13.866000 140300452140160 torch/distributed/elastic/agent/server/api.py:710] [default] Starting worker group
I0626 11:47:13.867000 140300452140160 torch/distributed/elastic/agent/server/local_elastic_agent.py:184] Environment variable 'TORCHELASTIC_ENABLE_FILE_TIMER' not found. Do not start FileTimerServer.
I0626 11:47:13.867000 140300452140160 torch/distributed/elastic/agent/server/local_elastic_agent.py:216] Environment variable 'TORCHELASTIC_HEALTH_CHECK_PORT' not found. Do not start health check.
/usr/local/lib/python3.10/dist-packages/transformers/utils/hub.py:124: FutureWarning: Using TRANSFORMERS_CACHE is deprecated and will be removed in v5 of Transformers. Use HF_HOME instead.
warnings.warn(
/usr/local/lib/python3.10/dist-packages/transformers/utils/hub.py:124: FutureWarning: Using TRANSFORMERS_CACHE is deprecated and will be removed in v5 of Transformers. Use HF_HOME instead.
warnings.warn(
/usr/local/lib/python3.10/dist-packages/mmengine/utils/dl_utils/setup_env.py:56: UserWarning: Setting MKL_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
warnings.warn(
/usr/local/lib/python3.10/dist-packages/mmengine/utils/dl_utils/setup_env.py:56: UserWarning: Setting MKL_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
warnings.warn(
demo-ai-xtuner-pod:210:210 [0] NCCL INFO Bootstrap : Using eth0:197.166.199.168<0>
demo-ai-xtuner-pod:210:210 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
demo-ai-xtuner-pod:210:210 [0] NCCL INFO cudaDriverVersion 12040
NCCL version 2.20.5+cuda12.4
demo-ai-xtuner-pod:211:211 [1] NCCL INFO cudaDriverVersion 12040
demo-ai-xtuner-pod:211:211 [1] NCCL INFO Bootstrap : Using eth0:197.166.199.168<0>
demo-ai-xtuner-pod:211:211 [1] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
demo-ai-xtuner-pod:210:227 [0] NCCL INFO Failed to open libibverbs.so[.1]
demo-ai-xtuner-pod:210:227 [0] NCCL INFO NET/Socket : Using [0]eth0:197.166.199.168<0>
demo-ai-xtuner-pod:210:227 [0] NCCL INFO Using non-device net plugin version 0
demo-ai-xtuner-pod:210:227 [0] NCCL INFO Using network Socket
demo-ai-xtuner-pod:211:228 [1] NCCL INFO Failed to open libibverbs.so[.1]
demo-ai-xtuner-pod:211:228 [1] NCCL INFO NET/Socket : Using [0]eth0:197.166.199.168<0>
demo-ai-xtuner-pod:211:228 [1] NCCL INFO Using non-device net plugin version 0
demo-ai-xtuner-pod:211:228 [1] NCCL INFO Using network Socket
demo-ai-xtuner-pod:211:228 [1] NCCL INFO comm 0x55763901fad0 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 1a000 commId 0x97541cd9f2e1e519 - Init START
demo-ai-xtuner-pod:210:227 [0] NCCL INFO comm 0x5604fbe23ae0 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 18000 commId 0x97541cd9f2e1e519 - Init START
demo-ai-xtuner-pod:210:227 [0] NCCL INFO Setting affinity for GPU 0 to ffffffff
demo-ai-xtuner-pod:211:228 [1] NCCL INFO Setting affinity for GPU 1 to ffffffff
demo-ai-xtuner-pod:210:227 [0] NCCL INFO comm 0x5604fbe23ae0 rank 0 nRanks 2 nNodes 1 localRanks 2 localRank 0 MNNVL 0
demo-ai-xtuner-pod:211:228 [1] NCCL INFO comm 0x55763901fad0 rank 1 nRanks 2 nNodes 1 localRanks 2 localRank 1 MNNVL 0
demo-ai-xtuner-pod:210:227 [0] NCCL INFO Channel 00/04 : 0 1
demo-ai-xtuner-pod:210:227 [0] NCCL INFO Channel 01/04 : 0 1
demo-ai-xtuner-pod:210:227 [0] NCCL INFO Channel 02/04 : 0 1
demo-ai-xtuner-pod:211:228 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] 0/-1/-1->1->-1 [2] -1/-1/-1->1->0 [3] 0/-1/-1->1->-1
demo-ai-xtuner-pod:210:227 [0] NCCL INFO Channel 03/04 : 0 1
demo-ai-xtuner-pod:211:228 [1] NCCL INFO P2P Chunksize set to 131072
demo-ai-xtuner-pod:210:227 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] -1/-1/-1->0->1 [2] 1/-1/-1->0->-1 [3] -1/-1/-1->0->1
demo-ai-xtuner-pod:210:227 [0] NCCL INFO P2P Chunksize set to 131072
demo-ai-xtuner-pod:210:227 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/CUMEM
demo-ai-xtuner-pod:211:228 [1] NCCL INFO Channel 00/0 : 1[1] -> 0[0] via P2P/CUMEM
demo-ai-xtuner-pod:210:227 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] via P2P/CUMEM
demo-ai-xtuner-pod:211:228 [1] NCCL INFO Channel 01/0 : 1[1] -> 0[0] via P2P/CUMEM
demo-ai-xtuner-pod:210:227 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[1] via P2P/CUMEM
demo-ai-xtuner-pod:211:228 [1] NCCL INFO Channel 02/0 : 1[1] -> 0[0] via P2P/CUMEM
demo-ai-xtuner-pod:210:227 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[1] via P2P/CUMEM
demo-ai-xtuner-pod:211:228 [1] NCCL INFO Channel 03/0 : 1[1] -> 0[0] via P2P/CUMEM
demo-ai-xtuner-pod:211:228 [1] NCCL INFO Connected all rings
demo-ai-xtuner-pod:211:228 [1] NCCL INFO Connected all trees
demo-ai-xtuner-pod:211:228 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
demo-ai-xtuner-pod:210:227 [0] NCCL INFO Connected all rings
demo-ai-xtuner-pod:211:228 [1] NCCL INFO 4 coll channels, 0 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
demo-ai-xtuner-pod:210:227 [0] NCCL INFO Connected all trees
demo-ai-xtuner-pod:210:227 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
demo-ai-xtuner-pod:210:227 [0] NCCL INFO 4 coll channels, 0 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
demo-ai-xtuner-pod:211:228 [1] NCCL INFO comm 0x55763901fad0 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 1a000 commId 0x97541cd9f2e1e519 - Init COMPLETE
demo-ai-xtuner-pod:210:227 [0] NCCL INFO comm 0x5604fbe23ae0 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 18000 commId 0x97541cd9f2e1e519 - Init COMPLETE
06/26 11:47:18 - mmengine - INFO -

work_dir = '/models/meta-Llama-3-8B-xtuner-trainer/demo-ai-xtuner-pod/train-work'

06/26 11:47:18 - mmengine - DEBUG - Get class `Visualizer` from "visualizer" registry in "mmengine"
06/26 11:47:18 - mmengine - DEBUG - Get class `TensorboardVisBackend` from "vis_backend" registry in "mmengine"
06/26 11:47:18 - mmengine - DEBUG - An `TensorboardVisBackend` instance is built from registry, and its implementation can be found in mmengine.visualization.vis_backend
06/26 11:47:18 - mmengine - DEBUG - An `Visualizer` instance is built from registry, and its implementation can be found in mmengine.visualization.visualizer
06/26 11:47:18 - mmengine - DEBUG - Attribute `_env_initialized` is not defined in <class 'mmengine.visualization.vis_backend.TensorboardVisBackend'> or <class 'mmengine.visualization.vis_backend.TensorboardVisBackend'>._env_initialized is False, `_init_env` will be called and <class 'mmengine.visualization.vis_backend.TensorboardVisBackend'>._env_initialized will be set to True
06/26 11:47:18 - mmengine - DEBUG - Get class `BaseDataPreprocessor` from "model" registry in "mmengine"
06/26 11:47:18 - mmengine - DEBUG - An `BaseDataPreprocessor` instance is built from registry, and its implementation can be found in mmengine.model.base_model.data_preprocessor
quantization_config convert to <class 'transformers.utils.quantization_config.BitsAndBytesConfig'>
06/26 11:47:18 - mmengine - WARNING - Failed to search registry with scope "mmengine" in the "builder" registry tree. As a workaround, the current "builder" registry in "xtuner" is used to build instance. This may cause unexpected failure when running the built modules. Please check whether "mmengine" is a correct scope, or whether the registry is initialized.
`low_cpu_mem_usage` was None, now set to True since model is quantized.
Loading checkpoint shards: 100%|██████████| 4/4 [00:11<00:00, 2.95s/it]
06/26 11:47:30 - mmengine - DEBUG - An `from_pretrained` instance is built from registry, and its implementation can be found in transformers.models.auto.auto_factory
06/26 11:47:30 - mmengine - DEBUG - An `LoraConfig` instance is built from registry, and its implementation can be found in peft.tuners.lora.config
06/26 11:47:32 - mmengine - DEBUG - An `SupervisedFinetune` instance is built from registry, and its implementation can be found in xtuner.model.sft

It hangs at this point, and eventually fails with the timeout error below.

[rank1]:[E626 11:57:18.466822874 ProcessGroupNCCL.cpp:572] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600014 milliseconds before timing out.
[rank1]:[E626 11:57:18.469226529 ProcessGroupNCCL.cpp:1587] [PG 0 (default_pg) Rank 1] Exception (either an error or timeout) detected by watchdog at work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
demo-ai-xtuner-pod:211:231 [1] NCCL INFO [Service thread] Connection closed by localRank 1
demo-ai-xtuner-pod:211:224 [0] NCCL INFO comm 0x55763901fad0 rank 1 nranks 2 cudaDev 1 busId 1a000 - Abort COMPLETE
[rank1]:[E626 11:57:18.675112321 ProcessGroupNCCL.cpp:1632] [PG 0 (default_pg) Rank 1] Timeout at NCCL work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank1]:[E626 11:57:18.675128694 ProcessGroupNCCL.cpp:586] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E626 11:57:18.675133603 ProcessGroupNCCL.cpp:592] [Rank 1] To avoid data inconsistency, we are taking the entire process down.
[rank1]:[E626 11:57:18.675167441 ProcessGroupNCCL.cpp:1448] [PG 0 (default_pg) Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600014 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:574 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fc187779017 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7fc13a08f582 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x247 (0x7fc13a0964b7 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fc13a0982bc in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xdc253 (0x7fc186eb0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: + 0x94ac3 (0x7fc18c8fdac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #6: clone + 0x44 (0x7fc18c98ebf4 in /usr/lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG 0 (default_pg) Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600014 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:574 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fc187779017 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7fc13a08f582 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x247 (0x7fc13a0964b7 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fc13a0982bc in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xdc253 (0x7fc186eb0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: + 0x94ac3 (0x7fc18c8fdac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #6: clone + 0x44 (0x7fc18c98ebf4 in /usr/lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1452 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fc187779017 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: + 0xe549a5 (0x7fc139ce99a5 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: + 0xdc253 (0x7fc186eb0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #3: + 0x94ac3 (0x7fc18c8fdac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #4: clone + 0x44 (0x7fc18c98ebf4 in /usr/lib/x86_64-linux-gnu/libc.so.6)

W0626 11:57:21.928000 140300452140160 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 210 closing signal SIGTERM
E0626 11:57:22.092000 140300452140160 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: -6) local_rank: 1 (pid: 211) of binary: /usr/bin/python3
I0626 11:57:22.096000 140300452140160 torch/distributed/elastic/multiprocessing/errors/__init__.py:360] ('local_rank %s FAILED with no error file. Decorate your entrypoint fn with @record for traceback info. See: https://pytorch.org/docs/stable/elastic/errors.html', 1)
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 900, in main
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 891, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

/usr/local/lib/python3.10/dist-packages/xtuner/tools/train.py FAILED

Failures:

Entering the container, I confirmed that two training worker processes were indeed started:

root@demo-ai-xtuner-pod:/app# ps -efwww
UID PID PPID C STIME TTY TIME CMD
root 1 0 0 12:03 ? 00:00:00 /bin/bash /models/meta-Llama-3-8B-xtuner-trainer/train-model.sh
root 10 1 86 12:03 ? 00:00:19 /usr/bin/python3 /usr/local/bin/xtuner train --work-dir /models/meta-Llama-3-8B-xtuner-trainer/demo-ai-xtuner-pod/train-work config.py
root 143 10 27 12:03 ? 00:00:05 /usr/bin/python3 /usr/local/bin/torchrun --nnodes=1 --node_rank=0 --nproc_per_node=gpu --master_addr=127.0.0.1 --master_port=25860 /usr/local/lib/python3.10/dist-packages/xtuner/tools/train.py --work-dir /models/meta-Llama-3-8B-xtuner-trainer/demo-ai-xtuner-pod/train-work config.py --launcher pytorch
root 209 143 99 12:03 ? 00:00:17 /usr/bin/python3 -u /usr/local/lib/python3.10/dist-packages/xtuner/tools/train.py --work-dir /models/meta-Llama-3-8B-xtuner-trainer/demo-ai-xtuner-pod/train-work config.py --launcher pytorch
root 210 143 99 12:03 ? 00:00:17 /usr/bin/python3 -u /usr/local/lib/python3.10/dist-packages/xtuner/tools/train.py --work-dir /models/meta-Llama-3-8B-xtuner-trainer/demo-ai-xtuner-pod/train-work config.py --launcher pytorch

However, the work directory contains a log file for only one process, 20240626_114717_root@demo-ai-xtuner-pod_device0_rank0.log. There is no rank1 log, and I don't know what to configure so that a rank1.log is produced.
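One way to capture rank 1's output is to let torchrun itself write per-rank log files; this is a sketch, not a confirmed xtuner option. Since the ps output above shows that `xtuner train` ultimately execs torchrun, the same command can be run directly with torchrun's standard `--log_dir`/`--redirects` flags (flag spellings may vary slightly between PyTorch versions, and the log directory below is just an example path):

```bash
# Sketch: reuse the torchrun invocation that xtuner generated, adding per-rank
# stdout/stderr capture. Each worker then gets its own file under --log_dir.
torchrun --nnodes=1 --node_rank=0 --nproc_per_node=gpu \
  --master_addr=127.0.0.1 --master_port=25860 \
  --log_dir=/models/meta-Llama-3-8B-xtuner-trainer/demo-ai-xtuner-pod/train-work/elastic_logs \
  --redirects=3 \
  /usr/local/lib/python3.10/dist-packages/xtuner/tools/train.py \
  --work-dir /models/meta-Llama-3-8B-xtuner-trainer/demo-ai-xtuner-pod/train-work \
  config.py --launcher pytorch
```

`--redirects=3` writes both stdout and stderr of every rank into files under `--log_dir`; using `--tee=3` instead keeps the console output as well. This only captures the workers' console output; mmengine itself still writes its own rank0 log file.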
apachemycat commented 3 months ago

It also hangs when launching with DeepSpeed.

06/27 08:41:09 - mmengine - DEBUG - An `FlexibleRunner` instance is built from registry, and its implementation can be found in mmengine.runner._flexible_runner
06/27 08:41:09 - mmengine - INFO - xtuner_dataset_timeout = 1:00:00

[rank1]:[E ProcessGroupNCCL.cpp:523] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800013 milliseconds before timing out.
demo-ai-xtuner-pod:227:287 [0] NCCL INFO [Service thread] Connection closed by localRank 1
demo-ai-xtuner-pod:228:286 [1] NCCL INFO [Service thread] Connection closed by localRank 1
demo-ai-xtuner-pod:228:280 [0] NCCL INFO comm 0x55e51d238c10 rank 1 nranks 2 cudaDev 1 busId 1a000 - Abort COMPLETE
[rank1]:[E ProcessGroupNCCL.cpp:537] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E ProcessGroupNCCL.cpp:543] To avoid data inconsistency, we are taking the entire process down.
[rank1]:[E ProcessGroupNCCL.cpp:1182] [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800013 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f0e76b81d87 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e6 (0x7f0e2c5d66e6 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x19d (0x7f0e2c5d9c3d in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7f0e2c5da839 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xdc253 (0x7f0e762b0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: + 0x94ac3 (0x7f0e7bd16ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #6: clone + 0x44 (0x7f0e7bda7bf4 in /usr/lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
what(): [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800013 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f0e76b81d87 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e6 (0x7f0e2c5d66e6 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x19d (0x7f0e2c5d9c3d in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7f0e2c5da839 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xdc253 (0x7f0e762b0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: + 0x94ac3 (0x7f0e7bd16ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #6: clone + 0x44 (0x7f0e7bda7bf4 in /usr/lib/x86_64-linux-gnu/libc.so.6)
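This is not a verified fix, but one observation from both runs: the very first collective (OpType=BROADCAST, SeqNum=1) never completes even though NCCL init reports "Init COMPLETE" over the P2P/CUMEM path. A common next step is to rule out broken GPU peer-to-peer access inside the pod using standard NCCL environment variables; whether this resolves this particular hang is an assumption to verify:

```bash
# Diagnostic sketch, not a confirmed solution for this issue.
export NCCL_DEBUG=INFO        # more verbose NCCL logging on every rank
export NCCL_P2P_DISABLE=1     # force NCCL off the P2P/CUMEM path (often broken
                              # in VMs/containers with ACS/IOMMU enabled)
# export NCCL_SHM_DISABLE=1   # optionally also rule out /dev/shm transport issues
xtuner train config.py \
  --work-dir /models/meta-Llama-3-8B-xtuner-trainer/demo-ai-xtuner-pod/train-work
```

If training proceeds with NCCL_P2P_DISABLE=1, the hang is a P2P/IOMMU problem on the host or pod rather than anything in the xtuner config; `nvidia-smi topo -m` and the node's ACS settings are then worth inspecting.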
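It can also help to take xtuner/mmengine out of the picture and check whether a bare torch.distributed broadcast between the two GPUs completes at all. A minimal sketch (the script path and port below are arbitrary):

```bash
cat > /tmp/nccl_broadcast_check.py <<'EOF'
# Minimal NCCL check: the same kind of single-element broadcast that times out
# in the xtuner run, with nothing else involved.
import os
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")          # reads env vars set by torchrun
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

x = torch.tensor([dist.get_rank()], device="cuda")
dist.broadcast(x, src=0)                         # rank 1 should receive 0
torch.cuda.synchronize()
print(f"rank {dist.get_rank()}: broadcast OK, value={x.item()}", flush=True)
dist.destroy_process_group()
EOF

torchrun --nnodes=1 --nproc_per_node=2 \
  --master_addr=127.0.0.1 --master_port=29501 /tmp/nccl_broadcast_check.py
```

If this also hangs and dies with the same watchdog timeout, the problem is the environment (P2P, shared memory, or container networking) rather than the training config; if it passes, the hang is more likely happening during dataset/model setup on one of the ranks.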

bjzhb666 commented 1 month ago

How did you solve this in the end? Thanks.