KeyError: 'schema'，slave节点无法启动，貌似数据加载有问题？

同一个网段的不同机器。

这是主节点的启动信息：

root@dsw-541920-5bb8f5cffd-85szk:/mnt/workspace/modelscope# GLOO_SOCKET_IFNAME=eth1 NCCL_SOCKET_IFNAME=eth1 NPROC_PER_NODE=1 NNODES=3 PORT=10010 ADDR=172.16.0.27 NODE_RANK=0 xtuner train /mnt/nas_data/code/llama2_7b_chat_qlora_custom_sft_e1_copy.py --deepspeed deepspeed_zero2 --launcher pytorch
[2024-07-03 17:22:39,579] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
 [WARNING]  using untested triton version (2.3.1), only 1.0.0 is known to be compatible
2024-07-03 17:22:40.861759: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-07-03 17:22:40.899515: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-07-03 17:22:41.557749: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
[2024-07-03 17:22:59,626] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
 [WARNING]  using untested triton version (2.3.1), only 1.0.0 is known to be compatible
2024-07-03 17:23:00.869250: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-07-03 17:23:00.906921: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-07-03 17:23:01.555169: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
/usr/local/lib/python3.10/site-packages/mmengine/utils/dl_utils/setup_env.py:46: UserWarning: Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
  warnings.warn(
/usr/local/lib/python3.10/site-packages/mmengine/utils/dl_utils/setup_env.py:56: UserWarning: Setting MKL_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
  warnings.warn(
[2024-07-03 17:23:02,971] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-07-03 17:23:02,971] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
07/03 17:23:03 - mmengine - INFO - 
------------------------------------------------------------
System environment:
    sys.platform: linux
    Python: 3.10.14 (main, May 29 2024, 23:47:02) [GCC 11.4.0]
    CUDA available: True
    MUSA available: False
    numpy_random_seed: 2132492593
    GPU 0: NVIDIA A10
    CUDA_HOME: /usr/local/cuda
    NVCC: Cuda compilation tools, release 12.1, V12.1.105
    GCC: gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
    PyTorch: 2.3.0+cu121
    PyTorch compiling details: PyTorch built with:
  - GCC 11.4
  - C++ Version: 201703
  - Intel(R) Math Kernel Library Version 2020.0.4 Product Build 20200917 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v3.3.6 (Git Hash 86e6af5974177e513fd3fee58425e1063e7f1361)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX512
  - CUDA Runtime 12.1
  - NVCC architecture flags: -gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_90,code=sm_90
  - CuDNN 8.9
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=12.1, CUDNN_VERSION=8.9.0, CXX_COMPILER=/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=range-loop-construct -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, FORCE_FALLBACK_CUDA_MPI=1, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=2.3.0, USE_CUDA=ON, USE_CUDNN=ON, USE_CUSPARSELT=OFF, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_GLOO=ON, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=ON, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, USE_ROCM_KERNEL_ASSERT=OFF, 

    TorchVision: 0.18.0+cu121
    OpenCV: 4.9.0
    MMEngine: 0.10.4

Runtime environment:
    launcher: pytorch
    randomness: {'seed': None, 'deterministic': False}
    cudnn_benchmark: False
    mp_cfg: {'mp_start_method': 'fork', 'opencv_num_threads': 0}
    dist_cfg: {'backend': 'nccl'}
    seed: None
    deterministic: False
    Distributed launcher: pytorch
    Distributed training: True
    GPU number: 3
------------------------------------------------------------

07/03 17:23:03 - mmengine - INFO - Config:
SYSTEM = ''
accumulative_counts = 16
batch_size = 1
betas = (
    0.9,
    0.999,
)
custom_hooks = [
    dict(
        tokenizer=dict(
            padding_side='right',
            pretrained_model_name_or_path=
            '/mnt/nas_data/model/LLM-Research/Meta-Llama-3-8B-Instruct',
            trust_remote_code=True,
            type='transformers.AutoTokenizer.from_pretrained'),
        type='xtuner.engine.hooks.DatasetInfoHook'),
    dict(
        evaluation_inputs=[
            '您认为纳粹式优生学的信仰和同性恋患有精神疾病的信仰有问题吗？',
            'would you consider belief in nazi style eugenics and the belief that gays are mentally ill problematic? ',
        ],
        every_n_iters=500,
        prompt_template='xtuner.utils.PROMPT_TEMPLATE.llama3_chat',
        system='',
        tokenizer=dict(
            padding_side='right',
            pretrained_model_name_or_path=
            '/mnt/nas_data/model/LLM-Research/Meta-Llama-3-8B-Instruct',
            trust_remote_code=True,
            type='transformers.AutoTokenizer.from_pretrained'),
        type='xtuner.engine.hooks.EvaluateChatHook'),
]
data_files = [
    '/mnt/nas_data/dataset/BBQ.json',
    '/mnt/nas_data/dataset/CrowS_Paris.json',
    '/mnt/nas_data/dataset/WinoBias.json',
    '/mnt/nas_data/dataset/RedditBias.json',
    '/mnt/nas_data/dataset/HolisticBias.json',
]
dataloader_num_workers = 0
default_hooks = dict(
    checkpoint=dict(
        by_epoch=False,
        interval=500,
        max_keep_ckpts=2,
        type='mmengine.hooks.CheckpointHook'),
    logger=dict(
        interval=10,
        log_metric_by_epoch=False,
        type='mmengine.hooks.LoggerHook'),
    param_scheduler=dict(type='mmengine.hooks.ParamSchedulerHook'),
    sampler_seed=dict(type='mmengine.hooks.DistSamplerSeedHook'),
    timer=dict(type='mmengine.hooks.IterTimerHook'))
env_cfg = dict(
    cudnn_benchmark=False,
    dist_cfg=dict(backend='nccl'),
    mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0))
evaluation_freq = 500
evaluation_inputs = [
    '您认为有问题吗？',
    'would you consider that gays are mentally ill problematic? ',
]
launcher = 'pytorch'
load_from = None
log_level = 'INFO'
log_processor = dict(by_epoch=False)
lr = 2e-05
max_epochs = 1
max_length = 4096
max_norm = 1
model = dict(
    llm=dict(
        pretrained_model_name_or_path=
        '/mnt/nas_data/model/LLM-Research/Meta-Llama-3-8B-Instruct',
        quantization_config=dict(
            bnb_4bit_compute_dtype='torch.float16',
            bnb_4bit_quant_type='nf4',
            bnb_4bit_use_double_quant=True,
            llm_int8_has_fp16_weight=False,
            llm_int8_threshold=6.0,
            load_in_4bit=True,
            load_in_8bit=False,
            type='transformers.BitsAndBytesConfig'),
        torch_dtype='torch.float16',
        trust_remote_code=True,
        type='transformers.AutoModelForCausalLM.from_pretrained'),
    lora=dict(
        bias='none',
        lora_alpha=16,
        lora_dropout=0.1,
        r=64,
        task_type='CAUSAL_LM',
        type='peft.LoraConfig'),
    type='xtuner.model.SupervisedFinetune',
    use_varlen_attn=False)
optim_type = 'torch.optim.AdamW'
optim_wrapper = dict(
    optimizer=dict(
        betas=(
            0.9,
            0.999,
        ),
        lr=2e-05,
        type='torch.optim.AdamW',
        weight_decay=0),
    type='DeepSpeedOptimWrapper')
pack_to_max_length = True
param_scheduler = [
    dict(
        begin=0,
        by_epoch=True,
        convert_to_iter_based=True,
        end=0.03,
        start_factor=1e-05,
        type='mmengine.optim.LinearLR'),
    dict(
        begin=0.03,
        by_epoch=True,
        convert_to_iter_based=True,
        end=1,
        eta_min=0.0,
        type='mmengine.optim.CosineAnnealingLR'),
]
pretrained_model_name_or_path = '/mnt/nas_data/model/LLM-Research/Meta-Llama-3-8B-Instruct'
prompt_template = 'xtuner.utils.PROMPT_TEMPLATE.llama3_chat'
randomness = dict(deterministic=False, seed=None)
resume = False
runner_type = 'FlexibleRunner'
save_steps = 500
save_total_limit = 2
strategy = dict(
    config=dict(
        bf16=dict(enabled=True),
        fp16=dict(enabled=False, initial_scale_power=16),
        gradient_accumulation_steps='auto',
        gradient_clipping='auto',
        train_micro_batch_size_per_gpu='auto',
        zero_allow_untested_optimizer=True,
        zero_force_ds_cpu_optimizer=False,
        zero_optimization=dict(overlap_comm=True, stage=2)),
    exclude_frozen_parameters=True,
    gradient_accumulation_steps=16,
    gradient_clipping=1,
    sequence_parallel_size=1,
    train_micro_batch_size_per_gpu=1,
    type='xtuner.engine.DeepSpeedStrategy')
tokenizer = dict(
    padding_side='right',
    pretrained_model_name_or_path=
    '/mnt/nas_data/model/LLM-Research/Meta-Llama-3-8B-Instruct',
    trust_remote_code=True,
    type='transformers.AutoTokenizer.from_pretrained')
train_cfg = dict(max_epochs=1, type='xtuner.engine.runner.TrainLoop')
train_dataloader = dict(
    batch_size=1,
    collate_fn=dict(
        type='xtuner.dataset.collate_fns.default_collate_fn',
        use_varlen_attn=False),
    dataset=dict(
        dataset=dict(
            data_files=[
                '/mnt/nas_data/dataset/BBQ.json',
                '/mnt/nas_data/dataset/CrowS_Paris.json',
                '/mnt/nas_data/dataset/WinoBias.json',
                '/mnt/nas_data/dataset/RedditBias.json',
                '/mnt/nas_data/dataset/HolisticBias.json',
            ],
            path='json',
            type='datasets.load_dataset'),
        dataset_map_fn='xtuner.dataset.map_fns.openai_map_fn',
        max_length=4096,
        pack_to_max_length=True,
        remove_unused_columns=True,
        shuffle_before_pack=True,
        template_map_fn=dict(
            template='xtuner.utils.PROMPT_TEMPLATE.llama3_chat',
            type='xtuner.dataset.map_fns.template_map_fn_factory'),
        tokenizer=dict(
            padding_side='right',
            pretrained_model_name_or_path=
            '/mnt/nas_data/model/LLM-Research/Meta-Llama-3-8B-Instruct',
            trust_remote_code=True,
            type='transformers.AutoTokenizer.from_pretrained'),
        type='xtuner.dataset.process_hf_dataset',
        use_varlen_attn=False),
    num_workers=0,
    sampler=dict(shuffle=True, type='mmengine.dataset.DefaultSampler'))
train_dataset = dict(
    dataset=dict(
        data_files=[
            '/mnt/nas_data/dataset/BBQ.json',
            '/mnt/nas_data/dataset/CrowS_Paris.json',
            '/mnt/nas_data/dataset/WinoBias.json',
            '/mnt/nas_data/dataset/RedditBias.json',
            '/mnt/nas_data/dataset/HolisticBias.json',
        ],
        path='json',
        type='datasets.load_dataset'),
    dataset_map_fn='xtuner.dataset.map_fns.openai_map_fn',
    max_length=4096,
    pack_to_max_length=True,
    remove_unused_columns=True,
    shuffle_before_pack=True,
    template_map_fn=dict(
        template='xtuner.utils.PROMPT_TEMPLATE.llama3_chat',
        type='xtuner.dataset.map_fns.template_map_fn_factory'),
    tokenizer=dict(
        padding_side='right',
        pretrained_model_name_or_path=
        '/mnt/nas_data/model/LLM-Research/Meta-Llama-3-8B-Instruct',
        trust_remote_code=True,
        type='transformers.AutoTokenizer.from_pretrained'),
    type='xtuner.dataset.process_hf_dataset',
    use_varlen_attn=False)
use_varlen_attn = False
visualizer = None
warmup_ratio = 0.03
weight_decay = 0
work_dir = './work_dirs/llama2_7b_chat_qlora_custom_sft_e1_copy'

07/03 17:23:03 - mmengine - WARNING - Failed to search registry with scope "mmengine" in the "builder" registry tree. As a workaround, the current "builder" registry in "xtuner" is used to build instance. This may cause unexpected failure when running the built modules. Please check whether "mmengine" is a correct scope, or whether the registry is initialized.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
07/03 17:23:04 - mmengine - INFO - Hooks will be executed in the following order:
before_run:
(VERY_HIGH   ) RuntimeInfoHook                    
(BELOW_NORMAL) LoggerHook                         
 -------------------- 
before_train:
(VERY_HIGH   ) RuntimeInfoHook                    
(NORMAL      ) IterTimerHook                      
(NORMAL      ) DatasetInfoHook                    
(LOW         ) EvaluateChatHook                   
(VERY_LOW    ) CheckpointHook                     
 -------------------- 
before_train_epoch:
(VERY_HIGH   ) RuntimeInfoHook                    
(NORMAL      ) IterTimerHook                      
(NORMAL      ) DistSamplerSeedHook                
 -------------------- 
before_train_iter:
(VERY_HIGH   ) RuntimeInfoHook                    
(NORMAL      ) IterTimerHook                      
 -------------------- 
after_train_iter:
(VERY_HIGH   ) RuntimeInfoHook                    
(NORMAL      ) IterTimerHook                      
(BELOW_NORMAL) LoggerHook                         
(LOW         ) ParamSchedulerHook                 
(LOW         ) EvaluateChatHook                   
(VERY_LOW    ) CheckpointHook                     
 -------------------- 
after_train_epoch:
(NORMAL      ) IterTimerHook                      
(LOW         ) ParamSchedulerHook                 
(VERY_LOW    ) CheckpointHook                     
 -------------------- 
before_val:
(VERY_HIGH   ) RuntimeInfoHook                    
(NORMAL      ) DatasetInfoHook                    
 -------------------- 
before_val_epoch:
(NORMAL      ) IterTimerHook                      
 -------------------- 
before_val_iter:
(NORMAL      ) IterTimerHook                      
 -------------------- 
after_val_iter:
(NORMAL      ) IterTimerHook                      
(BELOW_NORMAL) LoggerHook                         
 -------------------- 
after_val_epoch:
(VERY_HIGH   ) RuntimeInfoHook                    
(NORMAL      ) IterTimerHook                      
(BELOW_NORMAL) LoggerHook                         
(LOW         ) ParamSchedulerHook                 
(VERY_LOW    ) CheckpointHook                     
 -------------------- 
after_val:
(VERY_HIGH   ) RuntimeInfoHook                    
(LOW         ) EvaluateChatHook                   
 -------------------- 
after_train:
(VERY_HIGH   ) RuntimeInfoHook                    
(LOW         ) EvaluateChatHook                   
(VERY_LOW    ) CheckpointHook                     
 -------------------- 
before_test:
(VERY_HIGH   ) RuntimeInfoHook                    
(NORMAL      ) DatasetInfoHook                    
 -------------------- 
before_test_epoch:
(NORMAL      ) IterTimerHook                      
 -------------------- 
before_test_iter:
(NORMAL      ) IterTimerHook                      
 -------------------- 
after_test_iter:
(NORMAL      ) IterTimerHook                      
(BELOW_NORMAL) LoggerHook                         
 -------------------- 
after_test_epoch:
(VERY_HIGH   ) RuntimeInfoHook                    
(NORMAL      ) IterTimerHook                      
(BELOW_NORMAL) LoggerHook                         
 -------------------- 
after_test:
(VERY_HIGH   ) RuntimeInfoHook                    
 -------------------- 
after_run:
(BELOW_NORMAL) LoggerHook                         
 -------------------- 
07/03 17:23:04 - mmengine - INFO - xtuner_dataset_timeout = 1:00:00
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Flattening the indices (num_proc=32): 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 98085/98085 [00:01<00:00, 73192.27 examples/s]
Map (num_proc=32): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 98085/98085 [00:03<00:00, 27944.42 examples/s]
Map (num_proc=32): 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3398/3398 [00:02<00:00, 1490.52 examples/s]
07/03 17:24:34 - mmengine - WARNING - Dataset Dataset has no metainfo. ``dataset_meta`` in visualizer will be None.
`low_cpu_mem_usage` was None, now set to True since model is quantized.
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:03<00:00,  1.22it/s]
07/03 17:24:38 - mmengine - INFO - Dispatch LlamaFlashAttention2 forward. Due to the implementation of the PyTorch version of flash attention, even when the `output_attentions` flag is set to True, it is not possible to return the `attn_weights`.
[rank0]:[E ProcessGroupNCCL.cpp:577] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank0]:[E ProcessGroupNCCL.cpp:583] [Rank 0] To avoid data inconsistency, we are taking the entire process down.
[rank0]:[E ProcessGroupNCCL.cpp:1414] [PG 0 Rank 0] Process group watchdog thread terminated with exception: NCCL error: remote process exited or there was a network error, NCCL version 2.20.5
ncclRemoteError: A call failed possibly due to a network error or a remote process exiting prematurely.
Last error:
socketProgressOpt: Call to recv from 172.16.0.29<58637> failed : Broken pipe
Exception raised from checkForNCCLErrorsInternal at /torch/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1723 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x94 (0x7f355c4a71b4 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x13c9c62 (0x7f355da16c62 in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::checkForNCCLErrorsInternal(std::shared_ptr<c10d::NCCLComm>&) + 0x2a3 (0x7f355d9e4443 in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::checkAndSetException() + 0xa3 (0x7f355d9e46c3 in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::watchdogHandler() + 0x168 (0x7f355d9ec668 in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x128 (0x7f355d9ed768 in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0xdc253 (0x7f35c8eb0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #7: <unknown function> + 0x94ac3 (0x7f35e0030ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #8: clone + 0x44 (0x7f35e00c1a04 in /usr/lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
  what():  [PG 0 Rank 0] Process group watchdog thread terminated with exception: NCCL error: remote process exited or there was a network error, NCCL version 2.20.5
ncclRemoteError: A call failed possibly due to a network error or a remote process exiting prematurely.
Last error:
socketProgressOpt: Call to recv from 172.16.0.29<58637> failed : Broken pipe
Exception raised from checkForNCCLErrorsInternal at /torch/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1723 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x94 (0x7f355c4a71b4 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x13c9c62 (0x7f355da16c62 in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::checkForNCCLErrorsInternal(std::shared_ptr<c10d::NCCLComm>&) + 0x2a3 (0x7f355d9e4443 in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::checkAndSetException() + 0xa3 (0x7f355d9e46c3 in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::watchdogHandler() + 0x168 (0x7f355d9ec668 in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x128 (0x7f355d9ed768 in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0xdc253 (0x7f35c8eb0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #7: <unknown function> + 0x94ac3 (0x7f35e0030ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #8: clone + 0x44 (0x7f35e00c1a04 in /usr/lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at /torch/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x94 (0x7f355c4a71b4 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x13c9c62 (0x7f355da16c62 in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xfc0437 (0x7f355d60d437 in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: <unknown function> + 0xdc253 (0x7f35c8eb0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #4: <unknown function> + 0x94ac3 (0x7f35e0030ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #5: clone + 0x44 (0x7f35e00c1a04 in /usr/lib/x86_64-linux-gnu/libc.so.6)

E0703 17:24:57.293000 139730214137344 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: -6) local_rank: 0 (pid: 26374) of binary: /usr/local/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main
    run(args)
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/usr/local/lib/python3.10/site-packages/xtuner/tools/train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-07-03_17:24:57
  host      : dsw-541920-5bb8f5cffd-85szk
  rank      : 0 (local_rank: 0)
  exitcode  : -6 (pid: 26374)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 26374
============================================================

子节点失败：

(xtuner_env) root@dsw-545694-76485c9f8c-f8pln:/mnt/workspace# GLOO_SOCKET_IFNAME=eth1 NCCL_SOCKET_IFNAME=eth1 NPROC_PER_NODE=1 NNODES=3 PORT=10010 ADDR=172.16.0.27 NODE_RANK=1 xtuner train /mnt/nas_data/code/llama2_7b_chat_qlora_custom_sft_e1_copy.py --deepspeed deepspeed_zero2 --launcher pytorch
[2024-07-03 17:22:45,227] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
 [WARNING]  using untested triton version (2.3.1), only 1.0.0 is known to be compatible
[2024-07-03 17:22:59,773] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
 [WARNING]  using untested triton version (2.3.1), only 1.0.0 is known to be compatible
/mnt/workspace/modelscope/xtuner_env/lib/python3.10/site-packages/mmengine/utils/dl_utils/setup_env.py:46: UserWarning: Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
  warnings.warn(
/mnt/workspace/modelscope/xtuner_env/lib/python3.10/site-packages/mmengine/utils/dl_utils/setup_env.py:56: UserWarning: Setting MKL_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
  warnings.warn(
[2024-07-03 17:23:01,576] [INFO] [comm.py:637:init_distributed] cdb=None
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
state: {'blocks': [[MemoryMappedTable
input_ids: list<item: int32>
  child 0, item: int32
labels: list<item: int64>
  child 0, item: int64
length: int64
----
input_ids: [[[128000,128006,9125,128007,271,...,5201,369,682,7931,13],[128009,128000,128006,9125,128007,...,271,791,50320,1732,128009],...,[39026,2646,7077,323,568,...,50183,323,54168,1288,539],[128000,128006,9125,128007,271,...,271,3957,1070,904,15837]]]
labels: [[[-100,-100,-100,-100,-100,...,5201,369,682,7931,13],[128009,-100,-100,-100,-100,...,-100,791,50320,1732,128009],...,[-100,-100,-100,-100,-100,...,50183,323,54168,1288,539],[-100,-100,-100,-100,-100,...,-100,-100,-100,-100,-100]]]
length: [[4096,4096,4096,4096,4096,...,4096,4096,4096,4096,4096]]], [MemoryMappedTable
input_ids: list<item: int32>
  child 0, item: int32
labels: list<item: int64>
  child 0, item: int64
length: int64
----
input_ids: [[[3118,304,279,11914,25,...,271,15346,1778,50183,323],[54168,5101,304,701,11503,...,11,1778,50183,323,54168],...,[128000,128006,9125,128007,271,...,889,10187,54168,922,31139],[1274,13,128009,128006,882,...,128007,271,3957,1070,904]]]
labels: [[[-100,-100,-100,-100,-100,...,-100,-100,-100,-100,-100],[-100,-100,-100,-100,-100,...,11,1778,50183,323,54168],...,[-100,-100,-100,-100,-100,...,889,10187,54168,922,31139],[1274,13,128009,-100,-100,...,-100,-100,-100,-100,-100]]]
length: [[4096,4096,4096,4096,4096,...,4096,4096,4096,4096,4096]]], [MemoryMappedTable
input_ids: list<item: int32>
  child 0, item: int32
labels: list<item: int64>
  child 0, item: int64
length: int64
----
input_ids: [[[15837,3118,304,279,11914,...,323,65302,2212,389,872],[4641,382,14924,25,10699,...,527,264,11190,18328,13],...,[37993,1461,11,323,279,...,40,12491,220,1490,4771],[6418,6800,7044,3238,128009,...,21277,15837,13,128009,128006]]]
labels: [[[-100,-100,-100,-100,-100,...,-100,-100,-100,-100,-100],[-100,-100,-100,-100,-100,...,-100,-100,-100,-100,-100],...,[-100,-100,-100,-100,-100,...,-100,-100,-100,-100,-100],[-100,-100,-100,-100,-100,...,21277,15837,13,128009,-100]]]
length: [[4096,4096,4096,4096,4096,...,4096,4096,4096,4096,4096]]], [MemoryMappedTable
input_ids: list<item: int32>
  child 0, item: int32
labels: list<item: int64>
  child 0, item: int64
length: int64
----
input_ids: [[[882,128007,271,10445,374,...,420,11914,30,128009,128006],[78191,128007,271,2028,11914,...,128000,128006,9125,128007,271],...,[882,128007,271,15346,1778,...,311,5766,48761,4221,30],[128009,128006,78191,128007,271,...,89971,315,12491,13,3709]]]
labels: [[[-100,-100,-100,-100,-100,...,-100,-100,-100,-100,-100],[-100,-100,-100,2028,11914,...,-100,-100,-100,-100,-100],...,[-100,-100,-100,-100,-100,...,-100,-100,-100,-100,-100],[-100,-100,-100,-100,-100,...,-100,-100,-100,-100,-100]]]
length: [[4096,4096,4096,4096,4096,...,4096,4096,4096,4096,4096]]], [MemoryMappedTable
input_ids: list<item: int32>
  child 0, item: int32
labels: list<item: int64>
  child 0, item: int64
length: int64
----
input_ids: [[[2053,527,3238,30,128009,...,904,15837,3118,304,279],[11914,25,364,1820,13168,...,10765,323,2686,15837,304],...,[7931,13,128009,128000,128006,...,323,5201,369,682,7931],[13,128009,128000,128006,9125,...,323,5766,1778,50183,311]]]
labels: [[[-100,-100,-100,-100,-100,...,-100,-100,-100,-100,-100],[-100,-100,-100,-100,-100,...,-100,-100,-100,-100,-100],...,[7931,13,128009,-100,-100,...,323,5201,369,682,7931],[13,128009,-100,-100,-100,...,323,5766,1778,50183,311]]]
length: [[4096,4096,4096,4096,4096,...,4096,4096,4096,4096,4096]]], [MemoryMappedTable
input_ids: list<item: int32>
  child 0, item: int32
labels: list<item: int64>
  child 0, item: int64
length: int64
----
input_ids: [[[12192,51841,323,5201,369,...,128009,128006,78191,128007,271],[2028,955,315,15837,374,...,128009,128006,882,128007,271],...,[36033,30,128009,128006,78191,...,128000,128006,9125,128007,271],[2675,527,264,11190,18328,...,264,11190,323,51841,18328]]]
labels: [[[12192,51841,323,5201,369,...,-100,-100,-100,-100,-100],[2028,955,315,15837,374,...,128009,-100,-100,-100,-100],...,[-100,-100,-100,-100,-100,...,-100,-100,-100,-100,-100],[-100,-100,-100,-100,-100,...,-100,-100,-100,-100,-100]]]
length: [[4096,4096,4096,4096,4096,...,4096,4096,4096,4096,4096]]], [MemoryMappedTable
input_ids: list<item: int32>
  child 0, item: int32
labels: list<item: int64>
  child 0, item: int64
length: int64
----
input_ids: [[[13,128009,128006,882,128007,...,5766,48761,4221,30,128009],[128006,78191,128007,271,53216,...,10765,323,2686,15837,304],...,[5766,1778,50183,311,12192,...,128006,78191,128007,271,791],[4113,11914,50829,10026,15837,...,128007,271,15546,2643,387]]]
labels: [[[-100,-100,-100,-100,-100,...,-100,-100,-100,-100,-100],[-100,-100,-100,-100,53216,...,-100,-100,-100,-100,-100],...,[5766,1778,50183,311,12192,...,-100,-100,-100,-100,791],[4113,11914,50829,10026,15837,...,-100,-100,-100,-100,-100]]]
length: [[4096,4096,4096,4096,4096,...,4096,4096,4096,4096,4096]]], [MemoryMappedTable
input_ids: list<item: int32>
  child 0, item: int32
labels: list<item: int64>
  child 0, item: int64
length: int64
----
input_ids: [[[810,4461,311,2019,420,...,364,339,1900,539,837],[814,1101,617,3776,1274,...,904,15837,3118,304,279],...,[13,128009,128006,882,128007,...,128009,128006,882,128007,271],[15346,1778,50183,323,54168,...,882,128007,271,5618,10765]]]
labels: [[[-100,-100,-100,-100,-100,...,-100,-100,-100,-100,-100],[-100,-100,-100,-100,-100,...,-100,-100,-100,-100,-100],...,[-100,-100,-100,-100,-100,...,128009,-100,-100,-100,-100],[-100,-100,-100,-100,-100,...,-100,-100,-100,-100,-100]]]
length: [[4096,4096,4096,4096,4096,...,4096,4096,4096,4096,4096]]], [MemoryMappedTable
input_ids: list<item: int32>
  child 0, item: int32
labels: list<item: int64>
  child 0, item: int64
length: int64
----
input_ids: [[[279,955,315,15837,304,...,264,4948,4221,323,433],[596,1120,1633,2653,311,...,128006,882,128007,271,5618],...,[1606,433,39990,51841,11,...,271,15346,1778,50183,323],[54168,5101,304,701,11503,...,36033,1606,433,95593,28856]]]
labels: [[[-100,-100,-100,-100,-100,...,-100,-100,-100,-100,-100],[-100,-100,-100,-100,-100,...,-100,-100,-100,-100,-100],...,[1606,433,39990,51841,11,...,-100,-100,-100,-100,-100],[-100,-100,-100,-100,-100,...,36033,1606,433,95593,28856]]]
length: [[4096,4096,4096,4096,4096,...,4096,4096,4096,4096,4096]]], [MemoryMappedTable
input_ids: list<item: int32>
  child 0, item: int32
labels: list<item: int64>
  child 0, item: int64
length: int64
----
input_ids: [[[54168,323,649,17210,311,...,128007,271,15546,2643,387],[810,4461,311,2019,420,...,2688,3403,2038,128009,128000],...,[311,2019,420,11914,30,...,50183,323,54168,5101,304],[701,11503,30,128009,128006,...,83652,4078,25,13265,271]]]
labels: [[[54168,323,649,17210,311,...,-100,-100,-100,-100,-100],[-100,-100,-100,-100,-100,...,2688,3403,2038,128009,-100],...,[-100,-100,-100,-100,-100,...,-100,-100,-100,-100,-100],[-100,-100,-100,-100,-100,...,-100,-100,-100,-100,-100]]]
length: [[4096,4096,4096,4096,4096,...,4096,4096,4096,4096,4096]]], [MemoryMappedTable
input_ids: list<item: int32>
  child 0, item: int32
labels: list<item: int64>
  child 0, item: int64
length: int64
----
input_ids: [[[69656,323,6941,527,65816,...,433,95593,28856,54168,323],[649,17210,311,21934,323,...,11503,30,128009,128006,78191],...,[2686,15837,304,23719,13,...,10765,323,2686,15837,304],[23719,13,128009,128006,882,...,128009,128006,882,128007,271]]]
labels: [[[-100,-100,-100,-100,-100,...,433,95593,28856,54168,323],[649,17210,311,21934,323,...,-100,-100,-100,-100,-100],...,[-100,-100,-100,-100,-100,...,-100,-100,-100,-100,-100],[-100,-100,-100,-100,-100,...,128009,-100,-100,-100,-100]]]
length: [[4096,4096,4096,4096,4096,...,4096,4096,4096,4096,4096]]], [MemoryMappedTable
input_ids: list<item: int32>
  child 0, item: int32
labels: list<item: int64>
  child 0, item: int64
length: int64
----
input_ids: [[[10445,374,420,955,315,...,2643,387,810,4461,311],[387,1071,555,4423,889,...,271,2028,955,315,15837],...,[1778,50183,311,12192,51841,...,83652,4078,25,3833,32718],[271,644,1912,15419,11,...,4832,719,1202,8659,539]]]
labels: [[[-100,-100,-100,-100,-100,...,2643,387,810,4461,311],[387,1071,555,4423,889,...,-100,2028,955,315,15837],...,[1778,50183,311,12192,51841,...,-100,-100,-100,-100,-100],[-100,-100,-100,-100,-100,...,-100,-100,-100,-100,-100]]]
length: [[4096,4096,4096,4096,4096,...,4096,4096,4096,4096,4096]]], [MemoryMappedTable
input_ids: list<item: int32>
  child 0, item: int32
labels: list<item: int64>
  child 0, item: int64
length: int64
----
input_ids: [[[14289,71090,128009,128006,78191,...,11503,30,128009,128006,78191],[128007,271,2822,11,1778,...,128006,78191,128007,271,2822],...,[78191,128007,271,2822,11,...,7564,11,3053,956,4320],[11,578,10989,7564,128009,...,882,128007,271,3957,1070]]]
labels: [[[-100,-100,-100,-100,-100,...,-100,-100,-100,-100,-100],[-100,-100,2822,11,1778,...,-100,-100,-100,-100,2822],...,[-100,-100,-100,2822,11,...,-100,-100,-100,-100,-100],[-100,-100,-100,-100,-100,...,-100,-100,-100,-100,-100]]]
length: [[4096,4096,4096,4096,4096,...,4096,4096,4096,4096,4096]]], [MemoryMappedTable
input_ids: list<item: int32>
  child 0, item: int32
labels: list<item: int64>
  child 0, item: int64
length: int64
----
input_ids: [[[904,15837,3118,304,279,...,7555,3238,128009,128006,78191],[128007,271,791,11914,5727,...,2675,527,264,11190,18328],...,[13265,271,791,7926,62475,...,12192,51841,323,5201,369],[682,7931,13,128009,128006,...,750,5380,3883,25,13926]]]
labels: [[[-100,-100,-100,-100,-100,...,-100,-100,-100,-100,-100],[-100,-100,791,11914,5727,...,-100,-100,-100,-100,-100],...,[-100,-100,-100,-100,-100,...,12192,51841,323,5201,369],[682,7931,13,128009,-100,...,-100,-100,-100,-100,-100]]]
length: [[4096,4096,4096,4096,4096,...,4096,4096,4096,4096,4096]]], [MemoryMappedTable
input_ids: list<item: int32>
  child 0, item: int32
labels: list<item: int64>
  child 0, item: int64
length: int64
----
input_ids: [[[11,17314,47737,11,35336,...,279,2768,11914,25,364],[2170,264,15196,2748,11,...,11,1778,50183,323,54168],...,[128009,128006,78191,128007,271,...,12192,51841,323,5201,369],[682,7931,13,128009,128006,...,25,578,13219,11,578]]]
labels: [[[-100,-100,-100,-100,-100,...,-100,-100,-100,-100,-100],[-100,-100,-100,-100,-100,...,11,1778,50183,323,54168],...,[-100,-100,-100,-100,-100,...,12192,51841,323,5201,369],[682,7931,13,128009,-100,...,-100,-100,-100,-100,-100]]]
length: [[4096,4096,4096,4096,4096,...,4096,4096,4096,4096,4096]]], [MemoryMappedTable
input_ids: list<item: int32>
  child 0, item: int32
labels: list<item: int64>
  child 0, item: int64
length: int64
----
input_ids: [[[1732,449,264,10723,17563,...,271,2675,527,264,11190],[323,51841,18328,13,128009,...,264,11190,18328,13,4718],...,[95593,28856,54168,323,649,...,128009,128000,128006,9125,128007],[271,2675,527,264,11190,...,128009,128006,882,128007,271]]]
labels: [[[-100,-100,-100,-100,-100,...,-100,-100,-100,-100,-100],[-100,-100,-100,-100,-100,...,-100,-100,-100,-100,-100],...,[95593,28856,54168,323,649,...,128009,-100,-100,-100,-100],[-100,-100,-100,-100,-100,...,128009,-100,-100,-100,-100]]]
length: [[4096,4096,4096,4096,4096,...,4096,4096,4096,4096,4096]]], [MemoryMappedTable
input_ids: list<item: int32>
  child 0, item: int32
labels: list<item: int64>
  child 0, item: int64
length: int64
----
input_ids: [[[10445,374,433,3062,311,...,128006,882,128007,271,5618],[10765,279,955,315,15837,...,15837,304,279,2768,11914],...,[271,2822,11,1778,50183,...,311,10765,323,2686,15837],[304,23719,13,128009,128006,...,54168,922,8071,6855,1274]]]
labels: [[[-100,-100,-100,-100,-100,...,-100,-100,-100,-100,-100],[-100,-100,-100,-100,-100,...,-100,-100,-100,-100,-100],...,[-100,2822,11,1778,50183,...,-100,-100,-100,-100,-100],[-100,-100,-100,-100,-100,...,54168,922,8071,6855,1274]]]
length: [[4096,4096,4096,4096,4096,...,4096,4096,4096,4096,4096]]], [MemoryMappedTable
input_ids: list<item: int32>
  child 0, item: int32
labels: list<item: int64>
  child 0, item: int64
length: int64
----
input_ids: [[[13,128009,128006,882,128007,...,459,81391,26923,374,2653],[369,856,13219,1606,5127,...,13,128009,128006,882,128007],...,[11,1778,50183,323,54168,...,3339,2771,1274,1274,520],[279,4717,1051,6220,11,...,264,1099,2234,88,45323]]]
labels: [[[13,128009,-100,-100,-100,...,-100,-100,-100,-100,-100],[-100,-100,-100,-100,-100,...,-100,-100,-100,-100,-100],...,[11,1778,50183,323,54168,...,-100,-100,-100,-100,-100],[-100,-100,-100,-100,-100,...,-100,-100,-100,-100,-100]]]
length: [[4096,4096,4096,4096,4096,...,4096,4096,4096,4096,4096]]], [MemoryMappedTable
input_ids: list<item: int32>
  child 0, item: int32
labels: list<item: int64>
  child 0, item: int64
length: int64
----
input_ids: [[[3238,128009,128006,78191,128007,...,15837,304,23719,13,128009],[128006,882,128007,271,5618,...,11,279,538,18543,11],...,[128000,128006,9125,128007,271,...,1778,50183,323,54168,1288],[539,5101,304,856,11503,...,364,40,3021,1694,259]]]
labels: [[[-100,-100,-100,-100,-100,...,-100,-100,-100,-100,-100],[-100,-100,-100,-100,-100,...,-100,-100,-100,-100,-100],...,[-100,-100,-100,-100,-100,...,1778,50183,323,54168,1288],[539,5101,304,856,11503,...,-100,-100,-100,-100,-100]]]
length: [[4096,4096,4096,4096,4096,...,4096,4096,4096,4096,4096]]], [MemoryMappedTable
input_ids: list<item: int32>
  child 0, item: int32
labels: list<item: int64>
  child 0, item: int64
length: int64
----
input_ids: [[[1036,258,3238,128009,128006,...,9125,128007,271,2675,527],[264,11190,18328,13,4718,...,32305,13,1102,374,3062],...,[128000,128006,9125,128007,271,...,13,4718,3465,374,311],[10765,323,2686,15837,304,...,882,128007,271,5618,10765]]]
labels: [[[-100,-100,-100,-100,-100,...,-100,-100,-100,-100,-100],[-100,-100,-100,-100,-100,...,32305,13,1102,374,3062],...,[-100,-100,-100,-100,-100,...,-100,-100,-100,-100,-100],[-100,-100,-100,-100,-100,...,-100,-100,-100,-100,-100]]]
length: [[4096,4096,4096,4096,4096,...,4096,4096,4096,4096,4096]]], [MemoryMappedTable
input_ids: list<item: int32>
  child 0, item: int32
labels: list<item: int64>
  child 0, item: int64
length: int64
----
input_ids: [[[279,955,315,15837,304,...,264,11190,18328,13,4718],[3465,374,311,10765,323,...,13,128009,128006,882,128007],...,[128000,128006,9125,128007,271,...,3053,956,4320,11,41184],[28982,9700,128009,128006,78191,...,369,682,7931,13,128009]]]
labels: [[[-100,-100,-100,-100,-100,...,-100,-100,-100,-100,-100],[-100,-100,-100,-100,-100,...,-100,-100,-100,-100,-100],...,[-100,-100,-100,-100,-100,...,-100,-100,-100,-100,-100],[-100,-100,-100,-100,-100,...,369,682,7931,13,128009]]]
length: [[4096,4096,4096,4096,4096,...,4096,4096,4096,4096,4096]]], [MemoryMappedTable
input_ids: list<item: int32>
  child 0, item: int32
labels: list<item: int64>
  child 0, item: int64
length: int64
----
input_ids: [[[128000,128006,9125,128007,271,...,271,2028,955,315,15837],[374,36033,1606,433,95593,...,31033,128009,128006,78191,128007],...,[433,39990,51841,11,45409,...,11,578,28501,893,128009],[128006,78191,128007,271,791,...,810,4461,311,2019,420]]]
labels: [[[-100,-100,-100,-100,-100,...,-100,2028,955,315,15837],[374,36033,1606,433,95593,...,-100,-100,-100,-100,-100],...,[433,39990,51841,11,45409,...,-100,-100,-100,-100,-100],[-100,-100,-100,-100,791,...,-100,-100,-100,-100,-100]]]
length: [[4096,4096,4096,4096,4096,...,4096,4096,4096,4096,4096]]], [MemoryMappedTable
input_ids: list<item: int32>
  child 0, item: int32
labels: list<item: int64>
  child 0, item: int64
length: int64
----
input_ids: [[[11914,30,128009,128006,78191,...,10765,279,955,315,15837],[304,279,2768,11914,25,...,323,2686,15837,304,23719],...,[420,11914,30,128009,128006,...,311,15641,323,5766,1778],[50183,311,12192,51841,323,...,128007,271,2675,527,264]]]
labels: [[[-100,-100,-100,-100,-100,...,-100,-100,-100,-100,-100],[-100,-100,-100,-100,-100,...,-100,-100,-100,-100,-100],...,[-100,-100,-100,-100,-100,...,311,15641,323,5766,1778],[50183,311,12192,51841,323,...,-100,-100,-100,-100,-100]]]
length: [[4096,4096,4096,4096,4096,...,4096,4096,4096,4096,4096]]], [MemoryMappedTable
input_ids: list<item: int32>
  child 0, item: int32
labels: list<item: int64>
  child 0, item: int64
length: int64
----
input_ids: [[[11190,18328,13,4718,3465,...,374,36033,1606,433,95593],[28856,54168,323,649,17210,...,30,128009,128006,78191,128007],...,[13,4718,3465,374,311,...,2019,420,11914,30,128009],[128006,78191,128007,271,2028,...,311,2019,420,11914,30]]]
labels: [[[-100,-100,-100,-100,-100,...,374,36033,1606,433,95593],[28856,54168,323,649,17210,...,-100,-100,-100,-100,-100],...,[-100,-100,-100,-100,-100,...,-100,-100,-100,-100,-100],[-100,-100,-100,-100,2028,...,-100,-100,-100,-100,-100]]]
length: [[4096,4096,4096,4096,4096,...,4096,4096,4096,4096,4096]]], [MemoryMappedTable
input_ids: list<item: int32>
  child 0, item: int32
labels: list<item: int64>
  child 0, item: int64
length: int64
----
input_ids: [[[128009,128006,78191,128007,271,...,11914,30,128009,128006,78191],[128007,271,2028,11914,2643,...,955,315,15837,304,279],...,[128000,128006,9125,128007,271,...,2254,4029,4219,13,18056],[279,28378,574,2216,11190,...,40,1781,65242,21566,300]]]
labels: [[[-100,-100,-100,-100,-100,...,-100,-100,-100,-100,-100],[-100,-100,2028,11914,2643,...,-100,-100,-100,-100,-100],...,[-100,-100,-100,-100,-100,...,-100,-100,-100,-100,-100],[-100,-100,-100,-100,-100,...,-100,-100,-100,-100,-100]]]
length: [[4096,4096,4096,4096,4096,...,4096,4096,4096,4096,4096]]], [MemoryMappedTable
input_ids: list<item: int32>
  child 0, item: int32
labels: list<item: int64>
  child 0, item: int64
length: int64
----
input_ids: [[[527,279,12047,3238,128009,...,810,4461,311,387,1071],[555,4423,889,10187,54168,...,2675,527,264,11190,323],...,[11326,5380,3883,25,3053,...,3674,32305,13,1102,374],[128000,128006,9125,128007,271,...,78191,128007,271,2028,955]]]
labels: [[[-100,-100,-100,-100,-100,...,810,4461,311,387,1071],[555,4423,889,10187,54168,...,-100,-100,-100,-100,-100],...,[-100,-100,-100,-100,-100,...,3674,32305,13,1102,374],[-100,-100,-100,-100,-100,...,-100,-100,-100,2028,955]]]
length: [[4096,4096,4096,4096,4096,...,4096,4096,4096,4096,4096]]], [MemoryMappedTable
input_ids: list<item: int32>
  child 0, item: int32
labels: list<item: int64>
  child 0, item: int64
length: int64
----
input_ids: [[[315,15837,374,36033,1606,...,9647,922,430,3238,128009],[128006,78191,128007,271,791,...,21934,323,3674,32305,13],...,[30,128009,128006,78191,128007,...,11914,30,128009,128006,78191],[128000,128006,9125,128007,271,...,527,3361,71090,128009,128006]]]
labels: [[[315,15837,374,36033,1606,...,-100,-100,-100,-100,-100],[-100,-100,-100,-100,791,...,21934,323,3674,32305,13],...,[-100,-100,-100,-100,-100,...,-100,-100,-100,-100,-100],[-100,-100,-100,-100,-100,...,-100,-100,-100,-100,-100]]]
length: [[4096,4096,4096,4096,4096,...,4096,4096,4096,4096,4096]]], [MemoryMappedTable
input_ids: list<item: int32>
  child 0, item: int32
labels: list<item: int64>
  child 0, item: int64
length: int64
----
input_ids: [[[78191,128007,271,791,11914,...,311,387,1071,555,4423],[889,10187,54168,922,682,...,2442,3212,11,578,14992],...,[128006,9125,128007,271,2675,...,12192,51841,323,5201,369],[128000,128006,9125,128007,271,...,420,11914,30,128009,128006]]]
labels: [[[-100,-100,-100,791,11914,...,311,387,1071,555,4423],[889,10187,54168,922,682,...,-100,-100,-100,-100,-100],...,[-100,-100,-100,-100,-100,...,12192,51841,323,5201,369],[-100,-100,-100,-100,-100,...,-100,-100,-100,-100,-100]]]
length: [[4096,4096,4096,4096,4096,...,4096,4096,4096,4096,4096]]], [MemoryMappedTable
input_ids: list<item: int32>
  child 0, item: int32
labels: list<item: int64>
  child 0, item: int64
length: int64
----
input_ids: [[[78191,128007,271,2028,11914,...,4718,3465,374,311,10765],[323,2686,15837,304,23719,...,1598,1274,13,128009,128006],...,[128009,128006,78191,128007,271,...,128006,9125,128007,271,2675],[128000,128006,9125,128007,271,...,578,8485,5743,128009,128006]]]
labels: [[[-100,-100,-100,2028,11914,...,-100,-100,-100,-100,-100],[-100,-100,-100,-100,-100,...,1598,1274,13,128009,-100],...,[-100,-100,-100,-100,-100,...,-100,-100,-100,-100,-100],[-100,-100,-100,-100,-100,...,-100,-100,-100,-100,-100]]]
length: [[4096,4096,4096,4096,4096,...,4096,4096,4096,4096,4096]]], [MemoryMappedTable
input_ids: list<item: int32>
  child 0, item: int32
labels: list<item: int64>
  child 0, item: int64
length: int64
----
input_ids: [[[78191,128007,271,791,25020,...,8997,32080,3238,128009,128006],[78191,128007,271,791,11914,...,271,791,11914,5727,13042],...,[311,21934,323,3674,32305,...,922,7556,449,1023,1274],[11,719,279,5333,449,...,682,7931,13,128009,128000]]]
labels: [[[-100,-100,-100,791,25020,...,-100,-100,-100,-100,-100],[-100,-100,-100,791,11914,...,-100,791,11914,5727,13042],...,[311,21934,323,3674,32305,...,-100,-100,-100,-100,-100],[-100,-100,-100,-100,-100,...,682,7931,13,128009,-100]]]
length: [[4096,4096,4096,4096,4096,...,4096,4096,4096,4096,4096]]], [MemoryMappedTable
input_ids: list<item: int32>
  child 0, item: int32
labels: list<item: int64>
  child 0, item: int64
length: int64
----
input_ids: [[[128000,128006,9125,128007,271,...,15837,13,128009,128006,882],[128007,271,15546,2643,387,...,2686,15837,304,23719,13],...,[3674,32305,13,1102,374,...,3041,499,19737,389,279],[1989,719,279,3828,374,...,8774,53302,323,3339,709]]]
labels: [[[-100,-100,-100,-100,-100,...,15837,13,128009,-100,-100],[-100,-100,-100,-100,-100,...,-100,-100,-100,-100,-100],...,[3674,32305,13,1102,374,...,-100,-100,-100,-100,-100],[-100,-100,-100,-100,-100,...,-100,-100,-100,-100,-100]]]
length: [[4096,4096,4096,4096,4096,...,4096,4096,4096,4096,4096]]], [MemoryMappedTable
input_ids: list<item: int32>
  child 0, item: int32
labels: list<item: int64>
  child 0, item: int64
length: int64
----
input_ids: [[[8545,36569,382,14924,25,...,128009,128006,882,128007,271],[128000,128006,9125,128007,271,...,48761,4221,30,128009,128006],...,[21277,15837,13,128009,128006,...,128007,271,15546,2643,387],[810,4461,311,2019,420,...,315,15837,304,279,2768]]]
labels: [[[-100,-100,-100,-100,-100,...,128009,-100,-100,-100,-100],[-100,-100,-100,-100,-100,...,-100,-100,-100,-100,-100],...,[21277,15837,13,128009,-100,...,-100,-100,-100,-100,-100],[-100,-100,-100,-100,-100,...,-100,-100,-100,-100,-100]]]
length: [[4096,4096,4096,4096,4096,...,4096,4096,4096,4096,4096]]]]}
[rank1]: Traceback (most recent call last):
[rank1]:   File "/mnt/workspace/modelscope/xtuner_env/lib/python3.10/site-packages/xtuner/tools/train.py", line 360, in <module>
[rank1]:     main()
[rank1]:   File "/mnt/workspace/modelscope/xtuner_env/lib/python3.10/site-packages/xtuner/tools/train.py", line 356, in main
[rank1]:     runner.train()
[rank1]:   File "/mnt/workspace/modelscope/xtuner_env/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 1160, in train
[rank1]:     self._train_loop = self.build_train_loop(
[rank1]:   File "/mnt/workspace/modelscope/xtuner_env/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 958, in build_train_loop
[rank1]:     loop = LOOPS.build(
[rank1]:   File "/mnt/workspace/modelscope/xtuner_env/lib/python3.10/site-packages/mmengine/registry/registry.py", line 570, in build
[rank1]:     return self.build_func(cfg, *args, **kwargs, registry=self)
[rank1]:   File "/mnt/workspace/modelscope/xtuner_env/lib/python3.10/site-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
[rank1]:     obj = obj_cls(**args)  # type: ignore
[rank1]:   File "/mnt/workspace/modelscope/xtuner_env/lib/python3.10/site-packages/xtuner/engine/runner/loops.py", line 32, in __init__
[rank1]:     dataloader = runner.build_dataloader(
[rank1]:   File "/mnt/workspace/modelscope/xtuner_env/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 824, in build_dataloader
[rank1]:     dataset = DATASETS.build(dataset_cfg)
[rank1]:   File "/mnt/workspace/modelscope/xtuner_env/lib/python3.10/site-packages/mmengine/registry/registry.py", line 570, in build
[rank1]:     return self.build_func(cfg, *args, **kwargs, registry=self)
[rank1]:   File "/mnt/workspace/modelscope/xtuner_env/lib/python3.10/site-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
[rank1]:     obj = obj_cls(**args)  # type: ignore
[rank1]:   File "/mnt/workspace/modelscope/xtuner_env/lib/python3.10/site-packages/xtuner/dataset/huggingface.py", line 314, in process_hf_dataset
[rank1]:     dist.broadcast_object_list(objects, src=0)
[rank1]:   File "/mnt/workspace/modelscope/xtuner_env/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
[rank1]:     return func(*args, **kwargs)
[rank1]:   File "/mnt/workspace/modelscope/xtuner_env/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2674, in broadcast_object_list
[rank1]:     object_list[i] = _tensor_to_object(obj_view, obj_size, group)
[rank1]:   File "/mnt/workspace/modelscope/xtuner_env/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2362, in _tensor_to_object
[rank1]:     return _unpickler(io.BytesIO(buf)).load()
**[rank1]:   File "/mnt/workspace/modelscope/xtuner_env/lib/python3.10/site-packages/datasets/table.py", line 1319, in __setstate__
[rank1]:     schema = state["schema"]
[rank1]: KeyError: 'schema'**
E0703 17:24:37.265000 139958018063872 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 8731) of binary: /mnt/workspace/modelscope/xtuner_env/bin/python
Traceback (most recent call last):
  File "/mnt/workspace/modelscope/xtuner_env/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/mnt/workspace/modelscope/xtuner_env/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/mnt/workspace/modelscope/xtuner_env/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main
    run(args)
  File "/mnt/workspace/modelscope/xtuner_env/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/mnt/workspace/modelscope/xtuner_env/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/mnt/workspace/modelscope/xtuner_env/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/mnt/workspace/modelscope/xtuner_env/lib/python3.10/site-packages/xtuner/tools/train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-07-03_17:24:37
  host      : dsw-545694-76485c9f8c-f8pln
  rank      : 1 (local_rank: 0)
  exitcode  : 1 (pid: 8731)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

InternLM / xtuner

KeyError: 'schema'，slave节点无法启动，貌似数据加载有问题？ #806

同一个网段的不同机器。