PaddlePaddle / PaddleNLP

👑 Easy-to-use and powerful NLP and LLM library with 🤗 Awesome model zoo, supporting wide-range of NLP tasks from research to industrial applications, including 🗂Text Classification, 🔍 Neural Search, ❓ Question Answering, ℹ️ Information Extraction, 📄 Document Intelligence, 💌 Sentiment Analysis etc.
https://paddlenlp.readthedocs.io
Apache License 2.0

[Bug]: "Tensor need be reduced must not empty [Hint: Expected x.numel() > 0, but received x.numel():0 <= 0:0.]" error when the llama model's loss == 0 #8299

Closed dynamicheart closed 6 months ago

dynamicheart commented 6 months ago

Software environment

- paddlepaddle-gpu: 
commit: 4ffb7da786cef844deb3cf8ad7f95d56000bd010
cuda: 12.0
cudnn: 8.9.1
- paddlenlp: 
commit: 74bb39b51bef45f32aee310efdb8994042c00bb3

Duplicate issue

Error description

[2024-03-05 08:06:28,678] [    INFO] - loss: 4.23760509, learning_rate: 2.999e-05, global_step: 2310, interval_runtime: 1.1534, interval_samples_per_second: 6.935981184579392, interval_steps_per_second: 0.866997648072424, epoch: 0.0229
[2024-03-05 08:06:29,834] [    INFO] - loss: 4.39690018, learning_rate: 2.999e-05, global_step: 2311, interval_runtime: 1.1555, interval_samples_per_second: 6.923501595186914, interval_steps_per_second: 0.8654376993983642, epoch: 0.0229
LAUNCH INFO 2024-03-05 08:06:34,816 Pod failed
LAUNCH ERROR 2024-03-05 08:06:34,817 Container failed !!!
Container rank 6 status failed cmd ['/usr/bin/python', '-u', 'run_pretrain.py', '--model_type', 'llama', '--model_name_or_path', 'facebook/llama-13b', '--tokenizer_name_or_path', 'facebook/llama-13b', '--input_dir', './data', '--output_dir', 'output/llama_hybrid', '--split', '949,50,1', '--max_seq_length', '2048', '--per_device_train_batch_size', '1', '--per_device_eval_batch_size', '1', '--use_flash_attention', '1', '--use_fused_rope', '1', '--fuse_attention_ffn', '1', '--fuse_attention_qkv', '1', '--use_fused_rms_norm', '1', '--num_hidden_layers', '40', '--bf16', '--fp16_opt_level', 'O2', '--scale_loss', '1024', '--learning_rate', '0.00003', '--min_learning_rate', '0.000005', '--lr_scheduler_type', 'cosine', '--max_steps', '100000', '--save_steps', '100000', '--weight_decay', '0.01', '--warmup_ratio', '0.01', '--max_grad_norm', '1.0', '--logging_steps', '1', '--dataloader_num_workers', '1', '--sharding', 'stage2', '--eval_steps', '1000', '--report_to', 'visualdl', '--disable_tqdm', 'true', '--continue_training', '0', '--recompute', '0', '--do_train', '--device', 'gpu'] code 1 log output/llama_hybrid_log/workerlog.6 
env {'NV_LIBCUBLAS_VERSION': '12.0.1.189-1', 'NVIDIA_VISIBLE_DEVICES': 'all', 'COLORTERM': 'truecolor', 'NV_NVML_DEV_VERSION': '12.0.76-1', 'NV_CUDNN_PACKAGE_NAME': 'libcudnn8', 'GREP_COLOR': '1;31', 'TERM_PROGRAM_VERSION': '1.83.1', 'NV_LIBNCCL_DEV_PACKAGE': 'libnccl-dev=2.17.1-1+cuda12.0', 'NV_LIBNCCL_DEV_PACKAGE_VERSION': '2.17.1-1', 'HOSTNAME': 'szzj-isa-ai-peking-poc13.szzj.baidu.com', 'LANGUAGE': 'en_US.UTF-8', 'NVIDIA_REQUIRE_CUDA': 'cuda>=12.0 brand=tesla,driver>=470,driver<471 brand=unknown,driver>=470,driver<471 brand=nvidia,driver>=470,driver<471 brand=nvidiartx,driver>=470,driver<471 brand=geforce,driver>=470,driver<471 brand=geforcertx,driver>=470,driver<471 brand=quadro,driver>=470,driver<471 brand=quadrortx,driver>=470,driver<471 brand=titan,driver>=470,driver<471 brand=titanrtx,driver>=470,driver<471', 'NV_LIBCUBLAS_DEV_PACKAGE': 'libcublas-dev-12-0=12.0.1.189-1', 'NV_NVTX_VERSION': '12.0.76-1', 'NV_CUDA_CUDART_DEV_VERSION': '12.0.107-1', 'NV_LIBCUSPARSE_VERSION': '12.0.0.76-1', 'NV_LIBNPP_VERSION': '12.0.0.30-1', 'NCCL_VERSION': '2.17.1-1', 'PWD': '/host/PaddleNLP-XPU/llm/llama', 'NV_CUDNN_PACKAGE': 'libcudnn8=8.8.0.121-1+cuda12.0', 'NVIDIA_DRIVER_CAPABILITIES': 'compute,utility', 'WITH_AVX': 'ON', 'NV_NVPROF_DEV_PACKAGE': 'cuda-nvprof-12-0=12.0.90-1', 'NV_LIBNPP_PACKAGE': 'libnpp-12-0=12.0.0.30-1', 'NV_LIBNCCL_DEV_PACKAGE_NAME': 'libnccl-dev', 'GREP_OPTIONS': '--color=auto', 'VSCODE_GIT_ASKPASS_NODE': '/root/.vscode-server/bin/1.8.401.83.1.02/node', 'NV_LIBCUBLAS_DEV_VERSION': '12.0.1.189-1', 'NVIDIA_PRODUCT_NAME': 'CUDA', 'NV_LIBCUBLAS_DEV_PACKAGE_NAME': 'libcublas-dev-12-0', 'NV_CUDA_CUDART_VERSION': '12.0.107-1', 'HOME': '/root', 'LANG': 'en_US.UTF-8', 'NVIDIA_CUDA_END_OF_LIFE': '1', 'CUDA_VERSION': '12.0.0', 'NV_LIBCUBLAS_PACKAGE': 'libcublas-12-0=12.0.1.189-1', 'NV_CUDA_NSIGHT_COMPUTE_DEV_PACKAGE': 'cuda-nsight-compute-12-0=12.0.0-1', 'ICODING_VERSION': '1.8.401.83.1.02', 'GIT_ASKPASS': 
'/root/.vscode-server/bin/1.8.401.83.1.02/extensions/git/dist/askpass.sh', 'CLICOLOR': '1', 'NV_LIBNPP_DEV_PACKAGE': 'libnpp-dev-12-0=12.0.0.30-1', 'GOROOT': '/usr/local/go', 'NV_LIBCUBLAS_PACKAGE_NAME': 'libcublas-12-0', 'NV_LIBNPP_DEV_VERSION': '12.0.0.30-1', 'VSCODE_GIT_ASKPASS_EXTRA_ARGS': '', 'WITH_GPU': 'ON', 'TERM': 'xterm-256color', 'NV_LIBCUSPARSE_DEV_VERSION': '12.0.0.76-1', 'LIBRARY_PATH': '/usr/local/cuda/lib64/stubs', 'NV_CUDNN_VERSION': '8.8.0.121', 'VSCODE_GIT_IPC_HANDLE': '/tmp/vscode-git-a504850b12.sock', 'SHLVL': '2', 'NV_CUDA_LIB_VERSION': '12.0.0-1', 'NVARCH': 'x86_64', 'CUDNN_VERSION': '8.9.1', 'NV_CUDNN_PACKAGE_DEV': 'libcudnn8-dev=8.8.0.121-1+cuda12.0', 'NV_CUDA_COMPAT_PACKAGE': 'cuda-compat-12-0', 'NV_LIBNCCL_PACKAGE': 'libnccl2=2.17.1-1+cuda12.0', 'LD_LIBRARY_PATH': '', 'NV_CUDA_NSIGHT_COMPUTE_VERSION': '12.0.0-1', 'NV_NVPROF_VERSION': '12.0.90-1', 'LC_ALL': 'en_US.UTF-8', 'VSCODE_GIT_ASKPASS_MAIN': '/root/.vscode-server/bin/1.8.401.83.1.02/extensions/git/dist/askpass-main.js', 'BROWSER': '/root/.vscode-server/bin/1.8.401.83.1.02/bin/helpers/browser.sh', 'PATH': '/root/.BCloud/bin:/root/.vscode-server/bin/1.8.401.83.1.02/bin/remote-cli:/root/.BCloud/bin:/root/.vscode-server/bin/1.8.401.83.1.02/bin:/root/.vscode-server/bin:/home/cmake-3.18.0-Linux-x86_64/bin:/usr/local/gcc-12.1/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/go/bin:/root/gopath/bin', 'NV_LIBNCCL_PACKAGE_NAME': 'libnccl2', 'NV_LIBNCCL_PACKAGE_VERSION': '2.17.1-1', 'DEBIAN_FRONTEND': 'noninteractive', 'OLDPWD': '/host/PaddleNLP-XPU', 'GOPATH': '/root/gopath', 'TERM_PROGRAM': 'vscode', 'VSCODE_IPC_HOOK_CLI': '/tmp/vscode-ipc-1f8e8da3-5315-4fd5-b7be-285e4dc98f23.sock', '_': '/usr/bin/python', 'CUSTOM_DEVICE_ROOT': '', 'OMP_NUM_THREADS': '1', 'POD_NAME': 'egfwmz', 'PADDLE_MASTER': '10.93.234.25:45151', 'PADDLE_GLOBAL_SIZE': '8', 'PADDLE_LOCAL_SIZE': '8', 'PADDLE_GLOBAL_RANK': '6', 'PADDLE_LOCAL_RANK': '6', 
'PADDLE_NNODES': '1', 'PADDLE_CURRENT_ENDPOINT': '10.93.234.25:45158', 'PADDLE_TRAINER_ID': '6', 'PADDLE_TRAINERS_NUM': '8', 'PADDLE_RANK_IN_NODE': '6', 'PADDLE_TRAINER_ENDPOINTS': '10.93.234.25:45152,10.93.234.25:45153,10.93.234.25:45154,10.93.234.25:45155,10.93.234.25:45156,10.93.234.25:45157,10.93.234.25:45158,10.93.234.25:45159', 'FLAGS_selected_gpus': '6', 'PADDLE_LOG_DIR': '/host/PaddleNLP-XPU/llm/llama/output/llama_hybrid_log'}
LAUNCH INFO 2024-03-05 08:06:34,817 ------------------------- ERROR LOG DETAIL -------------------------
[2024-03-05 07:21:54,674] [    INFO] - ***** Running training *****
[2024-03-05 07:21:54,674] [    INFO] -   Num examples = 806,405
[2024-03-05 07:21:54,674] [    INFO] -   Num Epochs = 1
[2024-03-05 07:21:54,674] [    INFO] -   Instantaneous batch size per device = 1
[2024-03-05 07:21:54,674] [    INFO] -   Total train batch size (w. parallel, distributed & accumulation) = 8
[2024-03-05 07:21:54,674] [    INFO] -   Gradient Accumulation steps = 1
[2024-03-05 07:21:54,674] [    INFO] -   Total optimization steps = 100,000
[2024-03-05 07:21:54,674] [    INFO] -   Total num train samples = 800,000
[2024-03-05 07:21:54,676] [    INFO] -   Number of trainable parameters = 13,015,864,320 (per device)
I0305 07:21:56.126010 76258 custom_operator.cc:1296] register pir custom op :fused_rms_norm
I0305 07:21:56.126060 76258 custom_operator.cc:1296] register pir custom op :fused_rms_norm_grad
I0305 07:21:56.126178 76258 custom_operator.cc:1296] register pir custom op :fused_ln
I0305 07:21:56.126186 76258 custom_operator.cc:1296] register pir custom op :fused_ln_grad
Traceback (most recent call last):
  File "/host/PaddleNLP-XPU/llm/llama/run_pretrain.py", line 567, in <module>
    main()
  File "/host/PaddleNLP-XPU/llm/llama/run_pretrain.py", line 548, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/host/PaddleNLP-XPU/paddlenlp/trainer/trainer.py", line 890, in train
    dp_master_grad = (
  File "/host/PaddleNLP-XPU/paddlenlp/trainer/trainer.py", line 1900, in training_step
  File "/host/PaddleNLP-XPU/paddlenlp/trainer/trainer.py", line 1853, in compute_loss
    labels = (inputs.pop("start_positions"), inputs.pop("end_positions"))
  File "/usr/local/lib/python3.10/dist-packages/paddle/nn/layer/layers.py", line 1429, in __call__
    return self.forward(*inputs, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/paddle/distributed/fleet/meta_parallel/sharding/group_sharded_stage2.py", line 190, in forward
    fw = self._layer(*inputs, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/paddle/nn/layer/layers.py", line 1429, in __call__
    return self.forward(*inputs, **kwargs)
  File "/host/PaddleNLP-XPU/paddlenlp/transformers/llama/modeling.py", line 1611, in forward
    loss = self.criterion(logits, labels)
  File "/usr/local/lib/python3.10/dist-packages/paddle/nn/layer/layers.py", line 1429, in __call__
    return self.forward(*inputs, **kwargs)
  File "/host/PaddleNLP-XPU/paddlenlp/transformers/llama/modeling.py", line 1427, in forward
    loss = paddle.mean(masked_lm_loss)
  File "/usr/local/lib/python3.10/dist-packages/paddle/tensor/stat.py", line 90, in mean
    return _C_ops.mean(x, axis, keepdim)
ValueError: (InvalidArgument) Tensor need be reduced must not empty.
  [Hint: Expected x.numel() > 0, but received x.numel():0 <= 0:0.] (at ../paddle/phi/kernels/funcs/reduce_function.h:1052)

LAUNCH INFO 2024-03-05 08:06:40,653 Exit code -15

Steps to reproduce & code

The error comes from these two lines.

Since masked_lm_loss.numel() == 0, calling paddle.mean on it raises the error above. The loss being 0 is most likely because the softmax produced a one-hot tensor: the value at the target label's position is 1 and every other position is 0.

import numpy as np

def stable_softmax(x):
    # Subtract the row max for numerical stability; the largest logit maps to z = 0.
    z = x - np.max(x, axis=-1, keepdims=True)
    print("z", z)
    numerator = np.exp(z)
    print("numerator", numerator)
    denominator = np.sum(numerator, axis=-1, keepdims=True)
    print("denominator", denominator)
    softmax = numerator / denominator
    print("softmax", softmax)
    return softmax

x = np.array([-2710.10620117, -2914.37866211, -5045.04443359, -4361.91601562, -459.57000732, 8843.65820312, -1871.62756348, 5447.12451172, -10947.22949219])
stable_softmax(x)

# z [-11553.76440429 -11758.03686523 -13888.70263671 -13205.57421874 -9303.22821044 0  -10715.2857666 -3396.5336914  -19790.88769531]
# numerator [0. 0. 0. 0. 0. 1. 0. 0. 0.]
# denominator [1.]
# softmax [0. 0. 0. 0. 0. 1. 0. 0. 0.]
# array([0., 0., 0., 0., 0., 1., 0., 0., 0.])

When the exponent passed to exp is very negative (below about -1000), the result underflows to 0.
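The underflow behavior can be checked directly with NumPy (a quick sketch; the exact float64 underflow threshold is around -745, well above the z values of roughly -10^4 produced by the logits in this issue):

```python
import numpy as np

# float64 exp underflows to exactly 0.0 once the argument is
# sufficiently negative, which is how the one-hot softmax arises.
print(np.exp(np.float64(-700.0)))   # tiny but still positive
print(np.exp(np.float64(-1000.0)))  # underflows to exactly 0.0
```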

References:

w5688414 commented 6 months ago

Thanks for the feedback. I looked into it, and the problem was introduced by this PR:

https://github.com/PaddlePaddle/PaddleNLP/commit/93e78c2f3fd3d33054da49c78551a148810caaf3#diff-99e104eff4c095428aa1cd5d186107ae22737297e8ec3b5c12cd138e69a79cb5

Please check whether the implementation below solves your problem:

masked_lm_loss = masked_lm_loss[masked_lm_labels != self.ignore_index]
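For intuition, here is a minimal NumPy sketch (not the actual Paddle code; `IGNORE_INDEX = -100` is an assumption for illustration) of what the boolean-mask filter above does, plus an explicit guard for the case where every position carries the ignore index and the filtered tensor would be empty:

```python
import numpy as np

IGNORE_INDEX = -100  # hypothetical ignore value, for this sketch only

def masked_mean_loss(per_token_loss, labels, ignore_index=IGNORE_INDEX):
    # Keep only positions whose label is not ignore_index, mirroring:
    #   masked_lm_loss = masked_lm_loss[masked_lm_labels != self.ignore_index]
    kept = per_token_loss[labels != ignore_index]
    if kept.size == 0:
        # Guard: reducing an empty tensor is exactly what raised the
        # "Tensor need be reduced must not empty" error in Paddle.
        return np.float32(0.0)
    return kept.mean()

loss = np.array([0.5, 0.0, 1.5], dtype=np.float32)
labels = np.array([3, -100, 7])
print(masked_mean_loss(loss, labels))  # mean over the two kept positions
```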
dynamicheart commented 6 months ago

@w5688414 OK. It looks like this should guarantee masked_lm_loss is not an empty tensor, as long as the dataset is processed correctly. I'll try it later, but note that this issue does not reproduce deterministically.

cqulilujia commented 6 months ago

I hit this problem again while running llama pretrain with pipeline_parallel=2 and sharding stage1. I traced it to the current loss function returning loss = float(0), which triggers the assert in paddle/distributed/fleet/meta_parallel/pipeline_parallel.py. Log below:

After applying the fix from #8459, the type check in PP is bypassed, but the program hangs at step=81 and makes no further progress. My guess is that creating a new tensor cuts the gradient graph, so some communication logic under the PP configuration cannot execute normally.

[2024-05-15 16:26:28,733] [ INFO] - loss: 7.44834805, learning_rate: 2.4e-06, global_step: 79, current_memory_allocated: 42.891517996788025, current_memory_reserved: 0.0, max_memory_allocated: 82.25603437423706, max_memory_reserved: 0.0, interval_runtime: 29.755, interval_samples_per_second: 4.3018, interval_tokens_per_second_per_device: 2202.5182, interval_steps_per_second: 0.0336, progress_or_epoch: 0.0008
[2024-05-15 16:26:58,668] [ INFO] - loss: 7.31905365, learning_rate: 2.43e-06, global_step: 80, current_memory_allocated: 42.891517996788025, current_memory_reserved: 0.0, max_memory_allocated: 82.25603437423706, max_memory_reserved: 0.0, interval_runtime: 29.935, interval_samples_per_second: 4.2759, interval_tokens_per_second_per_device: 2189.279, interval_steps_per_second: 0.0334, progress_or_epoch: 0.0008
LAUNCH INFO 2024-05-15 16:27:24,714 Pod failed
LAUNCH ERROR 2024-05-15 16:27:24,715 Container failed !!!
Container rank 4 status failed cmd ['/root/miniconda3/envs/paddle/bin/python', '-u', 'run_pretrain.py', '--model_name_or_path', 'meta-llama/Llama-2-13b', '--tokenizer_name_or_path', 'meta-llama/Llama-2-13b', '--input_dir', './data', '--output_dir', 'output/llama2-13b-4k/20240515154555', '--split', '949,50,1', '--max_seq_length', '4096', '--per_device_train_batch_size', '1', '--per_device_eval_batch_size', '1', '--use_flash_attention', '1', '--use_fused_rope', '1', '--fuse_attention_ffn', '1', '--fuse_attention_qkv', '1', '--use_fused_rms_norm', '1', '--num_hidden_layers', '40', '--bf16', '--fp16_opt_level', 'O2', '--scale_loss', '1024', '--learning_rate', '0.00003', '--min_learning_rate', '0.000005', '--lr_scheduler_type', 'cosine', '--max_steps', '100000', '--save_steps', '100000', '--weight_decay', '0.01', '--warmup_ratio', '0.01', '--max_grad_norm', '1.0', '--logging_steps', '1', '--sequence_parallel', '0', '--dataloader_num_workers', '4', '--pipeline_parallel_degree', '2', '--pipeline_parallel_config', 'disable_partial_send_recv',
'--tensor_parallel_degree', '1', '--tensor_parallel_config', 'enable_mp_async_allreduce,enable_mp_skip_c_identity', '--gradient_accumulation_steps', '32', '--sharding', 'stage1', '--eval_steps', '1000', '--report_to', 'visualdl', '--disable_tqdm', 'true', '--continue_training', '0', '--recompute', '0', '--do_train', '--seed', '1026', '--device', 'xpu'] code 1 log output/llama2-13b-4k/20240515154555_log/workerlog.4 env {'PYTHONPATH': '../../:', 'LSCOLORS': 'Gxfxcxdxbxegedabagacad', 'LESS': '-R', 'CONDA_EXE': '/root/miniconda3/bin/conda', '_CE_M': '', 'XPU_CDNN_CLUSTER_PARALLEL_STREAM_NUMBER': '2', 'HOSTNAME': 'localhost.localdomain', 'PWD': '/workspace/PaddleNLP/llm/llama', 'LOGNAME': 'root', 'CONDA_PREFIX': '/root/miniconda3/envs/paddle', 'XPU_PADDLE_L3_SIZE1': '1024', 'XPU_PADDLE_L3_SIZE0': '1024', 'XBLAS_FC_HBM_VERSION': '40', 'FLAGS_use_stride_kernel': '0', 'HOME': '/root', 'LS_COLORS': 'rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=00:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:.tar=01;31:.tgz=01;31:.arc=01;31:.arj=01;31:.taz=01;31:.lha=01;31:.lz4=01;31:.lzh=01;31:.lzma=01;31:.tlz=01;31:.txz=01;31:.tzo=01;31:.t7z=01;31:.zip=01;31:.z=01;31:.dz=01;31:.gz=01;31:.lrz=01;31:.lz=01;31:.lzo=01;31:.xz=01;31:.zst=01;31:.tzst=01;31:.bz2=01;31:.bz=01;31:.tbz=01;31:.tbz2=01;31:.tz=01;31:.deb=01;31:.rpm=01;31:.jar=01;31:.war=01;31:.ear=01;31:.sar=01;31:.rar=01;31:.alz=01;31:.ace=01;31:.zoo=01;31:.cpio=01;31:.7z=01;31:.rz=01;31:.cab=01;31:.wim=01;31:.swm=01;31:.dwm=01;31:.esd=01;31:.jpg=01;35:.jpeg=01;35:.mjpg=01;35:.mjpeg=01;35:.gif=01;35:.bmp=01;35:.pbm=01;35:.pgm=01;35:.ppm=01;35:.tga=01;35:.xbm=01;35:.xpm=01;35:.tif=01;35:.tiff=01;35:.png=01;35:.svg=01;35:.svgz=01;35:.mng=01;35:.pcx=01;35:.mov=01;35:.mpg=01;35:.mpeg=01;35:.m2v=01;35:.mkv=01;35:.webm=01;35:.ogm=01;35:.mp4=01;35:.m4v=01;35:.mp4v=01;35:.vob=01;35:.qt=01;35:.nuv=01;35:.wmv=01;35:.asf=01;35:.rm=01;35:.rmvb=01;35:.flc=01;35:.avi=01;35:.fli=01;3
5:.flv=01;35:.gl=01;35:.dl=01;35:.xcf=01;35:.xwd=01;35:.yuv=01;35:.cgm=01;35:.emf=01;35:.ogv=01;35:.ogx=01;35:.aac=00;36:.au=00;36:.flac=00;36:.m4a=00;36:.mid=00;36:.midi=00;36:.mka=00;36:.mp3=00;36:.mpc=00;36:.ogg=00;36:.ra=00;36:.wav=00;36:.oga=00;36:.opus=00;36:.spx=00;36:*.xspf=00;36:', 'CONDA_PROMPT_MODIFIER': '(paddle) ', 'TERM': 'xterm', 'XPU_CDNN_CLUSTER_PARALLEL': '1', 'ZSH': '/root/.oh-my-zsh', '_CE_CONDA': '', 'XPUAPI_DEFAULT_SIZE0': '1502653248', 'XPUAPI_DEFAULT_SIZE1': '380265324', 'CONDA_SHLVL': '2', 'SHLVL': '2', 'PAGER': 'less', 'CUDA_DEVICE_MAX_CONNECTIONS': '8', 'CONDA_PYTHON_EXE': '/root/miniconda3/bin/python', 'LD_LIBRARY_PATH': '/workspace/so-bkcl/:/workspace/so-runtime/:/workspace/so-fast_paddle/:', 'CONDA_DEFAULT_ENV': 'paddle', 'XPU_FORCE_USERMODE_LAUNCH': '1', 'PATH': '/root/miniconda3/envs/paddle/bin:/root/miniconda3/condabin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin', 'CONDA_PREFIX1': '/root/miniconda3', 'OLDPWD': '/workspace/PaddleNLP', '': '/root/miniconda3/envs/paddle/bin/python', 'LC_CTYPE': 'C.UTF-8', 'CUSTOM_DEVICE_ROOT': '', 'OMP_NUM_THREADS': '1', 'POD_NAME': 'cztpec', 'PADDLE_MASTER': '127.0.0.1:36569', 'PADDLE_GLOBAL_SIZE': '8', 'PADDLE_LOCAL_SIZE': '8', 'PADDLE_GLOBAL_RANK': '4', 'PADDLE_LOCAL_RANK': '4', 'PADDLE_NNODES': '1', 'PADDLE_CURRENT_ENDPOINT': '127.0.0.1:36574', 'PADDLE_TRAINER_ID': '4', 'PADDLE_TRAINERS_NUM': '8', 'PADDLE_RANK_IN_NODE': '4', 'PADDLE_TRAINER_ENDPOINTS': '127.0.0.1:36570,127.0.0.1:36571,127.0.0.1:36572,127.0.0.1:36573,127.0.0.1:36574,127.0.0.1:36575,127.0.0.1:36576,127.0.0.1:36577', 'FLAGS_selected_xpus': '4', 'PADDLE_LOG_DIR': '/workspace/PaddleNLP/llm/llama/output/llama2-13b-4k/20240515154555_log'} LAUNCH INFO 2024-05-15 16:27:24,715 ------------------------- ERROR LOG DETAIL ------------------------- dygraph_optimizer/dygraph_sharding_optimizer.py:101: UserWarning: nccl reduce_avg requires paddle compiled with cuda and nccl>=2.10.0, please check compilation setups. 
warnings.warn(
[2024-05-15 15:46:56,542] [ WARNING] hybrid_parallel_optimizer.py:292 - While using ClipGradByGlobalNorm in TensorParallel, PipelineParallel or Sharding, the grad clip of original optimizer will be changed.
[2024-05-15 15:46:56,542] [ INFO] - [timelog] checkpoint loading time: 0.00s (2024-05-15 15:46:56)
[2024-05-15 15:46:56,543] [ INFO] - ***** Running training *****
[2024-05-15 15:46:56,543] [ INFO] - Num examples = 12,816,085
[2024-05-15 15:46:56,543] [ INFO] - Num Epochs = 1
[2024-05-15 15:46:56,543] [ INFO] - Instantaneous batch size per device = 1
[2024-05-15 15:46:56,543] [ INFO] - Total train batch size (w. parallel, distributed & accumulation) = 128
[2024-05-15 15:46:56,543] [ INFO] - Gradient Accumulation steps = 32
[2024-05-15 15:46:56,543] [ INFO] - Total optimization steps = 100,000
[2024-05-15 15:46:56,543] [ INFO] - Total num train samples = 12,800,000
[2024-05-15 15:46:56,545] [ DEBUG] - Number of trainable parameters = 6,507,934,720 (per device)
[2024-05-15 15:46:56,563] [ DEBUG] - Number of trainable parameters = 13,015,863,296 (all devices, roughly)
/root/miniconda3/envs/paddle/lib/python3.9/site-packages/paddle/amp/auto_cast.py:502: UserWarning: XPUPlace only support float16 amp.
warnings.warn('XPUPlace only support float16 amp.')
Traceback (most recent call last):
  File "/workspace/PaddleNLP/llm/llama/run_pretrain.py", line 630, in <module>
    main()
  File "/workspace/PaddleNLP/llm/llama/run_pretrain.py", line 608, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/workspace/PaddleNLP/paddlenlp/trainer/trainer.py", line 770, in train
    return self._inner_training_loop(
  File "/workspace/PaddleNLP/paddlenlp/trainer/trainer.py", line 964, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/workspace/PaddleNLP/paddlenlp/trainer/trainer.py", line 2044, in training_step
    return self.training_pipeline_step(model, inputs)
  File "/workspace/PaddleNLP/paddlenlp/trainer/trainer.py", line 2113, in training_pipeline_step
    loss = model.forward_backward_pipeline(inputs, self.scaler if self.do_grad_scaling else None)
  File "/root/miniconda3/envs/paddle/lib/python3.9/site-packages/paddle/distributed/fleet/meta_parallel/pipeline_parallel.py", line 536, in forward_backward_pipeline
    output_tensor = self._forward_step(input_tensor, micro_dataset)
  File "/root/miniconda3/envs/paddle/lib/python3.9/site-packages/paddle/distributed/fleet/meta_parallel/pipeline_parallel.py", line 789, in _forward_step
    assert isinstance(
AssertionError: Currently, loss_fn should obtain Paddle.Tensor dtype
LAUNCH INFO 2024-05-15 16:27:29,316 Exit code -15
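The assertion at the end of the log above fires because the loss function returned a plain Python float instead of a tensor. A hedged sketch of the distinction, using NumPy arrays as stand-ins for Paddle tensors:

```python
import numpy as np

def loss_fn_bad(kept):
    # Returning a plain Python float(0) is what trips the
    # "loss_fn should obtain Paddle.Tensor dtype" assertion.
    return float(0) if kept.size == 0 else kept.mean()

def loss_fn_ok(kept):
    # Return a 0-d array of the right dtype instead, so type and
    # dtype checks in the pipeline still pass.
    if kept.size == 0:
        return np.zeros((), dtype=kept.dtype)
    return kept.mean()

empty = np.array([], dtype=np.float32)
print(type(loss_fn_bad(empty)))  # plain Python float
print(type(loss_fn_ok(empty)))   # 0-d array, dtype float32
```

Note this only addresses the dtype assertion; as observed above, a freshly created zero tensor is detached from the autograd graph, which may be why the run subsequently hangs under the PP configuration.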