Thanks for the feedback. I looked into it, and the problem was introduced by this PR:
See whether the implementation below resolves it:
```python
masked_lm_loss = masked_lm_loss[masked_lm_labels != self.ignore_index]
```
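For context, a minimal sketch of where such a filtering line might sit inside the loss computation (the function name, shapes, and ignore_index handling here are assumptions for illustration, not the exact PaddleNLP code):

```python
import paddle

def masked_lm_loss_fn(prediction_scores, masked_lm_labels, ignore_index=-100):
    # Per-token cross-entropy, no reduction yet.
    loss_fct = paddle.nn.CrossEntropyLoss(reduction="none")
    masked_lm_loss = loss_fct(
        prediction_scores.astype("float32"),  # [batch, seq_len, vocab]
        masked_lm_labels.unsqueeze(-1),       # [batch, seq_len, 1]
    )
    # Drop positions whose label is ignore_index before averaging, so
    # ignored/padded tokens do not pull the mean toward zero.
    masked_lm_loss = masked_lm_loss[masked_lm_labels != ignore_index]
    # Caveat: if every label equals ignore_index, this selection is empty
    # and paddle.mean on it fails -- the situation discussed in this thread.
    return paddle.mean(masked_lm_loss)
```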
@w5688414 OK, that looks right: as long as the dataset is processed correctly, this should guarantee that masked_lm_loss is not an empty tensor. I'll try it later, but the issue does not reproduce deterministically.
Ran into this pitfall again while running llama pretrain with pipeline_parallel=2 and sharding stage1. I traced it to the current loss function returning loss=float(0), which triggers the assert in paddle/distributed/fleet/meta_parallel/pipeline_parallel.py. Log below:
After applying the fix from #8459, the type check in pp is bypassed, but the program hangs at step=81 and makes no further progress. My guess is that constructing a new tensor breaks the gradient graph, so some of the communication logic under the pp configuration can no longer execute correctly.
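If the hang really is caused by a freshly built tensor detaching the loss from the graph, one possible workaround is to express the masking as a multiply instead of boolean indexing, so the reduced loss is always produced by the forward graph, even when every position is ignored. A minimal sketch under that assumption (names and shapes are illustrative, not the actual fix):

```python
import paddle

def masked_lm_loss_fn(prediction_scores, masked_lm_labels, ignore_index=-100):
    loss_fct = paddle.nn.CrossEntropyLoss(reduction="none")
    per_token_loss = loss_fct(
        prediction_scores.astype("float32"),
        masked_lm_labels.unsqueeze(-1),
    )  # [batch, seq_len, 1]
    # Zero out ignored positions with a multiply; the result stays connected
    # to the graph instead of becoming an empty tensor or a Python float.
    mask = (masked_lm_labels != ignore_index).astype(per_token_loss.dtype)
    mask = mask.unsqueeze(-1)                 # [batch, seq_len, 1]
    denom = mask.sum().clip(min=1.0)          # avoid 0/0 when nothing is labeled
    return (per_token_loss * mask).sum() / denom
```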
[2024-05-15 16:26:28,733] [ INFO] - loss: 7.44834805, learning_rate: 2.4e-06, global_step: 79, current_memory_allocated: 42.891517996788025, current_memory_reserved: 0.0, max_memory_allocated: 82.25603437423706, max_memory_reserved: 0.0, interval_runtime: 29.755, interval_samples_per_second: 4.3018, interval_tokens_per_second_per_device: 2202.5182, interval_steps_per_second: 0.0336, progress_or_epoch: 0.0008
[2024-05-15 16:26:58,668] [ INFO] - loss: 7.31905365, learning_rate: 2.43e-06, global_step: 80, current_memory_allocated: 42.891517996788025, current_memory_reserved: 0.0, max_memory_allocated: 82.25603437423706, max_memory_reserved: 0.0, interval_runtime: 29.935, interval_samples_per_second: 4.2759, interval_tokens_per_second_per_device: 2189.279, interval_steps_per_second: 0.0334, progress_or_epoch: 0.0008
LAUNCH INFO 2024-05-15 16:27:24,714 Pod failed
LAUNCH ERROR 2024-05-15 16:27:24,715 Container failed !!!
Container rank 4 status failed cmd ['/root/miniconda3/envs/paddle/bin/python', '-u', 'run_pretrain.py', '--model_name_or_path', 'meta-llama/Llama-2-13b', '--tokenizer_name_or_path', 'meta-llama/Llama-2-13b', '--input_dir', './data', '--output_dir', 'output/llama2-13b-4k/20240515154555', '--split', '949,50,1', '--max_seq_length', '4096', '--per_device_train_batch_size', '1', '--per_device_eval_batch_size', '1', '--use_flash_attention', '1', '--use_fused_rope', '1', '--fuse_attention_ffn', '1', '--fuse_attention_qkv', '1', '--use_fused_rms_norm', '1', '--num_hidden_layers', '40', '--bf16', '--fp16_opt_level', 'O2', '--scale_loss', '1024', '--learning_rate', '0.00003', '--min_learning_rate', '0.000005', '--lr_scheduler_type', 'cosine', '--max_steps', '100000', '--save_steps', '100000', '--weight_decay', '0.01', '--warmup_ratio', '0.01', '--max_grad_norm', '1.0', '--logging_steps', '1', '--sequence_parallel', '0', '--dataloader_num_workers', '4', '--pipeline_parallel_degree', '2', '--pipeline_parallel_config', 'disable_partial_send_recv', '--tensor_parallel_degree', '1', '--tensor_parallel_config', 'enable_mp_async_allreduce,enable_mp_skip_c_identity', '--gradient_accumulation_steps', '32', '--sharding', 'stage1', '--eval_steps', '1000', '--report_to', 'visualdl', '--disable_tqdm', 'true', '--continue_training', '0', '--recompute', '0', '--do_train', '--seed', '1026', '--device', 'xpu'] code 1 log output/llama2-13b-4k/20240515154555_log/workerlog.4
env {'PYTHONPATH': '../../:', 'LSCOLORS': 'Gxfxcxdxbxegedabagacad', 'LESS': '-R', 'CONDA_EXE': '/root/miniconda3/bin/conda', '_CE_M': '', 'XPU_CDNN_CLUSTER_PARALLEL_STREAM_NUMBER': '2', 'HOSTNAME': 'localhost.localdomain', 'PWD': '/workspace/PaddleNLP/llm/llama', 'LOGNAME': 'root', 'CONDA_PREFIX': '/root/miniconda3/envs/paddle', 'XPU_PADDLE_L3_SIZE1': '1024', 'XPU_PADDLE_L3_SIZE0': '1024', 'XBLAS_FC_HBM_VERSION': '40', 'FLAGS_use_stride_kernel': '0', 'HOME': '/root', 'LS_COLORS': 'rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=00:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:.tar=01;31:.tgz=01;31:.arc=01;31:.arj=01;31:.taz=01;31:.lha=01;31:.lz4=01;31:.lzh=01;31:.lzma=01;31:.tlz=01;31:.txz=01;31:.tzo=01;31:.t7z=01;31:.zip=01;31:.z=01;31:.dz=01;31:.gz=01;31:.lrz=01;31:.lz=01;31:.lzo=01;31:.xz=01;31:.zst=01;31:.tzst=01;31:.bz2=01;31:.bz=01;31:.tbz=01;31:.tbz2=01;31:.tz=01;31:.deb=01;31:.rpm=01;31:.jar=01;31:.war=01;31:.ear=01;31:.sar=01;31:.rar=01;31:.alz=01;31:.ace=01;31:.zoo=01;31:.cpio=01;31:.7z=01;31:.rz=01;31:.cab=01;31:.wim=01;31:.swm=01;31:.dwm=01;31:.esd=01;31:.jpg=01;35:.jpeg=01;35:.mjpg=01;35:.mjpeg=01;35:.gif=01;35:.bmp=01;35:.pbm=01;35:.pgm=01;35:.ppm=01;35:.tga=01;35:.xbm=01;35:.xpm=01;35:.tif=01;35:.tiff=01;35:.png=01;35:.svg=01;35:.svgz=01;35:.mng=01;35:.pcx=01;35:.mov=01;35:.mpg=01;35:.mpeg=01;35:.m2v=01;35:.mkv=01;35:.webm=01;35:.ogm=01;35:.mp4=01;35:.m4v=01;35:.mp4v=01;35:.vob=01;35:.qt=01;35:.nuv=01;35:.wmv=01;35:.asf=01;35:.rm=01;35:.rmvb=01;35:.flc=01;35:.avi=01;35:.fli=01;35:.flv=01;35:.gl=01;35:.dl=01;35:.xcf=01;35:.xwd=01;35:.yuv=01;35:.cgm=01;35:.emf=01;35:.ogv=01;35:.ogx=01;35:.aac=00;36:.au=00;36:.flac=00;36:.m4a=00;36:.mid=00;36:.midi=00;36:.mka=00;36:.mp3=00;36:.mpc=00;36:.ogg=00;36:.ra=00;36:.wav=00;36:.oga=00;36:.opus=00;36:.spx=00;36:*.xspf=00;36:', 'CONDA_PROMPT_MODIFIER': '(paddle) ', 'TERM': 'xterm', 'XPU_CDNN_CLUSTER_PARALLEL': '1', 'ZSH': '/root/.oh-my-zsh', '_CE_CONDA': '', 'XPUAPI_DEFAULT_SIZE0': '1502653248', 'XPUAPI_DEFAULT_SIZE1': '380265324', 'CONDA_SHLVL': '2', 'SHLVL': '2', 'PAGER': 'less', 'CUDA_DEVICE_MAX_CONNECTIONS': '8', 'CONDA_PYTHON_EXE': '/root/miniconda3/bin/python', 'LD_LIBRARY_PATH': '/workspace/so-bkcl/:/workspace/so-runtime/:/workspace/so-fast_paddle/:', 'CONDA_DEFAULT_ENV': 'paddle', 'XPU_FORCE_USERMODE_LAUNCH': '1', 'PATH': '/root/miniconda3/envs/paddle/bin:/root/miniconda3/condabin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin', 'CONDA_PREFIX1': '/root/miniconda3', 'OLDPWD': '/workspace/PaddleNLP', '': '/root/miniconda3/envs/paddle/bin/python', 'LC_CTYPE': 'C.UTF-8', 'CUSTOM_DEVICE_ROOT': '', 'OMP_NUM_THREADS': '1', 'POD_NAME': 'cztpec', 'PADDLE_MASTER': '127.0.0.1:36569', 'PADDLE_GLOBAL_SIZE': '8', 'PADDLE_LOCAL_SIZE': '8', 'PADDLE_GLOBAL_RANK': '4', 'PADDLE_LOCAL_RANK': '4', 'PADDLE_NNODES': '1', 'PADDLE_CURRENT_ENDPOINT': '127.0.0.1:36574', 'PADDLE_TRAINER_ID': '4', 'PADDLE_TRAINERS_NUM': '8', 'PADDLE_RANK_IN_NODE': '4', 'PADDLE_TRAINER_ENDPOINTS': '127.0.0.1:36570,127.0.0.1:36571,127.0.0.1:36572,127.0.0.1:36573,127.0.0.1:36574,127.0.0.1:36575,127.0.0.1:36576,127.0.0.1:36577', 'FLAGS_selected_xpus': '4', 'PADDLE_LOG_DIR': '/workspace/PaddleNLP/llm/llama/output/llama2-13b-4k/20240515154555_log'}
LAUNCH INFO 2024-05-15 16:27:24,715 ------------------------- ERROR LOG DETAIL -------------------------
dygraph_optimizer/dygraph_sharding_optimizer.py:101: UserWarning: nccl reduce_avg requires paddle compiled with cuda and nccl>=2.10.0, please check compilation setups.
warnings.warn(
[2024-05-15 15:46:56,542] [ WARNING] hybrid_parallel_optimizer.py:292 - While using ClipGradByGlobalNorm in TensorParallel, PipelineParallel or Sharding, the grad clip of original optimizer will be changed.
[2024-05-15 15:46:56,542] [ INFO] - [timelog] checkpoint loading time: 0.00s (2024-05-15 15:46:56)
[2024-05-15 15:46:56,543] [ INFO] - Running training
[2024-05-15 15:46:56,543] [ INFO] - Num examples = 12,816,085
[2024-05-15 15:46:56,543] [ INFO] - Num Epochs = 1
[2024-05-15 15:46:56,543] [ INFO] - Instantaneous batch size per device = 1
[2024-05-15 15:46:56,543] [ INFO] - Total train batch size (w. parallel, distributed & accumulation) = 128
[2024-05-15 15:46:56,543] [ INFO] - Gradient Accumulation steps = 32
[2024-05-15 15:46:56,543] [ INFO] - Total optimization steps = 100,000
[2024-05-15 15:46:56,543] [ INFO] - Total num train samples = 12,800,000
[2024-05-15 15:46:56,545] [ DEBUG] - Number of trainable parameters = 6,507,934,720 (per device)
[2024-05-15 15:46:56,563] [ DEBUG] - Number of trainable parameters = 13,015,863,296 (all devices, roughly)
/root/miniconda3/envs/paddle/lib/python3.9/site-packages/paddle/amp/auto_cast.py:502: UserWarning: XPUPlace only support float16 amp.
warnings.warn('XPUPlace only support float16 amp.')
Traceback (most recent call last):
File "/workspace/PaddleNLP/llm/llama/run_pretrain.py", line 630, in
Software environment
Duplicate issue check
Error description
Stable reproduction steps & code
The error comes from these two lines: since masked_lm_loss.numel() == 0, calling paddle.mean on it raises the error above. The loss being 0 is most likely because the softmax produced a one-hot tensor, with the value at the target label's position equal to 1 and every other position equal to 0. When the exponent fed to exp is very negative (below -1000), the result underflows to exactly 0.
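A quick numerical check of that underflow claim (plain NumPy, purely for illustration): once the gap between the target logit and the others exceeds the float underflow range, the softmax is an exact one-hot and the cross-entropy at the target is exactly 0.

```python
import numpy as np

# With a large enough logit gap, exp() of the non-target entries underflows
# to exactly 0.0, softmax becomes an exact one-hot, and -log(p_target) is
# an exact zero -- i.e. a per-token loss of float(0).
logits = np.array([0.0, -1200.0, -1500.0])   # index 0 is the target label
shifted = logits - logits.max()              # standard max-subtraction trick
probs = np.exp(shifted) / np.exp(shifted).sum()
print(probs)              # [1. 0. 0.]
print(-np.log(probs[0]))  # -0.0, an exact zero loss
```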
References: