PaddlePaddle / Paddle

PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (the PaddlePaddle core framework: high-performance single-machine and distributed training and cross-platform deployment for deep learning and machine learning)
http://www.paddlepaddle.org/
Apache License 2.0

Single-machine multi-GPU training fails [operator < coalesce_tensor > error] #66282

Open xieshuaix opened 1 month ago

xieshuaix commented 1 month ago

Background / environment: machine: a physical server with 8 P40 GPUs. Docker: paddlecloud image iregistry.baidu-int.com/paddlecloud/base-images:paddlecloud-ubuntu20.04-gcc8.2-cuda11.8-cudnn8.9-openmpi4.1.5-codelab1.6.1.5-bccl2.15.5.4-hadoop2.2.4.2-afsshell1.9.3.4095. paddlepaddle-gpu: 2.6.1

paddle.utils.run_check() passes on both single GPU and multiple GPUs. Training setup: fleet + static graph. Single-GPU training works without problems.

Problem: training with two or more GPUs fails. Error log: env {'SHELL': '/bin/bash', 'NV_LIBCUBLAS_VERSION': '11.11.3.6-1', 'NVIDIA_VISIBLE_DEVICES': 'all', 'NV_NVML_DEV_VERSION': '11.8.86-1', 'NV_CUDNN_PACKAGE_NAME': 'libcudnn8', 'NV_LIBNCCL_DEV_PACKAGE': 'libnccl-dev=2.16.2-1+cuda11.8', 'NV_LIBNCCL_DEV_PACKAGE_VERSION': '2.16.2-1', 'HOSTNAME': 'XXXXXX', 'LANGUAGE': 'en_US.UTF-8', 'NVIDIA_REQUIRE_CUDA': 'cuda>=11.8 brand=tesla,driver>=450,driver<451 brand=tesla,driver>=470,driver<471 brand=unknown,driver>=470,driver<471 brand=nvidia,driver>=470,driver<471 brand=nvidiartx,driver>=470,driver<471 brand=geforce,driver>=470,driver<471 brand=geforcertx,driver>=470,driver<471 brand=quadro,driver>=470,driver<471 brand=quadrortx,driver>=470,driver<471 brand=titan,driver>=470,driver<471 brand=titanrtx,driver>=470,driver<471 brand=tesla,driver>=510,driver<511 brand=unknown,driver>=510,driver<511 brand=nvidia,driver>=510,driver<511 brand=nvidiartx,driver>=510,driver<511 brand=geforce,driver>=510,driver<511 brand=geforcertx,driver>=510,driver<511 brand=quadro,driver>=510,driver<511 brand=quadrortx,driver>=510,driver<511 brand=titan,driver>=510,driver<511 brand=titanrtx,driver>=510,driver<511 brand=tesla,driver>=515,driver<516 brand=unknown,driver>=515,driver<516 brand=nvidia,driver>=515,driver<516 brand=nvidiartx,driver>=515,driver<516 brand=geforce,driver>=515,driver<516 brand=geforcertx,driver>=515,driver<516 brand=quadro,driver>=515,driver<516 brand=quadrortx,driver>=515,driver<516 brand=titan,driver>=515,driver<516 brand=titanrtx,driver>=515,driver<516', 'NV_LIBCUBLAS_DEV_PACKAGE': 'libcublas-dev-11-8=11.11.3.6-1', 'NV_NVTX_VERSION': '11.8.86-1', 'NV_CUDA_CUDART_DEV_VERSION': '11.8.89-1', 'NV_LIBCUSPARSE_VERSION': '11.7.5.86-1', 'NV_LIBNPP_VERSION': '11.8.0.86-1', 'NCCL_VERSION': '2.16.2-1', 'PWD': '/root/work/baidu/map-navi-rec/travel-recommend', 'NV_CUDNN_PACKAGE': 'libcudnn8=8.9.0.131-1+cuda11.8', 'NVIDIA_DRIVER_CAPABILITIES': 'compute,utility', 'WITH_AVX': 'ON', 'NV_NVPROF_DEV_PACKAGE': 
'cuda-nvprof-11-8=11.8.87-1', 'NV_LIBNPP_PACKAGE': 'libnpp-11-8=11.8.0.86-1', 'NV_LIBNCCL_DEV_PACKAGE_NAME': 'libnccl-dev', 'TZ': 'Asia/Shanghai', 'NV_LIBCUBLAS_DEV_VERSION': '11.11.3.6-1', 'BASH': '/bin/sh', 'NVIDIA_PRODUCT_NAME': 'CUDA', 'NV_LIBCUBLAS_DEV_PACKAGE_NAME': 'libcublas-dev-11-8', 'NV_CUDA_CUDART_VERSION': '11.8.89-1', 'HOME': '/root', 'LANG': 'en_US.UTF-8', 'CUDA_VERSION': '11.8.0', 'NV_LIBCUBLAS_PACKAGE': 'libcublas-11-8=11.11.3.6-1', 'NVIDIA_TOOLS': '/home/opt/cuda_tools', 'NV_CUDA_NSIGHT_COMPUTE_DEV_PACKAGE': 'cuda-nsight-compute-11-8=11.8.0-1', 'NV_LIBNPP_DEV_PACKAGE': 'libnpp-dev-11-8=11.8.0.86-1', 'GOROOT': '/usr/local/go', 'NV_LIBCUBLAS_PACKAGE_NAME': 'libcublas-11-8', 'NV_LIBNPP_DEV_VERSION': '11.8.0.86-1', 'OPENMPI_HOME': '/usr/local/openmpi-4.1.5', 'WITH_GPU': 'ON', 'TERM': 'xterm', 'NV_LIBCUSPARSE_DEV_VERSION': '11.7.5.86-1', 'HADOOP_HOME': '/root/paddlejob/hadoop-client/hadoop', 'LIBRARY_PATH': '/usr/local/cuda/lib64/stubs', 'NV_CUDNN_VERSION': '8.9.0.131', 'SHLVL': '2', 'HOME_WORK_DIR': '/root/paddlejob', 'NV_CUDA_LIB_VERSION': '11.8.0-1', 'NVARCH': 'x86_64', 'CUDNN_VERSION': '8.6.0', 'NV_CUDNN_PACKAGE_DEV': 'libcudnn8-dev=8.9.0.131-1+cuda11.8', 'NV_CUDA_COMPAT_PACKAGE': 'cuda-compat-11-8', 'NV_LIBNCCL_PACKAGE': 'libnccl2=2.16.2-1+cuda11.8', 'LD_LIBRARY_PATH': '/usr/local/lib:/usr/local/openmpi-4.1.5/lib:/usr/local/cuda/lib64:/usr/local/cuda/targets/x86_64-linux/lib/:/usr/lib/x86_64-linux-gnu/:/usr/lib/:/usr/lib64:/usr/local/cuda-11.8/targets/x86_64-linux/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64', 'NV_CUDA_NSIGHT_COMPUTE_VERSION': '11.8.0-1', 'NV_NVPROF_VERSION': '11.8.87-1', 'LC_ALL': 'en_US.UTF-8', 'PATH': 
'/root/work/tools/ripgrep-13.0.0-x86_64-unknown-linux-musl/:/root/work/tools/ripgrep-13.0.0-x86_64-unknown-linux-musl/:/usr/local/bin:/usr/local/openmpi-4.1.5/bin:/home/cmake-3.16.0-Linux-x86_64/bin:/home/opt/cuda_tools:/bin:/bin:/root/paddlejob/hadoop-client/hadoop/bin:/usr/local/gcc-8.2/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/sbin:/usr/bin:/sbin:/bin', 'NV_LIBNCCL_PACKAGE_NAME': 'libnccl2', 'NV_LIBNCCL_PACKAGE_VERSION': '2.16.2-1', 'DEBIANFRONTEND': 'noninteractive', 'OLDPWD': '/root/work', 'GOPATH': '/root/gopath', '': '/usr/local/bin/python', 'CUSTOM_DEVICE_ROOT': '', 'OMP_NUM_THREADS': '1', 'POD_NAME': 'bapotg', 'PADDLE_MASTER': '10.255.75.25:42010', 'PADDLE_GLOBAL_SIZE': '2', 'PADDLE_LOCAL_SIZE': '2', 'PADDLE_GLOBAL_RANK': '0', 'PADDLE_LOCAL_RANK': '0', 'PADDLE_NNODES': '1', 'PADDLE_CURRENT_ENDPOINT': '10.255.75.25:42011', 'PADDLE_TRAINER_ID': '0', 'PADDLE_TRAINERS_NUM': '2', 'PADDLE_RANK_IN_NODE': '0', 'PADDLE_TRAINER_ENDPOINTS': '10.255.75.25:42011,10.255.75.25:42012', 'FLAGS_selected_gpus': '0', 'PADDLE_LOG_DIR': '/root/work/baidu/map-navi-rec/travel-recommend/log'} LAUNCH INFO 2024-07-19 17:18:34,227 ------------------------- ERROR LOG DETAIL ------------------------- exe.run( ValueError: In user code:

File "/root/work/baidu/map-navi-rec/travel-recommend/paddle_infp/train_v3.py", line 360, in <module>
  main(args)
File "/root/work/baidu/map-navi-rec/travel-recommend/paddle_infp/train_v3.py", line 349, in main
  train(conf, args.dataset_dir, args.dataset_type, args.out_dir, args.log_dir, resume=args.resume)
File "/root/work/baidu/map-navi-rec/travel-recommend/paddle_infp/train_v3.py", line 122, in train
  optimizer.minimize(avg_cost)
File "/usr/local/lib/python3.8/dist-packages/paddle/distributed/fleet/fleet.py", line 1551, in minimize
  return self._minimize_impl(
File "/usr/local/lib/python3.8/dist-packages/paddle/distributed/fleet/fleet.py", line 1786, in _minimize_impl
  optimize_ops, params_grads = meta_optimizer.minimize(
File "/usr/local/lib/python3.8/dist-packages/paddle/distributed/fleet/meta_optimizers/meta_optimizer_base.py", line 103, in minimize
  optimize_ops, params_grads = self.minimize_impl(
File "/usr/local/lib/python3.8/dist-packages/paddle/distributed/fleet/meta_optimizers/raw_program_optimizer.py", line 185, in minimize_impl
  self._transpile_main_program(loss)
File "/usr/local/lib/python3.8/dist-packages/paddle/distributed/fleet/meta_optimizers/raw_program_optimizer.py", line 290, in _transpile_main_program
  self._allreduce_fusion_program()
File "/usr/local/lib/python3.8/dist-packages/paddle/distributed/fleet/meta_optimizers/raw_program_optimizer.py", line 503, in _allreduce_fusion_program
  block._insert_op_without_sync(
File "/usr/local/lib/python3.8/dist-packages/paddle/base/framework.py", line 4507, in _insert_op_without_sync
  op = Operator(block=self, desc=op_desc, *args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/paddle/base/framework.py", line 3016, in __init__
  for frame in traceback.extract_stack():

InvalidArgumentError: The start row index must be less than the end row index.But received the start index = 0, the end index = 0.
  [Hint: Expected begin_idx < end_idx, but received begin_idx:0 >= end_idx:0.] (at /paddle/paddle/phi/core/dense_tensor_impl.cc:309)
  [operator < coalesce_tensor > error]

C++ Traceback (most recent call last):

0  paddle::framework::ScopePool::Clear()
1  paddle::framework::ScopePool::DeleteScope(paddle::framework::Scope*)
2  paddle::framework::Scope::~Scope()
3  paddle::framework::Scope::DropKids()
4  paddle::framework::Scope::~Scope()
5  paddle::framework::Variable::PlaceholderImpl::~PlaceholderImpl()


Error Message Summary:

FatalError: Segmentation fault is detected by the operating system. [TimeInfo: Aborted at 1721380713 (unix time) try "date -d @1721380713" if you are using GNU date ] [SignalInfo: SIGSEGV (@0xa) received by PID 12438 (TID 0x7f966677c740) from PID 10 ]

LAUNCH INFO 2024-07-19 17:18:36,845 Exit code -11
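For context on the `InvalidArgumentError` above, here is a pure-Python illustration (a sketch only, not Paddle's actual implementation): the fused-allreduce pass slices the coalesced gradient buffer by row range, and a fusion segment that ends up empty produces `begin_idx == end_idx == 0`, which the row-slice check rejects:

```python
# Illustrative sketch only (not Paddle's code): mimics the row-slice
# invariant from dense_tensor_impl.cc that the log above reports.
def slice_rows(buffer, begin_idx, end_idx):
    # Paddle requires a non-empty slice: begin_idx < end_idx.
    if not begin_idx < end_idx:
        raise ValueError(
            f"Expected begin_idx < end_idx, but received "
            f"begin_idx:{begin_idx} >= end_idx:{end_idx}."
        )
    return buffer[begin_idx:end_idx]

# A normal segment slices fine:
assert slice_rows(list(range(8)), 0, 4) == [0, 1, 2, 3]

# An empty fusion segment (no gradients assigned to it) gives
# begin_idx == end_idx == 0, matching the report:
try:
    slice_rows(list(range(8)), 0, 0)
except ValueError as err:
    print(err)
```

If that reading is right, the interesting question is why `_allreduce_fusion_program` produced an empty segment for this model.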

SigureMo commented 1 month ago

What exactly is the model? Do you have a small reproducible example?

xieshuaix commented 1 month ago

> What exactly is the model? Do you have a small reproducible example?

Model: a 3-layer MLP

combined_features = paddle.concat(x=feature, axis=-1)
fc1 = paddle.static.nn.fc(x=combined_features, size=self.fc_dim, activation='relu')
fc2 = paddle.static.nn.fc(x=fc1, size=self.fc_dim, activation='relu')
fc3 = paddle.static.nn.fc(x=fc2, size=self.fc_dim, activation='relu')
predict = paddle.static.nn.fc(x=fc3, size=5)
avg_cost = paddle.nn.functional.cross_entropy(input=predict, label=label, reduction='mean')

Training code:

device = paddle.set_device("gpu")
place = paddle.CUDAPlace(0)
exe = paddle.static.Executor(device)
build_strategy = paddle.static.BuildStrategy()
build_strategy.enable_inplace = True
# build_strategy.memory_optimize = True
exec_strategy = paddle.static.ExecutionStrategy()
# distributed settings
dist_strategy = fleet.DistributedStrategy()
dist_strategy.build_strategy = build_strategy
dist_strategy.execution_strategy = exec_strategy
fleet.init(is_collective=True, strategy=dist_strategy)

optimizer = paddle.optimizer.Adam(learning_rate=train_conf['lr'],
        weight_decay=train_conf.get("weight_decay"))
optimizer = fleet.distributed_optimizer(optimizer)
optimizer.minimize(avg_cost)

main_program = paddle.static.default_main_program()
start_program = paddle.static.default_startup_program()
exe.run(start_program)
compiled_train_prog = paddle.static.CompiledProgram(main_program, build_strategy=build_strategy)

for batch in data_loader:
    loss_value, acc_value = exe.run(
        program=compiled_train_prog,
        feed=batch,
        fetch_list=[avg_cost, batch_acc]
    )

Launch command:

python -m paddle.distributed.launch --gpus=0,1 ...
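Since the traceback points at `_allreduce_fusion_program`, one workaround worth trying (hedged: `fuse_all_reduce_ops` is a documented `fleet.DistributedStrategy` switch, but it has not been verified against this exact setup) is to disable gradient fusion so the `coalesce_tensor` op is never inserted:

```python
# Hedged workaround sketch (untested here): fuse_all_reduce_ops defaults to
# True; turning it off should keep the fusion pass from inserting
# coalesce_tensor, at the cost of one allreduce per gradient.
from paddle.distributed import fleet

dist_strategy = fleet.DistributedStrategy()
dist_strategy.fuse_all_reduce_ops = False
fleet.init(is_collective=True, strategy=dist_strategy)
```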
zhiqiu commented 1 month ago

For multi-GPU training we recommend the new auto-parallel approach, which is much simpler to write. See: https://www.paddlepaddle.org.cn/documentation/docs/zh/guides/paddle_v3_features/auto_parallel_cn.html

xieshuaix commented 1 month ago

> For multi-GPU training we recommend the new auto-parallel approach, which is much simpler to write. See: https://www.paddlepaddle.org.cn/documentation/docs/zh/guides/paddle_v3_features/auto_parallel_cn.html

I'll look into it, but please help investigate this issue first. Most of our code can't be upgraded to 3.0 for now, and I don't know whether upgrading to 3.x would fix the problem.