PaddlePaddle / PaddleX

Low-code development tool based on PaddlePaddle
Apache License 2.0
4.76k stars, 936 forks

Training PaddleDetection PP-YOLOE_plus-S on Ascend 910B fails #2005

Open 1737686924 opened 2 days ago

1737686924 commented 2 days ago

Checklist:

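  1. Searched existing related issues for an answer
  2. Read the FAQ summary and Q&A
  3. Confirmed that the bug is still unfixed in the latest version
  4. Read the PaddleX API documentation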

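Problem description

PaddleX can validate a dataset to ensure that its format meets PaddleX requirements. During validation it can also analyze the dataset and collect basic statistics about it.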

python main.py -c paddlex/configs/object_detection/PP-YOLOE_plus-S.yaml \
    -o Global.mode=check_dataset \
    -o Global.dataset_dir=./dataset/det_coco_examples

The dataset check succeeds.

Reproduction

  1. Have you successfully run the tutorial we provide?

  2. Did you modify the tutorial code? If so, please provide the code you ran:

    python main.py -c paddlex/configs/object_detection/PP-YOLOE_plus-S.yaml \
        -o Global.mode=train \
        -o Global.dataset_dir=./dataset/det_coco_examples \
        -o Global.output=ppyolo_plus_s_output \
        -o Global.device="npu:0,1,2,3"

  3. Which dataset are you using?

  4. Please provide the error message and related logs:

    ======================= Modified FLAGS detected =======================
    FLAGS(name='FLAGS_use_stride_kernel', current_value=False, default_value=True)

    I0918 23:46:05.051712 973133 tcp_utils.cc:130] Successfully connected to 127.0.0.1:52457
    loading annotations into memory...
    Done (t=0.00s)
    creating index...
    index created!
    W0918 23:46:24.526521 973133 dygraph_functions.cc:83150] got different data type, run type promotion automatically, this may cause data type been changed.


C++ Traceback (most recent call last):

0 egr::Backward(std::vector<paddle::Tensor, std::allocator > const&, std::vector<paddle::Tensor, std::allocator > const&, bool)
1 egr::RunBackward(std::vector<paddle::Tensor, std::allocator > const&, std::vector<paddle::Tensor, std::allocator > const&, bool, bool, std::vector<paddle::Tensor, std::allocator > const&, bool, std::vector<paddle::Tensor, std::allocator > const&)
2 Conv2dGradNodeFinal::operator()(paddle::small_vector<std::vector<paddle::Tensor, std::allocator >, 15u>&, bool, bool)
3 paddle::experimental::conv2d_grad(paddle::Tensor const&, paddle::Tensor const&, paddle::Tensor const&, std::vector<int, std::allocator > const&, std::vector<int, std::allocator > const&, std::string const&, std::vector<int, std::allocator > const&, int, std::string const&, paddle::Tensor, paddle::Tensor)
4 void custom_kernel::Conv2DGradKernel<float, phi::CustomContext>(phi::CustomContext const&, phi::DenseTensor const&, phi::DenseTensor const&, phi::DenseTensor const&, std::vector<int, std::allocator > const&, std::vector<int, std::allocator > const&, std::string const&, std::vector<int, std::allocator > const&, int, std::string const&, phi::DenseTensor, phi::DenseTensor)
5 aclnnConvolutionBackward
6 InitL2Phase2Context(char const, aclOpExecutor)
7 GetOpExecCacheFromExecutor(aclOpExecutor*)


Error Message Summary:

FatalError: Segmentation fault is detected by the operating system.
[TimeInfo: Aborted at 1726674458 (unix time) try "date -d @1726674458" if you are using GNU date ]
LAUNCH INFO 2024-09-18 23:47:48,695 Exit code -11
[SignalInfo: SIGSEGV (@0xed94d) received by PID 973133 (TID 0xffffa00bae90) from PID 973133 ]

Traceback (most recent call last):
  File "/work/workspace/PaddleX/paddlex/utils/result_saver.py", line 30, in wrap
    result = func(self, *args, kwargs)
  File "/work/workspace/PaddleX/paddlex/engine.py", line 42, in run
    trainer.train()
  File "/work/workspace/PaddleX/paddlex/modules/base/trainer/trainer.py", line 61, in train
    train_result = self.pdx_model.train(self.get_train_kwargs())
  File "/work/workspace/PaddleX/paddlex/repo_apis/PaddleDetection_api/object_det/model.py", line 109, in train
    return self.runner.train(
  File "/work/workspace/PaddleX/paddlex/repo_apis/PaddleDetection_api/object_det/runner.py", line 54, in train
    return self.run_cmd(
  File "/work/workspace/PaddleX/paddlex/repo_apis/base/runner.py", line 359, in run_cmd
    raise CalledProcessError(
paddlex.utils.errors.others.CalledProcessError: Command ['/usr/bin/python', '-m', 'paddle.distributed.launch', '--devices', '0,1,2,3', '--log_dir', '/work/workspace/PaddleX/ppyolo_plus_s_output/distributed_train_logs', 'tools/train.py', '--eval', '--config', '/root/.paddlex/tmp99soy5_c/detmodel_PP-YOLOE_plus-S.yml', '--use_vdl', 'True', '--vdl_log_dir', '/work/workspace/PaddleX/ppyolo_plus_s_output'] returned non-zero exit status 245.
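The log reports both "Exit code -11" and "exit status 245". These appear to be the same event: an exit status is truncated to 8 bits, so -11 shows up as 245, and 11 is SIGSEGV on Linux. The sketch below decodes both values and converts the abort timestamp from the log. The helper names `signal_from_status` and `abort_time_utc` are hypothetical and not part of PaddleX or Paddle, and the wrap-around interpretation is an assumption drawn from the two numbers in the log.

```python
import signal
from datetime import datetime, timezone
from typing import Optional


def signal_from_status(status: int) -> Optional[str]:
    """Return the signal name if `status` looks like a negative signal code.

    Assumption: the launcher exited with -N after its worker died from
    signal N, and the OS truncated that value to 8 bits (so -11 -> 245).
    """
    # Undo the 8-bit wrap-around: values above 127 are read as negative.
    signed = status - 256 if status > 127 else status
    if signed >= 0:
        return None
    try:
        return signal.Signals(-signed).name
    except ValueError:
        # Not a signal number known on this platform.
        return None


def abort_time_utc(unix_seconds: int) -> datetime:
    """Convert the 'Aborted at ... (unix time)' value to an aware UTC datetime."""
    return datetime.fromtimestamp(unix_seconds, tz=timezone.utc)


if __name__ == "__main__":
    print(signal_from_status(245))   # status reported by CalledProcessError
    print(signal_from_status(-11))   # code reported by LAUNCH INFO
    print(abort_time_utc(1726674458).isoformat())
```

Both codes pointing at a segmentation fault matches the C++ traceback, which ends inside `aclnnConvolutionBackward`, so the failure seems to occur in the NPU convolution backward kernel rather than in the Python training code.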

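Environment

  1. Please provide the versions of PaddlePaddle and PaddleX you are using: 3.0-beta

  2. Please provide your operating system, e.g. Linux/Windows/MacOS

  3. Which Python version are you using?

  4. Which CUDA/cuDNN versions are you using?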
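Most of the environment questions above are unanswered. A minimal sketch for collecting them with the Python standard library follows. The function name `collect_env` is hypothetical, and the distribution names `paddlepaddle` and `paddlex` are assumptions about how the packages were installed; on an Ascend machine the NPU plugin and CANN versions would need to be checked separately.

```python
import platform
import sys
from importlib import metadata


def collect_env(packages=("paddlepaddle", "paddlex")):
    """Gather OS, Python, and installed package versions for a bug report.

    The default package names are assumed distribution names, not verified
    against the reporter's environment.
    """
    info = {
        "os": platform.platform(),
        "machine": platform.machine(),
        "python": sys.version.split()[0],
    }
    for name in packages:
        try:
            info[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            info[name] = "not installed"
    return info


if __name__ == "__main__":
    for key, value in collect_env().items():
        print(f"{key}: {value}")
```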

a31413510 commented 12 hours ago

Are the image and the paddle package you are using the ones provided in the documentation? https://github.com/PaddlePaddle/PaddleX/blob/release/3.0-beta/docs/tutorials/INSTALL_OTHER_DEVICES.md