PaddlePaddle / PaddleDetection

Object Detection toolkit based on PaddlePaddle. It supports object detection, instance segmentation, multiple object tracking and real-time multi-person keypoint detection.
Apache License 2.0

Training fails with "Cudnn error, CUDNN_STATUS_BAD_PARAM" #1964

Closed xusen-7 closed 3 years ago

xusen-7 commented 3 years ago

My current environment is paddlepaddle-gpu = 1.8.4, python = 3.7.9, CUDA = 10.0, cudnn = 7.4.2. I ran this command:

python tools/train.py -c configs/ppyolo/ppyolo_custm.yml

The ppyolo_custm.yml file is as follows:

architecture: YOLOv3
use_gpu: true
max_iters: 10000
log_smooth_window: 20
log_iter: 20
save_dir: output
snapshot_iter: 500
metric: VOC
pretrain_weights: https://paddle-imagenet-models-name.bj.bcebos.com/ResNet50_vd_ssld_pretrained.tar
weights: output/ppyolo/model_final
num_classes: 5
use_fine_grained_loss: true
use_ema: true
ema_decay: 0.9998

YOLOv3:
  backbone: ResNet
  yolo_head: YOLOv3Head
  use_fine_grained_loss: true

ResNet:
  norm_type: sync_bn
  freeze_at: 0
  freeze_norm: false
  norm_decay: 0.
  depth: 50
  feature_maps: [3, 4, 5]
  variant: d
  dcn_v2_stages: [5]

YOLOv3Head:
  anchor_masks: [[6, 7, 8], [3, 4, 5], [0, 1, 2]]
  anchors: [[10, 13], [16, 30], [33, 23], [30, 61], [62, 45], [59, 119], [116, 90], [156, 198], [373, 326]]
  norm_decay: 0.
  coord_conv: true
  iou_aware: true
  iou_aware_factor: 0.4
  scale_x_y: 1.05
  spp: true
  yolo_loss: YOLOv3Loss
  nms: MatrixNMS
  drop_block: true

YOLOv3Loss:
  ignore_thresh: 0.7
  scale_x_y: 1.05
  label_smooth: true
  use_fine_grained_loss: true
  iou_loss: IouLoss
  iou_aware_loss: IouAwareLoss

IouLoss:
  loss_weight: 2.5
  max_height: 608
  max_width: 608

IouAwareLoss:
  loss_weight: 1.0
  max_height: 608
  max_width: 608

MatrixNMS:
  background_label: -1
  keep_top_k: 100
  normalized: false
  score_threshold: 0.01
  post_threshold: 0.01

LearningRate:
  base_lr: 0.00333
  schedulers:
  - !PiecewiseDecay
    gamma: 0.1
    milestones:
    - 56000
    - 62000
  - !LinearWarmup
    start_factor: 0.
    steps: 4000

OptimizerBuilder:
  optimizer:
    momentum: 0.9
    type: Momentum
  regularizer:
    factor: 0.0005
    type: L2

_READER_: 'ppyolo_reader.yml'
TrainReader:
  dataset:
    !VOCDataSet
    dataset_dir: /DL/PaddleDetection-release-2.0-beta/voc
    anno_path: train.txt
    use_default_label: false
    with_background: false
  mixup_epoch: 6500
  batch_size: 16

EvalReader:
  inputs_def:
    fields: ['image', 'im_size', 'im_id', 'gt_bbox', 'gt_class', 'is_difficult']
    num_max_boxes: 50
  dataset:
    !VOCDataSet
    dataset_dir: /DL/PaddleDetection-release-2.0-beta/voc
    anno_path: valid.txt
    use_default_label: false
    with_background: false

TestReader:
  dataset:
    !ImageFolder
    use_default_label: false
    with_background: false

The output after that was as follows (a bunch of XML file paths were printed before this output):

/ML/yew/envs/paddle/lib/python3.7/site-packages/paddle/fluid/executor.py:1070: UserWarning: The following exception is not an EOF exception.
  "The following exception is not an EOF exception.")
Traceback (most recent call last):
  File "tools/train.py", line 399, in <module>
    main()
  File "tools/train.py", line 270, in main
    outs = exe.run(compiled_train_prog, fetch_list=train_values)
  File "/ML/yew/envs/paddle/lib/python3.7/site-packages/paddle/fluid/executor.py", line 1071, in run
    six.reraise(*sys.exc_info())
  File "/ML/yew/envs/paddle/lib/python3.7/site-packages/six.py", line 703, in reraise
    raise value
  File "/ML/yew/envs/paddle/lib/python3.7/site-packages/paddle/fluid/executor.py", line 1066, in run
    return_merged=return_merged)
  File "/ML/yew/envs/paddle/lib/python3.7/site-packages/paddle/fluid/executor.py", line 1167, in _run_impl
    return_merged=return_merged)
  File "/ML/yew/envs/paddle/lib/python3.7/site-packages/paddle/fluid/executor.py", line 879, in _run_parallel
    tensors = exe.run(fetch_var_names, return_merged)._move_to_list()
paddle.fluid.core_avx.EnforceNotMet:


C++ Call Stacks (More useful to developers):

0   std::string paddle::platform::GetTraceBackString<char const>(char const&&, char const, int)
1   paddle::platform::EnforceNotMet::EnforceNotMet(std::__exception_ptr::exception_ptr, char const, int)
2   paddle::operators::BatchNormKernel<paddle::platform::CUDADeviceContext, float>::Compute(paddle::framework::ExecutionContext const&) const
3   std::_Function_handler<void (paddle::framework::ExecutionContext const&), paddle::framework::OpKernelRegistrarFunctor<paddle::platform::CUDAPlace, false, 0ul, paddle::operators::BatchNormKernel<paddle::platform::CUDADeviceContext, float>, paddle::operators::BatchNormKernel<paddle::platform::CUDADeviceContext, double>, paddle::operators::BatchNormKernel<paddle::platform::CUDADeviceContext, paddle::platform::float16> >::operator()(char const, char const, int) const::{lambda(paddle::framework::ExecutionContext const&)#1}>::_M_invoke(std::_Any_data const&, paddle::framework::ExecutionContext const&)
4   paddle::framework::OperatorWithKernel::RunImpl(paddle::framework::Scope const&, paddle::platform::Place const&, paddle::framework::RuntimeContext) const
5   paddle::framework::OperatorWithKernel::RunImpl(paddle::framework::Scope const&, paddle::platform::Place const&) const
6   paddle::framework::OperatorBase::Run(paddle::framework::Scope const&, paddle::platform::Place const&)
7   paddle::framework::details::ComputationOpHandle::RunImpl()
8   paddle::framework::details::FastThreadedSSAGraphExecutor::RunOpSync(paddle::framework::details::OpHandleBase)
9   paddle::framework::details::FastThreadedSSAGraphExecutor::RunOp(paddle::framework::details::OpHandleBase, std::shared_ptr<paddle::framework::BlockingQueue > const&, unsigned long)
10  std::_Function_handler<std::unique_ptr<std::future_base::_Result_base, std::future_base::_Result_base::_Deleter> (), std::future_base::_Task_setter<std::unique_ptr<std::future_base::_Result, std::future_base::_Result_base::_Deleter>, void> >::_M_invoke(std::_Any_data const&)
11  std::__future_base::_State_base::_M_do_set(std::function<std::unique_ptr<std::future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>&, bool&)
12  ThreadPool::ThreadPool(unsigned long)::{lambda()#1}::operator()() const


Python Call Stacks (More useful to users):

  File "/ML/yew/envs/paddle/lib/python3.7/site-packages/paddle/fluid/framework.py", line 2610, in append_op
    attrs=kwargs.get("attrs", None))
  File "/ML/yew/envs/paddle/lib/python3.7/site-packages/paddle/fluid/layer_helper.py", line 43, in append_op
    return self.main_program.current_block().append_op(*args, **kwargs)
  File "/ML/yew/envs/paddle/lib/python3.7/site-packages/paddle/fluid/layers/nn.py", line 4207, in batch_norm
    type="batch_norm", inputs=inputs, outputs=outputs, attrs=attrs)
  File "/DL/PaddleDetection-release-2.0-beta/ppdet/modeling/backbones/resnet.py", line 238, in _conv_norm
    use_global_stats=global_stats)
  File "/DL/PaddleDetection-release-2.0-beta/ppdet/modeling/backbones/resnet.py", line 452, in c1_stage
    name=_name)
  File "/DL/PaddleDetection-release-2.0-beta/ppdet/modeling/backbones/resnet.py", line 473, in __call__
    res = self.c1_stage(res)
  File "/DL/PaddleDetection-release-2.0-beta/ppdet/modeling/architectures/yolo.py", line 61, in build
    body_feats = self.backbone(im)
  File "/DL/PaddleDetection-release-2.0-beta/ppdet/modeling/architectures/yolo.py", line 159, in train
    return self.build(feed_vars, mode='train')
  File "tools/train.py", line 136, in main
    train_fetches = model.train(feed_vars)
  File "tools/train.py", line 399, in <module>
    main()


Error Message Summary:

ExternalError: Cudnn error, CUDNN_STATUS_BAD_PARAM at (/paddle/paddle/fluid/operators/batch_norm_op.cu:319) [operator < batch_norm > error]

After this training run failed, I went on to run my darknet training command and got the following error:

cuDNN Error: CUDNN_STATUS_EXECUTION_FAILED

Before this, darknet ran fine. Somebody please help me out, Orz

xusen-7 commented 3 years ago

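OK, the problem is solved. I switched cudnn to 7.6 and everything works. I still don't know why darknet, which had been running fine, was affected as well, but with this fixed darknet is back to normal too.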
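Since the fix here was a cuDNN upgrade from 7.4.2 to 7.6, it can help to check which cuDNN build the system loader actually picks up before starting a training run. The following is a minimal sketch, not part of PaddleDetection: `decode_cudnn_version` and `loaded_cudnn_version` are hypothetical helper names, and it assumes the cuDNN 7.x/8.x integer encoding (major * 1000 + minor * 100 + patch) returned by the cuDNN function `cudnnGetVersion()`. A framework may load cuDNN from a different path than the one found here, so treat the result as a hint only.

```python
import ctypes
import ctypes.util


def decode_cudnn_version(raw):
    """Split the integer from cudnnGetVersion() into (major, minor, patch).

    Assumption: cuDNN 7.x/8.x encoding, major * 1000 + minor * 100 + patch.
    """
    major, rest = divmod(raw, 1000)
    minor, patch = divmod(rest, 100)
    return major, minor, patch


def loaded_cudnn_version(lib_name="cudnn"):
    """Return (major, minor, patch) of the cuDNN the loader finds, or None."""
    path = ctypes.util.find_library(lib_name)
    if path is None:
        return None
    try:
        lib = ctypes.CDLL(path)
    except OSError:
        return None
    # cudnnGetVersion() takes no arguments and returns a size_t.
    lib.cudnnGetVersion.restype = ctypes.c_size_t
    return decode_cudnn_version(lib.cudnnGetVersion())


if __name__ == "__main__":
    version = loaded_cudnn_version()
    if version is None:
        print("cuDNN shared library not found on the default search path")
    else:
        print("cuDNN %d.%d.%d" % version)
        if version < (7, 6, 0):
            print("older than the 7.6 release that resolved this issue")
```

Running the script in the same environment used for `tools/train.py` prints the detected version, or a notice if no cuDNN library is found on the default search path.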