PaddlePaddle / PaddleDetection

Object Detection toolkit based on PaddlePaddle. It supports object detection, instance segmentation, multiple object tracking and real-time multi-person keypoint detection.
Apache License 2.0

SENet154-vd-FPN Cascade Mask: error while reading data, help wanted #856

Closed KK-Jiang closed 3 years ago

KK-Jiang commented 4 years ago

Hello experts: training on AI Studio fails with [operator < read > error]: Blocking queue is killed because the data reader raises an exception.

Version and environment: 1) PaddlePaddle version: 1.8.0 2) System/GPU: AI Studio, V100 3) PaddleDetection 0.3

Training information: 1) single GPU 2) 16 GB 3) the error is [operator < read > error]

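Reproduction: I used the official PaddleDetection-release-0.3 with the config file cascade_mask_rcnn_dcnv2_se154_vd_fpn_gn_s1x.yml, changed only basic settings (class_num, batch_size, data paths, lr schedule), and started training directly. It fails immediately. I also tried training on my Windows machine; after finally getting the environment installed, I got the same error.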

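Problem description: the error log is below. None of the following attempts fixed the problem: gradually reducing worker_num (it still fails for any non-zero value; with 0, it prints normal logs and then hangs, even when left running for more than ten hours); reducing the capacity of DataLoader.from_generator; setting use_double_buffer to False; setting iterable to True (default False). None of these helped: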

The full error is as follows:

2020-05-26 11:24:35,963-INFO: places would be ommited when DataLoader is not iterable
2020-05-26 11:24:39,118-WARNING: recv endsignal from outq with errmsg[consumer[consumer-c14-0] exits for reason[producer[producer-c14] failed with error: cannot reshape array of size 1 into shape (2)]]
2020-05-26 11:24:39,119-WARNING: recv endsignal from outq with errmsg[consumer[consumer-c14-1] exits for reason[consumer[consumer-c14-0] exits for reason[producer[producer-c14] failed with error: cannot reshape array of size 1 into shape (2)]]]
2020-05-26 11:24:39,119-WARNING: recv endsignal from outq with errmsg[consumer[consumer-c14-2] exits for reason[consumer[consumer-c14-1] exits for reason[consumer[consumer-c14-0] exits for reason[producer[producer-c14] failed with error: cannot reshape array of size 1 into shape (2)]]]]
2020-05-26 11:24:39,119-WARNING: recv endsignal from outq with errmsg[consumer[consumer-c14-3] exits for reason[consumer[consumer-c14-2] exits for reason[consumer[consumer-c14-1] exits for reason[consumer[consumer-c14-0] exits for reason[producer[producer-c14] failed with error: cannot reshape array of size 1 into shape (2)]]]]]
2020-05-26 11:24:39,119-WARNING: recv endsignal from outq with errmsg[consumer[consumer-c14-4] exits for reason[consumer[consumer-c14-3] exits for reason[consumer[consumer-c14-2] exits for reason[consumer[consumer-c14-1] exits for reason[consumer[consumer-c14-0] exits for reason[producer[producer-c14] failed with error: cannot reshape array of size 1 into shape (2)]]]]]]
2020-05-26 11:24:39,119-WARNING: recv endsignal from outq with errmsg[consumer[consumer-c14-5] exits for reason[consumer[consumer-c14-4] exits for reason[consumer[consumer-c14-3] exits for reason[consumer[consumer-c14-2] exits for reason[consumer[consumer-c14-1] exits for reason[consumer[consumer-c14-0] exits for reason[producer[producer-c14] failed with error: cannot reshape array of size 1 into shape (2)]]]]]]]
2020-05-26 11:24:39,119-WARNING: recv endsignal from outq with errmsg[consumer[consumer-c14-6] exits for reason[consumer[consumer-c14-5] exits for reason[consumer[consumer-c14-4] exits for reason[consumer[consumer-c14-3] exits for reason[consumer[consumer-c14-2] exits for reason[consumer[consumer-c14-1] exits for reason[consumer[consumer-c14-0] exits for reason[producer[producer-c14] failed with error: cannot reshape array of size 1 into shape (2)]]]]]]]]
2020-05-26 11:24:39,120-WARNING: recv endsignal from outq with errmsg[consumer[consumer-c14-7] exits for reason[consumer[consumer-c14-6] exits for reason[consumer[consumer-c14-5] exits for reason[consumer[consumer-c14-4] exits for reason[consumer[consumer-c14-3] exits for reason[consumer[consumer-c14-2] exits for reason[consumer[consumer-c14-1] exits for reason[consumer[consumer-c14-0] exits for reason[producer[producer-c14] failed with error: cannot reshape array of size 1 into shape (2)]]]]]]]]]
2020-05-26 11:24:39,120-WARNING: Your reader has raised an exception!
Exception in thread Thread-10:
Traceback (most recent call last):
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/reader.py", line 1156, in thread_main
    six.reraise(*sys.exc_info())
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/six.py", line 693, in reraise
    raise value
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/reader.py", line 1136, in thread_main
    for tensors in self._tensor_reader():
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/reader.py", line 1206, in tensor_reader_impl
    for slots in paddle_reader():
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/data_feeder.py", line 505, in reader_creator
    for item in reader():
  File "/home/aistudio/PaddleDetection-release-0.3/ppdet/data/reader.py", line 421, in _reader
    reader.reset()
  File "/home/aistudio/PaddleDetection-release-0.3/ppdet/data/parallel_map.py", line 259, in reset
    assert not self._exit, "cannot reset for already stopped dataset"
AssertionError: cannot reset for already stopped dataset

/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/executor.py:1070: UserWarning: The following exception is not an EOF exception.
  "The following exception is not an EOF exception.")
loading annotations into memory...
Done (t=12.03s)
creating index...
index created!
Traceback (most recent call last):
  File "PaddleDetection-release-0.3/tools/train.py", line 366, in <module>
    main()
  File "PaddleDetection-release-0.3/tools/train.py", line 239, in main
    outs = exe.run(compiled_train_prog, fetch_list=train_values)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/executor.py", line 1071, in run
    six.reraise(*sys.exc_info())
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/six.py", line 693, in reraise
    raise value
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/executor.py", line 1066, in run
    return_merged=return_merged)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/executor.py", line 1167, in _run_impl
    return_merged=return_merged)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/executor.py", line 879, in _run_parallel
    tensors = exe.run(fetch_var_names, return_merged)._move_to_list()
paddle.fluid.core_avx.EnforceNotMet:

C++ Call Stacks (More useful to developers):
0 std::string paddle::platform::GetTraceBackString<std::string const&>(std::string const&, char const, int)
1 paddle::platform::EnforceNotMet::EnforceNotMet(std::string const&, char const, int)
2 paddle::operators::reader::BlockingQueue<std::vector<paddle::framework::LoDTensor, std::allocatorpaddle::framework::LoDTensor > >::Receive(std::vector<paddle::framework::LoDTensor, std::allocatorpaddle::framework::LoDTensor >)
3 paddle::operators::reader::PyReader::ReadNext(std::vector<paddle::framework::LoDTensor, std::allocatorpaddle::framework::LoDTensor >)
4 std::_Function_handler<std::unique_ptr<std::future_base::_Result_base, std::future_base::_Result_base::_Deleter> (), std::future_base::_Task_setter<std::unique_ptr<std::future_base::_Result, std::future_base::_Result_base::_Deleter>, unsigned long> >::_M_invoke(std::_Any_data const&)
5 std::__future_base::_State_base::_M_do_set(std::function<std::unique_ptr<std::future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>&, bool&)
6 ThreadPool::ThreadPool(unsigned long)::{lambda()#1}::operator()() const

Python Call Stacks (More useful to users):
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/framework.py", line 2610, in append_op
    attrs=kwargs.get("attrs", None))
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/reader.py", line 1078, in _init_non_iterable
    attrs={'drop_last': self._drop_last})
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/reader.py", line 976, in __init__
    self._init_non_iterable()
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/reader.py", line 608, in from_generator
    iterable, return_list, drop_last)
  File "/home/aistudio/PaddleDetection-release-0.3/ppdet/modeling/architectures/cascade_mask_rcnn.py", line 426, in build_inputs
    iterable=iterable) if use_dataloader else None
  File "PaddleDetection-release-0.3/tools/train.py", line 112, in main
    feed_vars, train_loader = model.build_inputs(**inputs_def)
  File "PaddleDetection-release-0.3/tools/train.py", line 366, in <module>
    main()

Error Message Summary:
Error: Blocking queue is killed because the data reader raises an exception
  [Hint: Expected killed != true, but received killed:1 == true:1.] at (/paddle/paddle/fluid/operators/reader/blocking_queue.h:141)
  [operator < read > error]
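Editor's note: the blocking-queue message is only the last link in the chain. Every nested warning in the log wraps the same producer-side failure, "cannot reshape array of size 1 into shape (2)", so the useful step is to find which sample triggers it. The sketch below is a generic, framework-independent helper, not PaddleDetection code: `find_failing_samples` and `pair_coordinates` are hypothetical names, and the toy transform only stands in for whatever per-sample transform actually fails in the real reader.

```python
def find_failing_samples(samples, transform, max_failures=5):
    """Apply `transform` to each sample in order and collect the failures.

    Returns a list of (index, exception) pairs and stops early once
    `max_failures` failures have been recorded.
    """
    failures = []
    for index, sample in enumerate(samples):
        try:
            transform(sample)
        except Exception as exc:  # deliberately broad: this is a diagnostic tool
            failures.append((index, exc))
            if len(failures) >= max_failures:
                break
    return failures


def pair_coordinates(sample):
    """Toy stand-in for a mask transform: group flat coordinates into (x, y) pairs."""
    coords = sample["polygon"]
    if len(coords) % 2 != 0:
        raise ValueError("cannot pair %d coordinates" % len(coords))
    return list(zip(coords[0::2], coords[1::2]))


# Hypothetical toy records; the second one cannot be split into (x, y) pairs.
toy_samples = [
    {"polygon": [0, 0, 4, 0, 4, 4]},
    {"polygon": [7]},
]

for index, exc in find_failing_samples(toy_samples, pair_coordinates):
    print("sample %d failed: %s" % (index, exc))
```

Running the transforms sequentially in one process like this keeps the original exception and the sample index together, instead of the worker chain shown in the log.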

KK-Jiang commented 4 years ago

One more detail: the same dataset trains fine with yolov3, and it also trains fine with the config file cascade_rcnn_cbr200_vd_fpn_dcnv2_nonlocal_softnms.yml.
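Editor's note: both configs that work are box-only, while the failing one also requests 'gt_mask'. One possibility, not confirmed anywhere in this thread, is that some annotations carry a missing or malformed `segmentation` field. The sketch below assumes the standard COCO layout (a list of polygons, each a flat `[x1, y1, x2, y2, ...]` list) and scans for entries that could not be split into (x, y) pairs. The function names are hypothetical, and the toy records are made up for illustration.

```python
import json


def check_annotation(ann):
    """Return a reason string if the annotation's polygons look unusable, else None."""
    seg = ann.get("segmentation")
    if not seg:
        return "missing or empty segmentation"
    if isinstance(seg, dict):
        return None  # RLE mask (usually iscrowd=1); not inspected in this sketch
    for polygon in seg:
        if not isinstance(polygon, (list, tuple)):
            return "segmentation is a flat list, expected a list of polygons"
        if len(polygon) % 2 != 0:
            return "polygon has an odd number of coordinates"
        if len(polygon) < 6:
            return "polygon has fewer than 3 points"
    return None


def find_bad_segmentations(coco):
    """Return (annotation_id, reason) pairs for every suspicious annotation."""
    problems = []
    for ann in coco.get("annotations", []):
        reason = check_annotation(ann)
        if reason is not None:
            problems.append((ann.get("id"), reason))
    return problems


def load_coco(path):
    """Load a COCO-style annotation file, e.g. the train.json named in the config."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


# Hypothetical toy annotations: only id 1 is well formed.
toy = {
    "annotations": [
        {"id": 1, "segmentation": [[0, 0, 4, 0, 4, 4]]},
        {"id": 2, "segmentation": [10.0, 20.0, 30.0, 40.0]},
        {"id": 3},
        {"id": 4, "segmentation": [[0, 0, 1, 1]]},
        {"id": 5, "segmentation": [[0, 0, 1, 1, 2]]},
    ]
}

for ann_id, reason in find_bad_segmentations(toy):
    print("annotation %s: %s" % (ann_id, reason))
```

To check a real file, pass the result of `load_coco(...)` to `find_bad_segmentations`. A clean report would not rule out other causes; it only excludes this one.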

jerrywgz commented 4 years ago

Could I see your config file? On my side it runs normally on a P40 with the COCO dataset.

KK-Jiang commented 4 years ago

jerrywgz wrote: "Could I see your config file? On my side it runs normally on a P40 with the COCO dataset."

architecture: CascadeMaskRCNN
max_iters: 300000
snapshot_iter: 10
use_gpu: true
log_iter: 20
log_smooth_window: 20
save_dir: output
pretrain_weights: https://paddle-imagenet-models-name.bj.bcebos.com/SENet154_vd_caffe_pretrained.tar
weights: output/cascade_mask_rcnn_dcn_se154_vd_fpn_gn_s1x/model_final
metric: COCO
num_classes: 34

CascadeMaskRCNN:
  backbone: SENet
  fpn: FPN
  rpn_head: FPNRPNHead
  roi_extractor: FPNRoIAlign
  bbox_head: CascadeBBoxHead
  bbox_assigner: CascadeBBoxAssigner
  mask_assigner: MaskAssigner
  mask_head: MaskHead

SENet:
  depth: 152
  feature_maps: [2, 3, 4, 5]
  freeze_at: 2
  group_width: 4
  groups: 64
  norm_type: bn
  freeze_norm: True
  variant: d
  dcn_v2_stages: [3, 4, 5]
  std_senet: True

FPN:
  max_level: 6
  min_level: 2
  num_chan: 256
  spatial_scale: [0.03125, 0.0625, 0.125, 0.25]
  freeze_norm: False
  norm_type: gn

FPNRPNHead:
  anchor_generator:
    aspect_ratios: [0.5, 1.0, 2.0]
    variance: [1.0, 1.0, 1.0, 1.0]
  anchor_start_size: 32
  max_level: 6
  min_level: 2
  num_chan: 256
  rpn_target_assign:
    rpn_batch_size_per_im: 256
    rpn_fg_fraction: 0.5
    rpn_negative_overlap: 0.3
    rpn_positive_overlap: 0.7
    rpn_straddle_thresh: 0.0
  train_proposal:
    min_size: 0.0
    nms_thresh: 0.7
    pre_nms_top_n: 2000
    post_nms_top_n: 2000
  test_proposal:
    min_size: 0.0
    nms_thresh: 0.7
    pre_nms_top_n: 1000
    post_nms_top_n: 1000

FPNRoIAlign:
  canconical_level: 4
  canonical_size: 224
  max_level: 5
  min_level: 2
  box_resolution: 7
  sampling_ratio: 2
  mask_resolution: 14

MaskHead:
  dilation: 1
  conv_dim: 256
  num_convs: 4
  resolution: 28
  norm_type: gn

CascadeBBoxAssigner:
  batch_size_per_im: 512
  bbox_reg_weights: [10, 20, 30]
  bg_thresh_hi: [0.5, 0.6, 0.7]
  bg_thresh_lo: [0.0, 0.0, 0.0]
  fg_fraction: 0.25
  fg_thresh: [0.5, 0.6, 0.7]

MaskAssigner:
  resolution: 28

CascadeBBoxHead:
  head: CascadeXConvNormHead
  nms:
    keep_top_k: 100
    nms_threshold: 0.5
    score_threshold: 0.05

CascadeXConvNormHead:
  norm_type: gn

LearningRate:
  base_lr: 0.01
  schedulers:
  - !PiecewiseDecay
    gamma: 0.1
    milestones: [150000, 240000, 280000]
  - !LinearWarmup
    start_factor: 0.01
    steps: 2000

OptimizerBuilder:
  optimizer:
    momentum: 0.9
    type: Momentum
  regularizer:
    factor: 0.0001
    type: L2

TrainReader:
  batch_size: 2
  inputs_def:
    fields: ['image', 'im_info', 'im_id', 'gt_bbox', 'gt_class', 'is_crowd', 'gt_mask']
  dataset:
    !COCODataSet
    dataset_dir: data/
    image_dir: train/image
    anno_path: train/train.json
    with_background: False
  sample_transforms:
  - !DecodeImage
    to_rgb: false
  - !RandomDistort
    is_order: False
  - !RandomFlipImage
    is_mask_flip: true
    is_normalized: false
    prob: 0.5
  - !NormalizeImage
    is_channel_first: false
    is_scale: False
    mean: [125.95, 142.83, 156.08]
    std: [17.23, 15.18, 13.11]
  - !ResizeImage
    interp: 1
    target_size: [416, 448, 480, 512, 544, 576, 608, 640, 672, 704, 736, 768, 800, 832, 864, 896, 928, 960, 992, 1024]
    max_size: 1200
    use_cv2: true
  - !Permute
    channel_first: true
    to_bgr: false
  batch_transforms:
  - !PadBatch
    pad_to_stride: 32
  worker_num: 8
  shuffle: true

EvalReader:
  batch_size: 2
  inputs_def:
    fields: ['image', 'im_info', 'im_id', 'im_shape']
  dataset:
    !COCODataSet
    dataset_dir: data/
    anno_path: train/valid.json
    image_dir: train/image
    with_background: False
  sample_transforms:
  - !DecodeImage
    to_rgb: False
  - !NormalizeImage
    is_channel_first: false
    is_scale: False
    mean: [125.95, 142.83, 156.08]
    std: [17.23, 15.18, 13.11]
  - !ResizeImage
    interp: 1
    target_size: 800
    max_size: 1200
    use_cv2: true
  - !Permute
    channel_first: true
    to_bgr: false
  batch_transforms:
  - !PadBatch
    pad_to_stride: 32
  worker_num: 8
  drop_empty: false

TestReader:
  batch_size: 1
  inputs_def:
    fields: ['image', 'im_info', 'im_id', 'im_shape']
  dataset:
    !ImageFolder
    anno_path: train/valid.json
    with_background: False
  sample_transforms:
  - !DecodeImage
    to_rgb: False
  - !NormalizeImage
    is_channel_first: false
    is_scale: False
    mean: [125.95, 142.83, 156.08]
    std: [17.23, 15.18, 13.11]
  - !Permute
    channel_first: true
    to_bgr: false
  batch_transforms:
  - !PadBatch
    pad_to_stride: 32
  worker_num: 1

jerrywgz commented 4 years ago

For this model, num_classes should include the background class, and with_background in the reader should also be set to true.
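Editor's note: the advice above amounts to a counting rule, sketched here under the assumption that "include the background class" means one extra class on top of the dataset's own categories. `expected_num_classes` is a hypothetical helper, not a PaddleDetection function, and the category list is made up for illustration.

```python
def expected_num_classes(categories, with_background):
    """Count distinct category ids, plus one if a background class is included."""
    foreground = len({category["id"] for category in categories})
    return foreground + 1 if with_background else foreground


# Hypothetical category list in COCO style.
toy_categories = [
    {"id": 1, "name": "cat"},
    {"id": 2, "name": "dog"},
    {"id": 3, "name": "bird"},
]

print(expected_num_classes(toy_categories, with_background=True))
print(expected_num_classes(toy_categories, with_background=False))
```

Under this rule, the value of num_classes in the config and the with_background flag in each reader's dataset block would need to agree with the number of categories in the annotation file.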

KK-Jiang commented 4 years ago

jerrywgz wrote: "For this model, num_classes should include the background class, and with_background in the reader should also be set to true."

I tried making that change; it fails with the same error.

wangbo-git commented 3 years ago

Could I see the config file you used when the yolov3 model trained normally on the same dataset?