libingbingdev commented 1 year ago

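Problem description

I trained an object detection model using FasterRCNN as the baseline and deployed it to an on-site industrial PC for image prediction. It works correctly most of the time, but sporadic errors occur a few times every day.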

Environment

1. Windows 10 Enterprise, 64-bit / i7-9700 CPU / 16 GB RAM / 2080 Ti
2. python==3.9.7; paddlepaddle-gpu==2.2.1; paddlex==2.0.0

Model training code

import os

import paddlex as pdx
from paddlex import transforms

train_transforms = transforms.Compose([
    transforms.RandomDistort(),
    transforms.RandomHorizontalFlip(),
    transforms.ResizeByShort(short_size=1024, max_size=2048),
    transforms.Normalize(),
])

eval_transforms = transforms.Compose([
    transforms.ResizeByShort(short_size=1024, max_size=2048),
    transforms.Normalize(),
])

root_path = 'Full'
train_dataset = pdx.datasets.VOCDetection(
    data_dir=root_path,
    file_list=os.path.join(root_path, 'train_list.txt'),
    label_list=os.path.join(root_path, 'labels.txt'),
    transforms=train_transforms,
    shuffle=True)
eval_dataset = pdx.datasets.VOCDetection(
    data_dir=root_path,
    file_list=os.path.join(root_path, 'val_list.txt'),
    label_list=os.path.join(root_path, 'labels.txt'),
    transforms=eval_transforms)

train_dataset.add_negative_samples(image_dir='Background')

num_classes = len(train_dataset.labels) + 1

model = pdx.det.FasterRCNN(
    num_classes=num_classes,
    backbone='ResNet50_vd_ssld',
    with_dcn=True,
    fpn_num_channels=64,
    with_fpn=True,
    test_pre_nms_top_n=500,
    test_post_nms_top_n=300)

model.train(
    num_epochs=20,
    train_dataset=train_dataset,
    train_batch_size=4,
    eval_dataset=eval_dataset,
    save_interval_epochs=1,
    metric='VOC',
    learning_rate=0.01,
    lr_decay_epochs=[12, 16],
    warmup_steps=500,
    save_dir='Output/Full/faster_rcnn_r50_vd_dcn',
    use_vdl=True,
    early_stop=True)

Exporting the model

paddlex --export_inference --model_dir=Output/faster_rcnn_r50_vd_dcn/best_model --save_dir=Output/faster_rcnn_r50_vd_dcn/

Model prediction code

model = pdx.load_model(path_to_model)
result = model.predict(image)

model.yml

Model: FasterRCNN
Transforms:
- ResizeByShort:
    interp: LINEAR
    max_size: 2048
    short_size: 1024
- Normalize:
    is_scale: true
    mean:
    - 0.485
    - 0.456
    - 0.406
    std:
    - 0.229
    - 0.224
    - 0.225
- Padding:
    im_padding_value:
    - 0.0
    - 0.0
    - 0.0
    label_padding_value: 255
    offsets: null
    pad_mode: 0
    size_divisor: 32
    target_size: null
_Attributes:
  eval_metrics:
    bbox_map: 82.8894546295336
  fixed_input_shape:
  - -1
  - 3
  - -1
  - -1
  labels:
  - guahua
  - liehen
  - posun
  - queliao
  - waixie
  - zangwu
  model_type: detector
  num_classes: 7
_init_params:
  anchor_sizes:
  - - 32
  - - 64
  - - 128
  - - 256
  - - 512
  aspect_ratios:
  - 0.5
  - 1.0
  - 2.0
  backbone: ResNet50_vd_ssld
  fpn_num_channels: 64
  keep_top_k: 100
  nms_threshold: 0.5
  num_classes: 7
  rpn_batch_size_per_im: 256
  rpn_fg_fraction: 0.5
  score_threshold: 0.05
  test_post_nms_top_n: 300
  test_pre_nms_top_n: 500
  with_dcn: true
  with_fpn: true
completed_epochs: 0
status: Infer
version: 2.0.0

Error messages

Type 1: ERROR The dims of Inputs(Condition) and Inputs(X) should be same. But received Condition's shape is [3, 1], X's shape is [1, 1] [Hint: Expected cond_dims == x_dims, but received cond_dims:3, 1 != x_dims:1, 1.] (at C:/home/workspace/Paddle_release2/paddle/fluid/operators/where_op.cc:38) [operator < where > error]

Type 2: ERROR The dims of Inputs(Condition) and Inputs(X) should be same. But received Condition's shape is [2, 1], X's shape is [1, 1] [Hint: Expected cond_dims == x_dims, but received cond_dims:2, 1 != x_dims:1, 1.] (at C:/home/workspace/Paddle_release2/paddle/fluid/operators/where_op.cc:38) [operator < where > error]

Type 3: ERROR Dims of all Inputs(X) must be the same, but received input 1 dim is:1 not equal to input 0 dim:2.

[operator < stack > error]

Type 4: ERROR Dims of all Inputs(X) must be the same, but received input 1 dim is:1 not equal to input 0 dim:4.

[operator < stack > error]

Type 5: ERROR Broadcast dimension mismatch. Operands could not be broadcast together with the shape of X = [5] and the shape of Y = [3]. Received [5] in X is not equal to [3] in Y at i:0. [Hint: Expected x_dims_array[i] == y_dims_array[i] || x_dims_array[i] <= 1 || y_dims_array[i] <= 1 == true, but received x_dims_array[i] == y_dims_array[i] || x_dims_array[i] <= 1 || y_dims_array[i] <= 1:0 != true:1.] (at C:\home\workspace\Paddle_release2\paddle/fluid/operators/elementwise/elementwise_op_function.h:169) [operator < elementwise_min > error]
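For readers unfamiliar with the condition quoted in the Type 5 hint, the following is a minimal sketch of that per-dimension check in plain Python. It is not Paddle's implementation: the function name `broadcast_compatible` is hypothetical, and right-aligning the shapes (NumPy style) is an assumption made for illustration.

```python
from itertools import zip_longest


def broadcast_compatible(x_dims, y_dims):
    """Return True if two shapes can be broadcast together.

    Shapes are right-aligned (an assumption, as in NumPy), and every pair
    of sizes must satisfy the condition quoted in the hint:
    x == y or x <= 1 or y <= 1.
    """
    pairs = zip_longest(reversed(x_dims), reversed(y_dims), fillvalue=1)
    return all(x == y or x <= 1 or y <= 1 for x, y in pairs)


# Shapes reported in the Type 5 error: 5 != 3 and neither side is 1.
print(broadcast_compatible([5], [3]))  # False
# A size-1 dimension can be stretched to match the other operand.
print(broadcast_compatible([5], [1]))  # True
```

Under this rule, X = [5] and Y = [3] can never be combined, so the two operands reaching `elementwise_min` must have been produced with different element counts.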

lailuboy commented 1 year ago

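Judging from the errors, something is wrong with the input on those calls. Can you confirm that the input given to the model when the error occurs is the same as at other times?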

libingbingdev commented 1 year ago

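> Judging from the errors, something is wrong with the input on those calls. Can you confirm that the input given to the model when the error occurs is the same as at other times?

We use a Hikvision line-scan camera that is triggered online, and every image fed to the model is 2048×1024×3. The images corresponding to the failures are saved in real time, and they look the same as the images from normal runs. The model runs a little over ten thousand times a day, and the log shows roughly seven or eight errors per day.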

lailuboy commented 1 year ago

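From the code, the input given to the model each time is `image`. Is that the 2048×1024, 3-channel image you mentioned? Does "saved in real time" mean the input is written to a local file before every prediction? And when an error occurs, reloading the saved image and predicting again works fine, right? If convenient, please share the preprocessing code applied to `image` before prediction, along with the images saved when the errors occurred.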

libingbingdev commented 1 year ago

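> From the code, the input given to the model each time is `image`. Is that the 2048×1024, 3-channel image you mentioned? Does "saved in real time" mean the input is written to a local file before every prediction? And when an error occurs, reloading the saved image and predicting again works fine, right? If convenient, please share the preprocessing code applied to `image` before prediction, along with the images saved when the errors occurred.

After the camera is triggered, the image is first stored locally and then read back for prediction. Reloading an image saved at the time of an error and predicting again works fine, and the image itself is intact. The code that processes the image before prediction: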

img = cv2.imread(imagepath)
rows, cols, channels = img.shape
black = np.zeros([rows, cols, channels], img.dtype)
original = cv2.addWeighted(img, c, black, 1-c, b)
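Because `black` is all zeros, the `cv2.addWeighted` call above reduces to a linear contrast and brightness adjustment, `img * c + b`, saturated to the image's value range. The sketch below expresses that in NumPy only, assuming an 8-bit image. The function name is hypothetical, the values of `c` and `b` are not given in the thread, and the result may differ from OpenCV's in how exact half values are rounded.

```python
import numpy as np


def adjust_contrast_brightness(img, c, b):
    """Approximate cv2.addWeighted(img, c, zeros, 1 - c, b) for uint8 input."""
    # The all-zero second image contributes nothing, leaving img * c + b.
    out = img.astype(np.float64) * c + b
    # Round and saturate to the uint8 range, as OpenCV does for 8-bit images.
    return np.clip(np.rint(out), 0, 255).astype(np.uint8)


sample = np.array([[[100, 200, 0]]], dtype=np.uint8)
# Example values for c and b, chosen only for illustration.
print(adjust_contrast_brightness(sample, 1.5, -20))
```

The output keeps the input's shape and dtype, so this step on its own should not change the dimensions of what the model receives.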

try:
    result_full = predict.predict_img(self.model_full, original)
    check_pic = visualize.visualize_detection(original.copy(), result_full, threshold=config.threshold, save_dir=config.today.get_check_full_path())
except Exception as e:
    log.error(e)
    self.savePic(num, '00', 'kadun', original)

Images saved when the errors occurred: https://pan.baidu.com/s/1C7JkJ1R_TPTnj3fXrUc7tg (extraction code: urru)
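One way to check whether the array passed to the model at failure time really matches the one reloaded later is to log a compact fingerprint of it. This is a minimal sketch, assuming NumPy arrays as returned by `cv2.imread`. The helper name `describe_input` and the dictionary keys are hypothetical and are not part of PaddleX or of the code above.

```python
import hashlib

import numpy as np


def describe_input(img):
    """Summarise an image array so two inputs can be compared from logs."""
    if img is None:
        # cv2.imread returns None when a file cannot be read or decoded.
        return {"loaded": False}
    arr = np.ascontiguousarray(img)
    return {
        "loaded": True,
        "shape": tuple(arr.shape),
        "dtype": str(arr.dtype),
        # Hash of the raw pixel bytes; identical arrays give identical digests.
        "md5": hashlib.md5(arr.tobytes()).hexdigest(),
    }


frame = np.zeros((4, 6, 3), dtype=np.uint8)
print(describe_input(frame))
```

Logging `describe_input(original)` in the `except` branch, and again when the saved file is reloaded offline, would show whether shape, dtype, and pixel content are truly identical between the failing call and the successful retry.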

PaddlePaddle / PaddleX

FasterRCNN object detection model: sporadic errors during image prediction #1669
