PaddlePaddle / PaddleSeg

Easy-to-use image segmentation library with an awesome pre-trained model zoo, supporting a wide range of practical tasks in Semantic Segmentation, Interactive Segmentation, Panoptic Segmentation, Image Matting, 3D Segmentation, etc.
https://arxiv.org/abs/2101.06175
Apache License 2.0

Segmentation results shift periodically as the input image size changes #3385

Closed. LHTcode closed this issue 8 months ago.

LHTcode commented 1 year ago

Search before asking

Describe the Bug

I trained the MobileSeg model provided by PaddleSeg on the optic disc dataset (optic_disc_seg) and ran inference on the same image at different sizes with the tools/predict.py script. When the image size is varied continuously along either the width or the height dimension, the inference results shift periodically.

Experimental data

  1. H-dimension size sequence: (512, 511), (512, 510), (512, 509), ..., (512, 496), (512, 495) — 16 images in total;
  2. W-dimension size sequence: (511, 512), (510, 512), (509, 512), ..., (496, 512), (495, 512) — 16 images in total.
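The two sweeps can be generated by cropping a single 512×512 source image. A minimal sketch (the zero array stands in for a real test image; sizes are given as (width, height), as in the lists above):

```python
import numpy as np

# Stand-in for a real 512x512 RGB test image (NumPy arrays are H x W x C).
img = np.zeros((512, 512, 3), dtype=np.uint8)

# H sweep: (512, 511) down to (512, 495) -- width fixed at 512, height varies.
h_crops = {h: img[:h, :, :] for h in range(511, 494, -1)}
# W sweep: (511, 512) down to (495, 512) -- height fixed at 512, width varies.
w_crops = {w: img[:, :w, :] for w in range(511, 494, -1)}

assert h_crops[511].shape == (511, 512, 3)
assert w_crops[495].shape == (512, 495, 3)
```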

Results

  1. H-dimension images: (512,511), (512,509), (512,507), (512,505), (512,503), (512,501), (512,499), (512,497), (512,495)

  2. W-dimension images: (511,512), (509,512), (507,512), (505,512), (503,512), (501,512), (499,512), (497,512), (495,512)

As shown in the images above, the period is roughly 15 rows/columns of pixels: as the size shrinks along either dimension, the segmentation quality changes abruptly at the 16th pixel, usually recovering to its best. The same phenomenon occurs with several other PaddleSeg models (PP-LiteSeg, FCN, MobileSeg, UNet, ...), though the period may differ between models, so I won't list them all. I also reproduced the same behavior on my own dataset.
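One plausible explanation for the ~16-pixel period (an assumption on my part, not confirmed in the thread): these backbones downsample through several stride-2 stages, for an overall output stride of 16, so the deepest feature map has height ceil(H/16). The effective scale factor of the final bilinear upsample then drifts as H shrinks and snaps back to exactly 1 each time H crosses a multiple of 16:

```python
import math

STRIDE = 16  # assumed overall output stride of the backbone

for h in range(512, 494, -1):
    feat_h = math.ceil(h / STRIDE)  # deepest feature-map height
    scale = h / (feat_h * STRIDE)   # effective rescale vs. the "clean" grid
    # scale == 1.0 only when h is a multiple of 16 (512, 496, ...); in between,
    # the upsampled grid is slightly misaligned with the input pixels.
    print(h, feat_h, round(scale, 4))
```

Under this arithmetic the quality would recover at H = 496, i.e. at the 16th image of the sweep, which matches the reported period.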

Meanwhile, a UNet model I trained separately in PyTorch does not show this periodic shift.

Below is my config file:

batch_size: 2
iters: 10000

train_dataset:
  type: Dataset
  dataset_root: data/optic_disc_seg
  train_path: data/optic_disc_seg/train_list.txt
  num_classes: 2
  mode: train
  transforms:
    - type: RandomPaddingCrop
      crop_size: [512, 512]
      im_padding_value: 0
      label_padding_value: 0
    - type: RandomHorizontalFlip
    - type: RandomVerticalFlip
    - type: Normalize

val_dataset:
  type: Dataset
  dataset_root: data/optic_disc_seg
  val_path: data/optic_disc_seg/val_list.txt
  num_classes: 2
  mode: val
  transforms:
    - type: Normalize

optimizer:
  type: SGD
  momentum: 0.9
  weight_decay: 4.0e-5

lr_scheduler:
  type: PolynomialDecay
  learning_rate: 0.01
  end_lr: 0
  power: 1.0

loss:
  types:
    - type: OhemCrossEntropyLoss
      min_kept: 130000
    - type: OhemCrossEntropyLoss
      min_kept: 130000
    - type: OhemCrossEntropyLoss
      min_kept: 130000
  coef: [1, 1, 1]

model:
  type: MobileSeg
  backbone:
    type: MobileNetV3_large_x1_0  # out channels: [24, 40, 112, 160]
    pretrained: https://paddleseg.bj.bcebos.com/dygraph/backbone/mobilenetv3_large_x1_0_ssld.tar.gz
  cm_bin_sizes: [1, 2, 4]
  cm_out_ch: 128
  arm_out_chs: [32, 64, 128]
  seg_head_inter_chs: [32, 32, 32]

No other code changes.

Environment

Bug description confirmation

Are you willing to submit a PR?

LHTcode commented 1 year ago

Update: I later found that the problem can be mitigated by setting --input_shape in the export.py script to the original training image size, or to a size at a peak of the shift cycle. However, if I set --input_shape to a size at the trough of the cycle, the shift seen at the trough gets baked into the exported model.
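The workaround above, sketched as an export command (the config and model paths are placeholders; --input_shape is the fixed NCHW input shape baked into the exported model):

```shell
python tools/export.py \
    --config path/to/your_config.yml \
    --model_path output/best_model/model.pdparams \
    --save_dir output/inference_model \
    --input_shape 1 3 512 512
```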

Asthestarsfalll commented 1 year ago

This may be a misalignment introduced by downsampling and upsampling. Please double-check that the torch-side training and inference are consistent. Also, is there any particular rationale behind this test method of continuously varying the input's width and height?

LHTcode commented 1 year ago

The torch-side training/inference pipeline is not identical to PaddleSeg's; I am currently working on aligning them.

This isn't a special test method. In my application I need to run inference (deployed via FastDeploy) on a variable-size ROI within an image, which is how I discovered the problem: merely shrinking the ROI makes the segmentation quality alternate between good and bad. So I cropped the dataset images as described above and ran inference directly inside PaddleSeg to verify; the problem persists (which should rule out FastDeploy?). I then tested several other models, where the problem also occurs with nearly identical behavior.

LHTcode commented 1 year ago

Yesterday, with the PaddlePaddle model structure matched to the torch model structure (the official U-Net implementation), I still reproduced the shift when running inference with tools/predict.py.

Asthestarsfalll commented 1 year ago

> Yesterday, with the PaddlePaddle model structure matched to the torch model structure (the official U-Net implementation), I still reproduced the shift when running inference with tools/predict.py.

Is the inference pipeline aligned too? You could also consider converting the torch weights over and testing with those.

LHTcode commented 1 year ago

The inference pipelines are not the same: the torch side uses the U-Net repo's inference script, while the PaddlePaddle side uses PaddleSeg's. That said, I don't think the inference scripts are the problem. Today I plan to test with identical model weights and see which layer goes wrong.

LHTcode commented 1 year ago

Why was interpolation chosen here instead of transposed convolution? I found that the periodic shift originates from the interpolation used in many of the models PaddleSeg provides (FCN, MobileSeg, PP-LiteSeg, UNet, ...):

# This snippet is from \paddleseg\models\unet.py
    def forward(self, x, short_cut):
        if self.use_deconv:
            x = self.deconv(x)
        else:
            x = F.interpolate(
                x,
                paddle.shape(short_cut)[2:],
                mode='bilinear',
                align_corners=self.align_corners)
        x = paddle.concat([x, short_cut], axis=1)
        x = self.double_conv(x)
        return x
# This snippet is from the official U-Net paper repository
    def forward(self, x1, x2):
        x1 = self.up(x1)
        # input is CHW
        diffY = x2.size()[2] - x1.size()[2]
        diffX = x2.size()[3] - x1.size()[3]

        x1 = F.pad(x1, [diffX // 2, diffX - diffX // 2,
                        diffY // 2, diffY - diffY // 2])
        # if you have padding issues, see
        # https://github.com/HaiyongJiang/U-Net-Pytorch-Unstructured-Buggy/commit/0e854509c2cea854e247a9c615f175f76fbb2e3a
        # https://github.com/xiaopeng-liao/Pytorch-UNet/commit/8ebac70e633bac59fc22bb5195e513d5832fb3bd
        x = torch.cat([x2, x1], dim=1)
        return self.conv(x)

I also noticed that the UNet model PaddleSeg provides lacks this adaptive pad. Without it, using transposed convolution raises an error (a size mismatch between the two tensors at concat). After adding the adaptive pad to PaddleSeg, transposed convolution works and the "periodic shift" in inference disappears, but model export now fails. Below is the export error:

File "c:\......\paddleseg\paddleseg\deploy\export.py", line 26, in forward
        outs = self.model(x)
    File "c:\......\paddleseg\paddleseg\models\unet.py", line 66, in forward
        x = self.decode(x, short_cuts)
    File "c:\......\paddleseg\paddleseg\models\unet.py", line 115, in forward
        for i in range(len(short_cuts)):
    File "c:\......\paddleseg\paddleseg\models\unet.py", line 116, in forward
        x = self.up_sample_list[i](x, short_cuts[-(i + 1)])
    File "c:\......\paddleseg\paddleseg\models\unet.py", line 147, in forward
        if self.use_deconv:
    File "c:\......\paddleseg\paddleseg\models\unet.py", line 151, in forward
            diffY = short_cut.shape[2] - x.shape[2]
            diffX = short_cut.shape[3] - x.shape[3]
            x = paddle.nn.functional.pad(x=x, pad=[diffX // 2, diffX - diffX // 2,
            ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
                                                  diffY // 2, diffY - diffY // 2])
        else:

    File "C:\......\PaddleSeg\venv\lib\site-packages\paddle\nn\functional\common.py", line 1728, in pad
        helper.append_op(
    File "C:\......\PaddleSeg\venv\lib\site-packages\paddle\fluid\layer_helper.py", line 45, in append_op
        return self.main_program.current_block().append_op(*args, **kwargs)
    File "C:\......\PaddleSeg\venv\lib\site-packages\paddle\fluid\framework.py", line 4040, in append_op
        op = Operator(
    File "C:\......\PaddleSeg\venv\lib\site-packages\paddle\fluid\framework.py", line 3012, in __init__
        self._update_desc_attr(attr_name, attr_val)
    File "C:\......\PaddleSeg\venv\lib\site-packages\paddle\fluid\framework.py", line 3362, in _update_desc_attr
        self._update_desc_plain_attr(name, val)
    File "C:\......\PaddleSeg\venv\lib\site-packages\paddle\fluid\framework.py", line 3386, in _update_desc_plain_attr
        desc._set_int32s_attr(name, val)

TypeError: _set_int32s_attr(): incompatible function arguments. The following argument types are supported:
    1. (self: paddle.fluid.libpaddle.OpDesc, arg0: str, arg1: List[int]) -> None

Invoked with: <paddle.fluid.libpaddle.OpDesc object at 0x000001F13ACD2CB0>, 'paddings', [var tmp_3 : LOD_TENSOR.shape(1,).dtype(int32).stop_gradient(False), var tmp_6 : LOD_TENSOR.shape(1,).dtype(int32).stop_gradient(False), var
 tmp_8 : LOD_TENSOR.shape(1,).dtype(int32).stop_gradient(False), var tmp_11 : LOD_TENSOR.shape(1,).dtype(int32).stop_gradient(False), 0, 0]

Asthestarsfalll commented 1 year ago

@LHTcode

  1. Replacing deconvolution with interpolation is a very common practice in segmentation today: it reduces computation and parameter count and speeds up inference, which is why many real-time semantic segmentation models upsample via interpolation.
  2. PaddleSeg's main use cases assume a relatively fixed input size, so adapting to arbitrary input sizes was not fully considered; you are welcome to submit a PR.
  3. Did you run a forward pass before inference? This looks like incorrect usage of Paddle's pad; please check the official docs. As I recall it differs somewhat from torch's.
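On point 3: for 4-D NCHW input, both frameworks accept a 4-element pad of the form [left, right, top, bottom] applied to the last two dimensions; the export traceback above rather suggests the op received a pad list containing scalar Tensors where plain ints were expected. The intended padding itself can be checked against a framework-independent NumPy reference (a sketch):

```python
import numpy as np

def pad_nchw(x, left, right, top, bottom):
    """Constant-pad the H and W dims of an NCHW array,
    matching F.pad(x, [left, right, top, bottom])."""
    return np.pad(x, ((0, 0), (0, 0), (top, bottom), (left, right)))

x = np.ones((1, 3, 5, 5), dtype=np.float32)
y = pad_nchw(x, 1, 0, 2, 1)
assert y.shape == (1, 3, 8, 6)  # H: 5+2+1, W: 5+1+0
assert y.dtype == np.float32    # padding must not change the dtype
```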

LHTcode commented 1 year ago

I checked the official functional.pad docs and it seems that's how it is used. I tried to fix the export failure following this issue, PaddlePaddle/Paddle#52392; the code is:

    def forward(self, x, short_cut):
        if self.use_deconv:
            x = self.deconv(x)
        else:
            x = self.up_sample_paddle(x)
            padding = paddle.full([4], 0, dtype=paddle.int32)
            diffY = short_cut.shape[2] - x.shape[2]
            diffX = short_cut.shape[3] - x.shape[3]
            padding[0] = diffX // 2
            padding[1] = diffX - diffX // 2
            padding[2] = diffY // 2
            padding[3] = diffY - diffY // 2
            x = paddle.nn.functional.pad(x=x, pad=padding.astype(paddle.float32), value=0.0)
            # x = F.interpolate(
            #     x,
            #     paddle.shape(short_cut)[2:],
            #     mode='bilinear',
            #     align_corners=self.align_corners)
        x = paddle.concat([x, short_cut], axis=1)
        x = self.double_conv(x)
        return x

Now the model exports successfully via the export.py script, but running deploy/python/infer.py fails with the following error:

Traceback (most recent call last):
  File "C:\......\PaddleSeg\deploy\python\infer.py", line 396, in <module>
    main(args)
  File "C:\......\PaddleSeg\deploy\python\infer.py", line 384, in main
    predictor.run(imgs_list)
  File "C:\......\PaddleSeg\deploy\python\infer.py", line 236, in run
    self.predictor.run()
ValueError: (InvalidArgument) input and filter data type should be consistent, but received input data type is int and filter type is float
  [Hint: Expected input_data_type == filter_data_type, but received input_data_type:2 != filter_data_type:5.] (at ..\paddle/fluid/operators/conv_op.cc:235)
  [operator < conv2d_fusion > error]

How should I resolve this? I can't step into self.predictor.run() with a breakpoint, so I have no way to debug it.

Asthestarsfalll commented 1 year ago

@LHTcode The error says the input data type and the weight data type don't match; please check that.