GewelsJI / FSNet

Full-Duplex Strategy for Video Object Segmentation, ICCV, 2021.
Apache License 2.0

effectiveness of the PPM #2

Closed clelouch closed 3 years ago

clelouch commented 3 years ago

Thanks for your code. I do not quite understand why you plug a PPM into each decoder. In the last decoder, the PPM extracts features with resolutions of 1x1, 2x2, 3x3, and 6x6, yet the size of the input feature is 88x88. It seems that these intermediate features extracted by the PPM are too coarse to facilitate the refinement of the input feature. Have you evaluated the effectiveness of the PPM?

GewelsJI commented 3 years ago

Hi, @clelouch

Thanks for your attention to our work.

We use four pooling operations to extract discriminative features at different perception scales, which does not change the resolution of the input feature. The scales of 1x1, 2x2, 3x3, and 6x6 refer to the kernel sizes of the pooling operations. Besides, in our initial experiments, the introduced PPM helps the network obtain more refined features during the aggregation in the UNet decoder, and thus it slightly improves the network's performance. You can ablate it and discuss the results with me later.

Hope this helps.

Best regards.

clelouch commented 3 years ago

@GewelsJI Thanks for your reply. If I am not wrong, you use AdaptiveAvgPool2d, and the 'size' argument of that pooling function indicates the output size, not the pooling kernel size. I guess the PPM is effective because it enlarges the receptive field. Then, employing the PPM only in the first decoder may be enough.
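A minimal PyTorch check of this point (a sketch only; the 88x88 shape is just the decoder feature size mentioned above):

```python
import torch
import torch.nn as nn

# The argument to AdaptiveAvgPool2d is the *output* spatial size,
# not a pooling kernel size.
x = torch.randn(1, 64, 88, 88)  # e.g. the 88x88 decoder feature

for size in (1, 2, 3, 6):
    pool = nn.AdaptiveAvgPool2d(size)
    y = pool(x)
    print(size, tuple(y.shape))  # -> (1, 64, size, size)
```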

GewelsJI commented 3 years ago

Hi @clelouch

I am very sorry for the misleading information yesterday; I mixed things up because, in our implementation, the pooling operation is combined with an upsampling operation that recovers the original shape.

Yes, you are right. The PPM enlarges the receptive field. This operation is one design option among several, since you could replace it with another multi-scale fusion module.
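For concreteness, here is a minimal sketch of such a pool-then-upsample module in the spirit of PSPNet's PPM (the class name, channel counts, and pyramid sizes are illustrative assumptions, not the exact FSNet implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimplePPM(nn.Module):
    # A sketch of a pyramid pooling module; hypothetical names, not FSNet's code.
    def __init__(self, in_ch, branch_ch, sizes=(1, 2, 3, 6)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Sequential(
                nn.AdaptiveAvgPool2d(s),         # pool to an s x s map
                nn.Conv2d(in_ch, branch_ch, 1),  # reduce channels per branch
                nn.ReLU(inplace=True),
            )
            for s in sizes
        )

    def forward(self, x):
        h, w = x.shape[-2:]
        # Upsample each pooled branch back to the input resolution, so the
        # module as a whole does not change the spatial size of its input.
        feats = [F.interpolate(b(x), size=(h, w), mode='bilinear',
                               align_corners=False) for b in self.branches]
        return torch.cat([x] + feats, dim=1)

ppm = SimplePPM(64, 16)
out = ppm(torch.randn(1, 64, 88, 88))
print(out.shape)  # torch.Size([1, 128, 88, 88]): 64 + 4 * 16 channels
```

The upsampling step is what makes the module resolution-preserving, which is what I meant above.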

But I have never investigated how many PPMs we should adopt in this network. Reducing their number would be an effective way to cut the model's parameters and then seek a better trade-off. If you have sufficient computing resources, you can try different variants to verify it.

Looking forward to the next discussion.

Best regards.

clelouch commented 3 years ago

@GewelsJI Thanks for your reply. I guess it might be better to employ AdaptiveAvgPool2d in the highest decoder stage to enlarge the receptive field, while using the PPM from PoolNet in the other decoder stages to generate multi-scale intermediate features with finer spatial resolution. It is worth noting that the input feature of the last decoder is 88x88, so the intermediate features generated by adaptive average pooling may be too coarse for pixel-wise prediction; for example, a 1x1 AdaptiveAvgPool2d can only produce a feature map with a single constant value. If I am not wrong, the PPM is utilized to exploit multi-scale spatial cues. Thus, feeding features with larger sizes to average pooling layers with different subsampling rates (x2, x4, x8) might be better, since that maintains the spatial structure.
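A small sketch of the contrast being suggested, assuming PyTorch and the 88x88 feature size from above (illustrative only, not the paper's code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.randn(1, 64, 88, 88)

# Adaptive pooling to a fixed 1x1 output: after upsampling, the branch is
# a constant map per channel, so all spatial structure is lost.
flat = F.interpolate(nn.AdaptiveAvgPool2d(1)(x), size=(88, 88),
                     mode='bilinear', align_corners=False)
print(flat[0, 0].std())  # ~0: the map carries no spatial variation

# Fixed subsampling rates (x2, x4, x8) keep the pooled resolution
# proportional to the input, preserving coarse spatial structure.
for rate in (2, 4, 8):
    pooled = F.avg_pool2d(x, kernel_size=rate, stride=rate)
    print(rate, tuple(pooled.shape[-2:]))  # -> 44x44, 22x22, 11x11
```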

GewelsJI commented 3 years ago

@clelouch

Yes, those suggestions sound very reasonable.

Thank you for providing constructive suggestions here. We will try them soon in a further extension of this paper.

It has been a nice discussion. Thank you again, and I hope you have a nice working day. If you have any questions or ideas, please feel free to contact me via email (gepengai.ji@gmail.com).

Best regards.

clelouch commented 3 years ago

@GewelsJI I also learned a lot from your work. Looking forward to your further extension. 😄