bowenc0221 / panoptic-deeplab

This is the PyTorch re-implementation of our CVPR 2020 paper "Panoptic-DeepLab: A Simple, Strong, and Fast Baseline for Bottom-Up Panoptic Segmentation" (https://arxiv.org/abs/1911.10194)
Apache License 2.0

Confusions about multi-scale test #69

Closed Jensen-Su closed 3 years ago

Jensen-Su commented 3 years ago

I performed evaluation on different scales respectively and got the following results:

| scale | PQ | AP | IoU |
| --- | --- | --- | --- |
| 0.5 | 57.2 | 27.9 | 77.2 |
| 1 | 61.9 | 36.2 | 80.2 |
| 1.5 | 61.4 | 34.2 | 79.5 |
| 2 | 61.0 | 32.3 | 76.8 |
| ensemble [1, 1.5] | 62.5 | 35.4 | 80.4 |
| ensemble [0.5, 1, 1.5, 2] | 61.8 | 35.3 | 80.6 |

where scale 1 corresponds to 1024x2048 and flipping is always applied. The table shows that performance at scales 0.5 and 2 is much worse, and that adding scale 0.5 or 2 (or both) to the ensemble degrades the results.

All interpolations for network outputs are set to mode='bilinear', align_corners=True, and the input image is resized using cv2.resize(img, (scaled_w, scaled_h), interpolation=cv2.INTER_LINEAR).
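For reference, here is a minimal sketch of the single-scale pipeline described above; `model` and `infer_at_scale` are hypothetical placeholders, not the repo's actual API:

```python
import cv2
import torch
import torch.nn.functional as F

def infer_at_scale(model, img_bgr, scale):
    """Resize the input with cv2, run the network, upsample logits back.

    `model` is a hypothetical callable returning semantic logits of
    shape [1, C, h, w]; it stands in for the actual network.
    """
    H, W = img_bgr.shape[:2]
    scaled_w, scaled_h = int(W * scale), int(H * scale)
    resized = cv2.resize(img_bgr, (scaled_w, scaled_h),
                         interpolation=cv2.INTER_LINEAR)
    x = torch.from_numpy(resized).permute(2, 0, 1).float().unsqueeze(0)
    with torch.no_grad():
        logits = model(x)  # [1, C, scaled_h', scaled_w']
    # Interpolate the outputs back to the evaluation resolution,
    # matching the settings above.
    return F.interpolate(logits, size=(H, W),
                         mode='bilinear', align_corners=True)
```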

Here are my confusions: is such a heavy degradation at scales 0.5 and 2 expected, and should those scales then be excluded from the ensemble?

bowenc0221 commented 3 years ago

Are you implementing the multi-scale test with the Detectron2 code? Please follow the resizing implementation in Detectron2 to properly resize images. Detectron2 does not use OpenCV to process images, and I'm sure cv2.resize is the problem that causes the degradation.

Jensen-Su commented 3 years ago

> Are you implementing the multi-scale test with the Detectron2 code? Please follow the resizing implementation in Detectron2 to properly resize images. Detectron2 does not use OpenCV to process images, and I'm sure cv2.resize is the problem that causes the degradation.

Yes, I am using Detectron2, which resizes images with PIL.Image.resize in Image.BILINEAR mode. My model was multi-scale trained with Detectron2 under the default configs. Following your suggestion, I compared the two resize operations, PIL.Image.resize with Image.BILINEAR and cv2.resize with cv2.INTER_LINEAR, and found only small differences:

| scale 0.5 | PQ | AP | IoU |
| --- | --- | --- | --- |
| cv2.resize | 57.2 | 27.9 | 77.2 |
| Image.resize | 57.0 | 27.4 | 77.0 |
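For what it's worth, the two bilinear implementations can be compared directly on pixel values; a quick sketch (the image filename is hypothetical):

```python
import cv2
import numpy as np
from PIL import Image

# Hypothetical Cityscapes frame; any HxWx3 uint8 image works here.
img = cv2.imread("frankfurt_000000_000294_leftImg8bit.png")
h, w = img.shape[:2]

# Both APIs take the target size as (width, height).
out_cv = cv2.resize(img, (w // 2, h // 2), interpolation=cv2.INTER_LINEAR)
out_pil = np.asarray(Image.fromarray(img).resize((w // 2, h // 2),
                                                 Image.BILINEAR))

# Per-pixel difference between the two bilinear implementations.
diff = np.abs(out_cv.astype(np.float32) - out_pil.astype(np.float32))
print(diff.mean(), diff.max())
```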

The images were resized to [512, 1024] for inference, and the results were interpolated back to [1024, 2048] using F.interpolate with mode='bilinear', align_corners=True for evaluation.

Are degradations of 3 points in IoU and 8 points in AP relative to scale 1 reasonable?

bowenc0221 commented 3 years ago

What is the PQ you got by running the following command?

```
python train_net.py --config-file configs/Cityscapes-PanopticSegmentation/panoptic_deeplab_R_52_os16_mg124_poly_90k_bs32_crop_512_1024_dsconv.yaml --eval-only MODEL.WEIGHTS /path/to/model_checkpoint INPUT.MIN_SIZE_TEST 512 INPUT.MAX_SIZE_TEST 1024
```
Jensen-Su commented 3 years ago

> What is the PQ you got by running the following command?
>
> ```
> python train_net.py --config-file configs/Cityscapes-PanopticSegmentation/panoptic_deeplab_R_52_os16_mg124_poly_90k_bs32_crop_512_1024_dsconv.yaml --eval-only MODEL.WEIGHTS /path/to/model_checkpoint INPUT.MIN_SIZE_TEST 512 INPUT.MAX_SIZE_TEST 1024
> ```
Here are the results I got by varying the INPUT config:

| MIN_SIZE_TEST | MAX_SIZE_TEST | PQ | IoU | AP |
| --- | --- | --- | --- | --- |
| 512 | 1024 | 55.1 | 76.6 | 26.3 |
| 1024 | 2048 | 61.5 | 79.8 | 36.3 |
| 1536 | 3072 | 60.9 | 79.1 | 34.6 |

At 512x1024 I got even lower performance than with my own resizing: PQ=55.1, IoU=76.6, AP=26.3.
(The MIN_SIZE_TRAIN setting is (512, 640, 704, 832, 896, 1024, 1152, 1216, 1344, 1408, 1536, 1664, 1728, 1856, 1920, 2048))

It seems that the dramatic performance degradation at scale 0.5 is inevitable? Does that mean we shouldn't include scale 0.5 in the ensemble?

bowenc0221 commented 3 years ago

Yes, if you only run inference at the 0.5 scale, the performance will drop for sure. But adding it to multi-scale testing can still help (i.e., averaging predictions from the 0.5, 1.0, 2.0 scales, etc.).
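A minimal sketch of that averaging, reusing the hypothetical `infer_at_scale` helper from above (not the repo's actual test-time code):

```python
import torch

def multi_scale_predict(model, img_bgr, scales=(0.5, 1.0, 2.0)):
    # Average full-resolution semantic logits over the test scales;
    # a flipped pass per scale could be averaged in the same way.
    logits = [infer_at_scale(model, img_bgr, s) for s in scales]
    avg = torch.stack(logits, dim=0).mean(dim=0)  # [1, C, H, W]
    return avg.argmax(dim=1)                      # per-pixel class ids
```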

The training scales are for data augmentation; the purpose is different.