bowenc0221 / panoptic-deeplab

This is the PyTorch re-implementation of our CVPR 2020 paper "Panoptic-DeepLab: A Simple, Strong, and Fast Baseline for Bottom-Up Panoptic Segmentation" (https://arxiv.org/abs/1911.10194)
Apache License 2.0

Confusions about multi-scale test #69

Closed Jensen-Su closed 3 years ago

Jensen-Su commented 3 years ago

I performed evaluation on different scales respectively and got the following results:

| scale | PQ | AP | IoU |
| --- | --- | --- | --- |
| 0.5 | 57.2 | 27.9 | 77.2 |
| 1 | 61.9 | 36.2 | 80.2 |
| 1.5 | 61.4 | 34.2 | 79.5 |
| 2 | 61.0 | 32.3 | 76.8 |
| ensemble [1, 1.5] | 62.5 | 35.4 | 80.4 |
| ensemble [0.5, 1, 1.5, 2] | 61.8 | 35.3 | 80.6 |

where scale 1 corresponds to 1024x2048 and flipping is always applied. The table shows that performance at scales 0.5 and 2 is much worse, and that adding scale 0.5 or 2 (or both) to the ensemble degrades the results.

All interpolations for network outputs are set to mode='bilinear', align_corners=True, and the input image is resized using cv2.resize(img, (scaled_w, scaled_h), interpolation=cv2.INTER_LINEAR).
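For reference, here is a minimal sketch of the single-scale pipeline described above; `model` and `infer_at_scale` are hypothetical placeholders, not the repo's actual API:

```python
import cv2
import torch
import torch.nn.functional as F

def infer_at_scale(model, img_bgr, scale):
    """Resize the input with cv2, run the network, upsample logits back.

    `model` is a hypothetical callable returning semantic logits of
    shape [1, C, h, w]; it stands in for the actual network.
    """
    H, W = img_bgr.shape[:2]
    scaled_w, scaled_h = int(W * scale), int(H * scale)
    resized = cv2.resize(img_bgr, (scaled_w, scaled_h),
                         interpolation=cv2.INTER_LINEAR)
    x = torch.from_numpy(resized).permute(2, 0, 1).float().unsqueeze(0)
    with torch.no_grad():
        logits = model(x)  # [1, C, scaled_h', scaled_w']
    # Interpolate the outputs back to the evaluation resolution,
    # matching the settings above.
    return F.interpolate(logits, size=(H, W),
                         mode='bilinear', align_corners=True)
```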

Here are my confusions: is such a heavy degradation at scales 0.5 and 2 expected, and should those scales then be excluded from the ensemble?

bowenc0221 commented 3 years ago

Are you implementing the multi-scale test with the Detectron2 code? Please follow the resizing implementation in Detectron2 to properly resize images. Detectron2 does not use OpenCV to process images, and I'm sure cv2.resize is the problem that causes the degradation.

Jensen-Su commented 3 years ago

> Are you implementing the multi-scale test with the Detectron2 code? Please follow the resizing implementation in Detectron2 to properly resize images. Detectron2 does not use OpenCV to process images, and I'm sure cv2.resize is the problem that causes the degradation.

Yes, I am using Detectron2, which resizes images with PIL.Image.resize in Image.BILINEAR mode. My model was multi-scale trained with Detectron2 under the default configs. Following your suggestion, I compared the two resize operations, PIL.Image.resize with Image.BILINEAR and cv2.resize with cv2.INTER_LINEAR, and found only small differences:

| scale 0.5 | PQ | AP | IoU |
| --- | --- | --- | --- |
| cv2.resize | 57.2 | 27.9 | 77.2 |
| Image.resize | 57.0 | 27.4 | 77.0 |
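For what it's worth, the two bilinear implementations can be compared directly on pixel values; a quick sketch (the image filename is hypothetical):

```python
import cv2
import numpy as np
from PIL import Image

# Hypothetical Cityscapes frame; any HxWx3 uint8 image works here.
img = cv2.imread("frankfurt_000000_000294_leftImg8bit.png")
h, w = img.shape[:2]

# Both APIs take the target size as (width, height).
out_cv = cv2.resize(img, (w // 2, h // 2), interpolation=cv2.INTER_LINEAR)
out_pil = np.asarray(Image.fromarray(img).resize((w // 2, h // 2),
                                                 Image.BILINEAR))

# Per-pixel difference between the two bilinear implementations.
diff = np.abs(out_cv.astype(np.float32) - out_pil.astype(np.float32))
print(diff.mean(), diff.max())
```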

The images were resized to [512, 1024] for inference, and the results were interpolated back to [1024, 2048] using F.interpolate with mode='bilinear', align_corners=True for evaluation.

Are degradations of 3 points in IoU and 8 points in AP relative to scale 1 reasonable?

bowenc0221 commented 3 years ago

What is the PQ you got by running the following command?

```
python train_net.py --config-file configs/Cityscapes-PanopticSegmentation/panoptic_deeplab_R_52_os16_mg124_poly_90k_bs32_crop_512_1024_dsconv.yaml --eval-only MODEL.WEIGHTS /path/to/model_checkpoint INPUT.MIN_SIZE_TEST 512 INPUT.MAX_SIZE_TEST 1024
```
Jensen-Su commented 3 years ago

> What is the PQ you got by running the following command?
>
> ```
> python train_net.py --config-file configs/Cityscapes-PanopticSegmentation/panoptic_deeplab_R_52_os16_mg124_poly_90k_bs32_crop_512_1024_dsconv.yaml --eval-only MODEL.WEIGHTS /path/to/model_checkpoint INPUT.MIN_SIZE_TEST 512 INPUT.MAX_SIZE_TEST 1024
> ```
Here are the results I got by varying the INPUT config:

| MIN_SIZE_TEST | MAX_SIZE_TEST | PQ | IoU | AP |
| --- | --- | --- | --- | --- |
| 512 | 1024 | 55.1 | 76.6 | 26.3 |
| 1024 | 2048 | 61.5 | 79.8 | 36.3 |
| 1536 | 3072 | 60.9 | 79.1 | 34.6 |

At 512x1024 I got even lower performance than with my own resizing: PQ=55.1, IoU=76.6, AP=26.3.
(The MIN_SIZE_TRAIN setting is (512, 640, 704, 832, 896, 1024, 1152, 1216, 1344, 1408, 1536, 1664, 1728, 1856, 1920, 2048))

It seems that the dramatic performance degradation at scale 0.5 is inevitable? Does that mean we shouldn't include scale 0.5 in the ensemble?

bowenc0221 commented 3 years ago

Yes, if you only run inference at the 0.5 scale, the performance will drop for sure. But adding it to multi-scale testing can still help (i.e., averaging predictions from the 0.5, 1.0, 2.0 scales, etc.).
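A minimal sketch of that averaging, reusing the hypothetical `infer_at_scale` helper from above (not the repo's actual test-time code):

```python
import torch

def multi_scale_predict(model, img_bgr, scales=(0.5, 1.0, 2.0)):
    # Average full-resolution semantic logits over the test scales;
    # a flipped pass per scale could be averaged in the same way.
    logits = [infer_at_scale(model, img_bgr, s) for s in scales]
    avg = torch.stack(logits, dim=0).mean(dim=0)  # [1, C, H, W]
    return avg.argmax(dim=1)                      # per-pixel class ids
```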

The training scales are for data augmentation; the purpose is different.