HRNet / HRNet-Semantic-Segmentation

This is the official implementation of semantic segmentation for HRNet (https://arxiv.org/abs/1908.07919). The OCR approach is rephrased as Segmentation Transformer: https://arxiv.org/abs/1909.11065.

Image size difference between training and testing #76

Open AtsukiOsanai opened 4 years ago

AtsukiOsanai commented 4 years ago

Thank you for providing this nice repository. I'd like to ask about the image size used during training and testing on the Cityscapes dataset. For Cityscapes training, you use (512, 1024) cropped images. In single-scale testing, however, inference is done on the whole image, i.e. (1024, 2048). I found that some works employ sliding inference with the training crop size during testing. (https://github.com/junfu1115/DANet/blob/master/encoding/models/base.py#L78-L179)

So, my questions are:

sunke123 commented 4 years ago

If we used whole images to train the network, the batch size would be too small, e.g. 1 image per GPU, which has a negative influence on BN and makes training unstable.

I think DANet uses sliding inference during testing, not training. We feed the whole image into the network for single-scale testing to compare fairly with other methods. You can try multi-scale testing if you want to use sliding inference. Sliding inference certainly brings gains, but it also costs speed.
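For reference, here is a minimal sketch of what sliding-window inference looks like; this is not the repo's code, and the crop size, stride, and `model` interface are illustrative assumptions (the model is assumed to return per-crop logits):

```python
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def sliding_inference(model, image, num_classes, crop=(769, 769), stride=(513, 513)):
    # image: (1, 3, H, W) tensor; model(patch) -> (1, C, h, w) logits.
    # Assumes the crop is no larger than the image.
    _, _, H, W = image.shape
    ch, cw = crop
    sh, sw = stride
    rows = max(math.ceil((H - ch) / sh), 0) + 1
    cols = max(math.ceil((W - cw) / sw), 0) + 1
    logits = image.new_zeros((1, num_classes, H, W))
    count = image.new_zeros((1, 1, H, W))
    for r in range(rows):
        for c in range(cols):
            # Clamp the last window so it ends exactly at the image border.
            top = min(r * sh, H - ch)
            left = min(c * sw, W - cw)
            patch = image[:, :, top:top + ch, left:left + cw]
            out = model(patch)
            # Upsample logits back to the patch resolution if the network downsamples.
            out = F.interpolate(out, size=(ch, cw), mode='bilinear', align_corners=False)
            logits[:, :, top:top + ch, left:left + cw] += out
            count[:, :, top:top + ch, left:left + cw] += 1
    # Average overlapping predictions; argmax over dim 1 gives the label map.
    return logits / count
```

The overlap between neighbouring windows is what makes this slower than a single whole-image forward pass: every pixel in an overlapped region is processed more than once.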

AtsukiOsanai commented 4 years ago

Thank you for answering my question.

Maybe I didn't express what I wanted to say clearly. I understand the importance of batch size in semantic segmentation, and I keep the batch size above 8 when training.

You are correct about the DANet pipeline. Like DANet, I train an FCN with random (769, 769) cropping, and I measure scores for both sliding inference with (769, 769) patches and whole-image inference at (1024, 2048). With the sliding method I get 70% mIoU on Cityscapes (training ran for only 40 epochs, which lowers the score). Whole-image inference, however, achieves only 65% mIoU. I can't understand why whole-image inference fails to predict the output well.

Do you have any insight into this? Does it depend on the network architecture? I really hope to reduce my testing time. Thanks.

sunke123 commented 4 years ago

The sliding method is often used in image processing, e.g. image de-blocking and deblurring, and leads to better performance. It crops the image into many overlapping patches, which also increases the inference time.

HRNet with multi-scale testing (including the sliding process) can improve mIoU by 1%~1.5%. I have no idea why there is a 5% performance gap; it's too large. I'm not sure whether the architecture causes this problem. You could try other networks, e.g. HRNet or PSPNet.

If you want to reduce testing time, you can concatenate the cropped image patches along the batch axis.
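A minimal sketch of that idea, under the same illustrative assumptions as the earlier sliding-inference snippet (crop/stride values and the `model` interface are not from this repo): all crops are stacked into one batch so the network runs a single forward pass instead of one per crop.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def batched_crop_inference(model, image, num_classes, crop=(769, 769), stride=(513, 513)):
    # image: (1, 3, H, W); assumes the crop is no larger than the image.
    _, _, H, W = image.shape
    ch, cw = crop
    sh, sw = stride
    # Build the crop grid, clamping the last row/column to the image border.
    tops = sorted({min(t, H - ch) for t in range(0, H - ch + sh, sh)})
    lefts = sorted({min(l, W - cw) for l in range(0, W - cw + sw, sw)})
    boxes = [(t, l) for t in tops for l in lefts]
    # Stack all crops along the batch axis and run the network once.
    batch = torch.cat([image[:, :, t:t + ch, l:l + cw] for t, l in boxes], dim=0)
    out = model(batch)                                            # (N, C, h, w)
    out = F.interpolate(out, size=(ch, cw), mode='bilinear', align_corners=False)
    # Scatter the per-crop logits back and average the overlaps.
    logits = image.new_zeros((1, num_classes, H, W))
    count = image.new_zeros((1, 1, H, W))
    for i, (t, l) in enumerate(boxes):
        logits[:, :, t:t + ch, l:l + cw] += out[i:i + 1]
        count[:, :, t:t + ch, l:l + cw] += 1
    return logits / count
```

This trades memory for speed: the whole crop batch must fit on the GPU at once, which is easiest when all test images share one resolution, as on Cityscapes.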

AtsukiOsanai commented 4 years ago

Stacking the crops along the batch axis is a great idea for me. This should be used only when the input images all have the same size, as in Cityscapes, right?

I compared my code with yours and found a difference. I use torch.nn.functional.interpolate with align_corners=True to control the image size, whereas you use cv2.resize. align_corners=False is probably the closer match to cv2.resize, so I will check whole-image inference again with align_corners=False.
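A quick way to check that assumption empirically (an illustrative snippet, not from this repo): bilinear cv2.resize should line up with align_corners=False, since both sample on a half-pixel-centre grid, while align_corners=True uses a different grid and diverges most near the borders.

```python
import cv2
import numpy as np
import torch
import torch.nn.functional as F

x = np.random.rand(8, 8).astype(np.float32)
t = torch.from_numpy(x)[None, None]                # (1, 1, 8, 8)

cv = cv2.resize(x, (16, 16), interpolation=cv2.INTER_LINEAR)
ac_false = F.interpolate(t, size=(16, 16), mode='bilinear', align_corners=False)[0, 0].numpy()
ac_true = F.interpolate(t, size=(16, 16), mode='bilinear', align_corners=True)[0, 0].numpy()

print(np.abs(cv - ac_false).max())  # expected to be near zero
print(np.abs(cv - ac_true).max())   # expected to be noticeably larger, especially at the edges
```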

After sharing my update, I will close this issue. Thanks.