LiheYoung / ST-PlusPlus

[CVPR 2022] ST++: Make Self-training Work Better for Semi-supervised Semantic Segmentation
https://arxiv.org/abs/2106.05095
MIT License

About the "evaluation bug" #3

Closed Haochen-Wang409 closed 3 years ago

Haochen-Wang409 commented 3 years ago

Hi, I'm a beginner in semantic segmentation, and thanks for your great work. It seems ST++ is the only method that selects a "reliable DT mask" for unlabeled images.

I'm interested in the "evaluation bug" mentioned in Table 1. Could you give me a brief introduction to it?

LiheYoung commented 3 years ago

Hi Haochen, thank you for your interest in our work.

The evaluation bug was reported by the authors of GCT. Briefly speaking, in the original GCT implementation and its reported performance, the authors applied a CenterCrop operation to the testing images during inference, which should not be done. The correct practice is to predict and evaluate each testing image at its original resolution.

You may refer to the second Note at this link for more details.

Haochen-Wang409 commented 3 years ago

Thanks for your explanation! If we must evaluate each testing image at its original shape, the batch size has to be 1 during validation. Am I right?

LiheYoung commented 3 years ago

Yes, you are right! :)
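
Since test images generally have different original resolutions, they cannot be stacked into a single tensor, so validation proceeds one image at a time. A minimal sketch of this evaluation scheme (a toy numpy mIoU loop, not the actual ST++ code; the `predict` callable and class count are placeholders):

```python
import numpy as np

def evaluate_miou(images, labels, predict, num_classes):
    """Evaluate mIoU image-by-image at each image's original resolution.

    Because test images have differing H x W shapes, they cannot be
    batched together, so the effective batch size is 1.
    """
    inter = np.zeros(num_classes, dtype=np.int64)
    union = np.zeros(num_classes, dtype=np.int64)
    for img, lbl in zip(images, labels):
        pred = predict(img)  # class map at the image's original H x W
        assert pred.shape == lbl.shape, "no cropping/resizing at eval time"
        for c in range(num_classes):
            p, g = pred == c, lbl == c
            inter[c] += np.logical_and(p, g).sum()
            union[c] += np.logical_or(p, g).sum()
    iou = inter / np.maximum(union, 1)  # guard against absent classes
    return iou.mean()

# Toy check: two images with different original shapes; a perfect
# "model" (prediction equals ground truth) should score mIoU = 1.0.
labels = [np.arange(35).reshape(5, 7) % 3,
          np.arange(36).reshape(4, 9) % 3]
images = labels  # stand-in inputs for the toy predictor
miou = evaluate_miou(images, labels, predict=lambda x: x, num_classes=3)
print(round(float(miou), 4))  # → 1.0
```

In a PyTorch codebase this corresponds to building the validation `DataLoader` with `batch_size=1` and skipping any crop/resize transform at test time.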

Haochen-Wang409 commented 3 years ago

Thanks a lot!

pascal1129 commented 3 years ago

I found that CenterCrop was used in Semi-Supervised Semantic Segmentation with Cross Pseudo Supervision and Pixel Contrastive-Consistent Semi-Supervised Semantic Segmentation, which were published in CVPR 2021 and ICCV 2021 respectively. So I am quite confused about whether to use CenterCrop in evaluation.

LiheYoung commented 3 years ago

From my perspective, it is more practical to evaluate at the original resolution in semantic segmentation.