Haochen-Wang409 / U2PL

[CVPR'22 & IJCV'24] Semi-Supervised Semantic Segmentation Using Unreliable Pseudo-Labels & Using Unreliable Pseudo-Labels for Label-Efficient Semantic Segmentation
Apache License 2.0

About the performance of SupOnly and MT #78

Closed JoyHuYY1412 closed 2 years ago

JoyHuYY1412 commented 2 years ago

Hi Haochen, I have several general questions about your ablation.

  1. According to Table 2 and Table 3, it seems that MT (mean teacher) is not helpful except in very data-scarce scenarios (the 1/16 split). Can you give me some hints about these results? As we know, MT is very useful in classification tasks, so I am a little confused.

  2. I compared the sup-only results with U2PL and AEL, and under the 1/16 and 1/8 cases the results differ a lot. I think I will use your baselines for comparison. Could you please tell me whether you also use the OHEM loss in the Cityscapes sup-only case? It would be helpful for me.

Thank you so much!

Haochen-Wang409 commented 2 years ago
  1. I will try to answer your question, since it is indeed a little strange that MT is worse than the supervised baseline under the 1/8, 1/4, and 1/2 partition protocols in Tab. 2. This might be because of the differences between image classification and semantic segmentation: it is known that producing correct image-level pseudo-labels is much easier than producing pixel-level ones. Thus, for semi-supervised semantic segmentation, strong augmentation (e.g., CutOut and CutMix) plays an important role.
  2. Yes, we use OHEM when computing the supervised loss for all experiments conducted on Cityscapes (see the sketch below).
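
For reference, here is a minimal sketch of a pixel-level OHEM cross-entropy of the kind typically used on Cityscapes. The `thresh` and `min_kept` defaults are illustrative, not necessarily the exact values in the U2PL config:

```python
import torch
import torch.nn.functional as F

def ohem_ce_loss(logits, target, thresh=0.7, min_kept=100_000, ignore_index=255):
    """Pixel-wise OHEM cross-entropy: keep only "hard" pixels whose
    predicted probability for the ground-truth class is below `thresh`,
    while always keeping at least `min_kept` pixels.

    logits: (N, C, H, W) raw scores; target: (N, H, W) class indices.
    """
    pixel_loss = F.cross_entropy(
        logits, target, ignore_index=ignore_index, reduction="none"
    ).view(-1)

    with torch.no_grad():
        prob = F.softmax(logits, dim=1)
        # Probability the model assigns to the ground-truth class per pixel.
        tmp = target.clone()
        tmp[tmp == ignore_index] = 0  # placeholder class so gather is valid
        gt_prob = prob.gather(1, tmp.unsqueeze(1)).squeeze(1).view(-1)
        valid = target.view(-1) != ignore_index
        gt_prob[~valid] = 1.0  # ignored pixels are never selected

        # If fewer than `min_kept` pixels fall below `thresh`,
        # raise the cutoff so the `min_kept` hardest pixels are kept.
        if valid.sum() > min_kept:
            cutoff = max(thresh, torch.sort(gt_prob).values[min_kept].item())
        else:
            cutoff = 1.0  # too few valid pixels: keep them all
        mask = valid & (gt_prob < cutoff)

    kept = pixel_loss[mask]
    return kept.mean() if kept.numel() > 0 else pixel_loss[valid].mean()
```
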
JoyHuYY1412 commented 2 years ago

Thank you for your reply!

So when armed with strong augmentation, does the moving-average teacher help? I saw you used an EMA teacher in the overall pipeline; have you compared the results with a simply copied and detached teacher?

Haochen-Wang409 commented 2 years ago

Sorry, we did not ablate EMA, since it is a common practice in semi-supervised semantic segmentation. However, in semi-supervised image classification, FixMatch utilizes a copy of the student as the teacher instead of EMA.

From my perspective, EMA might not be so important for semi-supervised learning, but it is important for contrastive learning (please refer to MoCo). And since our work utilizes an extra contrastive loss, EMA might be indispensable in our framework.
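
For concreteness, here is a minimal sketch of the two teacher variants being discussed. The function names and the momentum value are illustrative, not taken from the U2PL code:

```python
import copy
import torch

@torch.no_grad()
def ema_update(teacher, student, momentum=0.99):
    # EMA teacher: each teacher parameter is an exponential moving
    # average of the corresponding student parameter.
    # (Buffers such as BN running stats are usually synced too; omitted here.)
    for t, s in zip(teacher.parameters(), student.parameters()):
        t.mul_(momentum).add_(s, alpha=1.0 - momentum)

def detached_copy_teacher(student):
    # FixMatch-style alternative: the "teacher" is simply a frozen
    # copy of the current student, with no moving average.
    teacher = copy.deepcopy(student).eval()
    for p in teacher.parameters():
        p.requires_grad_(False)
    return teacher
```
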

JoyHuYY1412 commented 2 years ago

Thank you for your thoughts. They are very helpful and I hope to discuss more with you in the future.

Haochen-Wang409 commented 2 years ago

Please do not hesitate to contact me if you have further questions~

JoyHuYY1412 commented 2 years ago

> Please do not hesitate to contact me if you have further questions~

Hi Haochen, thank you for your previous help!

Under the sup-only setting, I found the result on the 1/8 CPS split of Pascal VOC is 74.56%, which is quite high. I didn't change the sup-only code except for the data path.

Besides, when I remove all the strong augmentation and use the mean teacher for the unsupervised branch, the result only reaches 74.32% in early epochs and soon decreases.

Both results confuse me: the sup-only baseline is higher than I expected, and MT seems not to work for semi-supervised segmentation. Could you please give me some advice? I really appreciate your reply.

Haochen-Wang409 commented 2 years ago

In my opinion, if you want to verify the effectiveness of EMA in semi-supervised semantic segmentation, it may be better to compare CutMix with and without EMA, rather than MT against sup-only. This is because strong augmentation is quite important in semi-supervised semantic segmentation for preventing collapse.

As you have mentioned:

> when I remove all the strong augmentation and use the mean teacher for the unsupervised branch, the result only reaches 74.32% in early epochs and soon decreases

The performance degradation might come from fitting to incorrect pseudo-labels in the absence of strong data augmentation.

Recall the weak-to-strong pipeline: images are first fed into the teacher to obtain pseudo-labels, and then the images and the generated pseudo-labels are jointly used to produce strongly augmented training data.

Let me try to explain why strong augmentation is effective against wrong pseudo-labels. The EMA teacher cannot provide satisfactory pseudo-labels for the strongly augmented samples, yet we urge the student to produce high-quality predictions on them (the pseudo-ground-truths are generated from the weakly augmented views).
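
To make that pipeline concrete, here is a rough sketch of weak-to-strong training with CutMix. The helper names are hypothetical, and details such as confidence filtering and color jitter are omitted:

```python
import torch

def rand_bbox(h, w, ratio=0.5):
    # Hypothetical helper: a random box covering roughly `ratio` of the image.
    cut_h, cut_w = int(h * ratio ** 0.5), int(w * ratio ** 0.5)
    cy, cx = torch.randint(h, (1,)).item(), torch.randint(w, (1,)).item()
    y1, y2 = max(cy - cut_h // 2, 0), min(cy + cut_h // 2, h)
    x1, x2 = max(cx - cut_w // 2, 0), min(cx + cut_w // 2, w)
    return y1, y2, x1, x2

@torch.no_grad()
def weak_to_strong_batch(teacher, images_u):
    # 1) Pseudo-labels come from the *weakly* augmented images.
    logits = teacher(images_u)          # assumed (N, C, H, W)
    pseudo = logits.argmax(dim=1)       # (N, H, W) hard labels

    # 2) CutMix images and pseudo-labels with the *same* box so the
    #    pixel/label correspondence is preserved in the strong view.
    n, _, h, w = images_u.shape
    perm = torch.randperm(n)
    y1, y2, x1, x2 = rand_bbox(h, w)
    mixed_imgs, mixed_lbls = images_u.clone(), pseudo.clone()
    mixed_imgs[:, :, y1:y2, x1:x2] = images_u[perm][:, :, y1:y2, x1:x2]
    mixed_lbls[:, y1:y2, x1:x2] = pseudo[perm][:, y1:y2, x1:x2]

    # The student is then trained on (mixed_imgs, mixed_lbls): it must
    # match weak-view pseudo-labels from a strongly perturbed input.
    return mixed_imgs, mixed_lbls
```
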

By the way, maybe the momentum coefficient can be further tuned.
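
If you do tune it, one common heuristic (again, an assumption on my part, not necessarily what U2PL uses) is to warm the momentum up over training so the teacher tracks the student closely at first:

```python
def ema_momentum(step, base=0.99):
    # Warm-up schedule: a small momentum early on lets the teacher
    # follow the student while it is still unreliable, then the
    # momentum anneals toward the base value for a stable teacher.
    return min(1.0 - 1.0 / (step + 1), base)
```
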