LiheYoung / UniMatch

[CVPR 2023] Revisiting Weak-to-Strong Consistency in Semi-Supervised Semantic Segmentation
https://arxiv.org/abs/2208.09910
MIT License

An observation of using 513x513 crop size under 92 split. #20

Closed BBBBchan closed 1 year ago

BBBBchan commented 1 year ago

Great work! According to your paper, UniMatch performs well on most splits, especially the small ones (92 and 183). However, when I tried to reproduce the results and explore a bit further, I found an interesting phenomenon.

In most semi-supervised semantic segmentation methods, using a larger crop size (513 vs. 321) usually leads to a performance improvement. On the 92 split, UniMatch achieves good performance (74.5-75.0) with a 321 crop size. However, with a 513x513 crop size, the performance only reaches around 72.5-73.0, and overfitting appears very early, leading to performance degradation.

I noticed that the results reported in the paper also use the 321 crop size. I wonder if you have run experiments with the 513 crop size on the small splits (92 or 183)?

Attached is the mIoU curve during training on the 92 split with the 513 crop size. The highest performance is reached at about 10 epochs (72.8); after 80 epochs of training, performance is around 68. [Attached image: mIoU curve over training]

LiheYoung commented 1 year ago

Thank you for pointing this out. Actually, we made similar observations during our explorations after the submission.

Concretely, on the 92 and 183 splits, the 321 training size is better than 513. However, in higher-data regimes, such as 366, 732, 1464, 1/16, 1/8, and 1/4 labeled images, the 513 training size yields stronger performance. Moreover, this phenomenon does not seem to be limited to our UniMatch; we observe similar trends with our other algorithms.

We conjecture that this is because the smaller training size 321 serves as a stronger data augmentation than the larger size 513: the cropped training images are more diverse at size 321. Meanwhile, low-data regimes, such as 92 and 183 labeled images, typically require stronger augmentation to avoid overfitting. Therefore, I think this is why the smaller training size works better there.
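As a rough illustration of this intuition (the image size below is an assumption of a typical ~500x375 PASCAL VOC image, and the real pipeline also applies random rescaling before cropping), the number of distinct crop positions shrinks dramatically at size 513:

# Illustrative arithmetic only: count the distinct top-left crop positions for an
# assumed ~500x375 image, padding the image up to the crop size when it is smaller.
def num_crop_positions(img_h, img_w, crop):
    h, w = max(img_h, crop), max(img_w, crop)
    return (h - crop + 1) * (w - crop + 1)

for crop in (321, 513):
    print(crop, num_crop_positions(375, 500, crop))
# 321 -> 9900 possible crops; 513 -> 1 (the padded image itself), i.e. far less spatial diversity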

BBBBchan commented 1 year ago

Thanks for your response.

According to your ablation studies, the drop-channel operation is critical to UniMatch (63.9 -> 72.0), and our reproduction (63.3 -> 73.3) confirms that. However, in a "wrong" experimental setting, we accidentally kept the forward pass of the drop channel (the same forward as UniPerb but the same backward as FixMatch) when we tested the performance of FixMatch. In this case, the model reached 73.9. It seems the forward pass over the dropped-channel features is itself very important, while back-propagating gradients through them does not seem necessary. We find this counter-intuitive.

What is your opinion on this phenomenon? Did you run similar experiments during your exploration?

LiheYoung commented 1 year ago

Do you mean you use plain FixMatch for training but drop half of its feature channels only during test and achieve 73.9% mIoU?

That would be very counter-intuitive. Do you manually zero out half of the channels at test time? Since we call model.eval() during testing, directly using nn.Dropout2d() won't have any effect there.
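As a minimal sketch of this point (illustrative only, not repository code), nn.Dropout2d is active only in training mode and becomes an identity under model.eval():

import torch
import torch.nn as nn

drop = nn.Dropout2d(p=0.5)
feat = torch.randn(1, 4, 2, 2)

drop.train()
out_train = drop(feat)   # roughly half of the channels are zeroed, the rest scaled by 1/(1-p)

drop.eval()
out_eval = drop(feat)    # identity: nothing is dropped at test time
print(torch.equal(out_eval, feat))  # True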

BBBBchan commented 1 year ago

Sorry for not being clear enough. The drop-channel operation is involved during training, but only in the forward pass. Nothing is changed at test time.

Specifically, at the code level: to go from UniMatch to UniPerb, we did not change the deeplabv3plus.py file. In unimatch.py, we removed the code related to img_u_s2 and changed the model forward part as follows:

# original
preds, preds_fp = model(torch.cat((img_x, img_u_w)), True)
pred_x, pred_u_w = preds.split([num_lb, num_ulb])
pred_u_w_fp = preds_fp[num_lb:]
pred_u_s1, pred_u_s2 = model(torch.cat((img_u_s1, img_u_s2))).chunk(2)

# Ours
preds, preds_fp = model(torch.cat((img_x, img_u_w)), True)
pred_x, pred_u_w = preds.split([num_lb, num_ulb])
pred_u_w_fp = preds_fp[num_lb:]
pred_u_s1 = model(img_u_s1)

When calculating the loss, the corresponding change is:

# Original
loss_x = criterion_l(pred_x, mask_x)

loss_u_s1 = criterion_u(pred_u_s1, mask_u_w_cutmixed1)
loss_u_s1 = loss_u_s1 * ((conf_u_w_cutmixed1 >= cfg['conf_thresh']) & (ignore_mask_cutmixed1 != 255))
loss_u_s1 = torch.sum(loss_u_s1) / torch.sum(ignore_mask_cutmixed1 != 255).item()

loss_u_s2 = criterion_u(pred_u_s2, mask_u_w_cutmixed2)
loss_u_s2 = loss_u_s2 * ((conf_u_w_cutmixed2 >= cfg['conf_thresh']) & (ignore_mask_cutmixed2 != 255))
loss_u_s2 = torch.sum(loss_u_s2) / torch.sum(ignore_mask_cutmixed2 != 255).item()

loss_u_w_fp = criterion_u(pred_u_w_fp, mask_u_w)
loss_u_w_fp = loss_u_w_fp * ((conf_u_w >= cfg['conf_thresh']) & (ignore_mask != 255))
loss_u_w_fp = torch.sum(loss_u_w_fp) / torch.sum(ignore_mask != 255).item()

loss = (loss_x + loss_u_s1 * 0.25 + loss_u_s2 * 0.25 + loss_u_w_fp * 0.5) / 2.0

# Ours
loss_x = criterion_l(pred_x, mask_x)

loss_u_s1 = criterion_u(pred_u_s1, mask_u_w_cutmixed1)
loss_u_s1 = loss_u_s1 * ((conf_u_w_cutmixed1 >= cfg['conf_thresh']) & (ignore_mask_cutmixed1 != 255))
loss_u_s1 = torch.sum(loss_u_s1) / torch.sum(ignore_mask_cutmixed1 != 255).item()

# loss_u_s2 = criterion_u(pred_u_s2, mask_u_w_cutmixed2)
# loss_u_s2 = loss_u_s2 * ((conf_u_w_cutmixed2 >= cfg['conf_thresh']) & (ignore_mask_cutmixed2 != 255))
# loss_u_s2 = torch.sum(loss_u_s2) / torch.sum(ignore_mask_cutmixed2 != 255).item()

loss_u_w_fp = criterion_u(pred_u_w_fp, mask_u_w)
loss_u_w_fp = loss_u_w_fp * ((conf_u_w >= cfg['conf_thresh']) & (ignore_mask != 255))
loss_u_w_fp = torch.sum(loss_u_w_fp) / torch.sum(ignore_mask != 255).item()

loss = (loss_x + loss_u_s1 * 0.25 + loss_u_w_fp * 0.5) / 2.0

Then we tried to turn the UniPerb code into FixMatch. I think the correct way is to modify both the forward pass and the loss calculation like this:

# The correct way to modify
preds = model(torch.cat((img_x, img_u_w)))
pred_x, pred_u_w = preds.split([num_lb, num_ulb])
pred_u_s1 = model(img_u_s1)

loss_x = criterion_l(pred_x, mask_x)

loss_u_s1 = criterion_u(pred_u_s1, mask_u_w_cutmixed1)
loss_u_s1 = loss_u_s1 * ((conf_u_w_cutmixed1 >= cfg['conf_thresh']) & (ignore_mask_cutmixed1 != 255))
loss_u_s1 = torch.sum(loss_u_s1) / torch.sum(ignore_mask_cutmixed1 != 255).item()

# loss_u_s2 = criterion_u(pred_u_s2, mask_u_w_cutmixed2)
# loss_u_s2 = loss_u_s2 * ((conf_u_w_cutmixed2 >= cfg['conf_thresh']) & (ignore_mask_cutmixed2 != 255))
# loss_u_s2 = torch.sum(loss_u_s2) / torch.sum(ignore_mask_cutmixed2 != 255).item()

# loss_u_w_fp = criterion_u(pred_u_w_fp, mask_u_w)
# loss_u_w_fp = loss_u_w_fp * ((conf_u_w >= cfg['conf_thresh']) & (ignore_mask != 255))
# loss_u_w_fp = torch.sum(loss_u_w_fp) / torch.sum(ignore_mask != 255).item()

loss = (loss_x + loss_u_s1 * 0.25) / 2.0

However, we made a "mistake" when modifying the code: we did not change the forward part, and only commented out the calculation and backward of loss_u_w_fp.

# Ours
preds, preds_fp = model(torch.cat((img_x, img_u_w)), True)
pred_x, pred_u_w = preds.split([num_lb, num_ulb])
pred_u_w_fp = preds_fp[num_lb:]
pred_u_s1 = model(img_u_s1)

loss_x = criterion_l(pred_x, mask_x)

loss_u_s1 = criterion_u(pred_u_s1, mask_u_w_cutmixed1)
loss_u_s1 = loss_u_s1 * ((conf_u_w_cutmixed1 >= cfg['conf_thresh']) & (ignore_mask_cutmixed1 != 255))
loss_u_s1 = torch.sum(loss_u_s1) / torch.sum(ignore_mask_cutmixed1 != 255).item()

# loss_u_s2 = criterion_u(pred_u_s2, mask_u_w_cutmixed2)
# loss_u_s2 = loss_u_s2 * ((conf_u_w_cutmixed2 >= cfg['conf_thresh']) & (ignore_mask_cutmixed2 != 255))
# loss_u_s2 = torch.sum(loss_u_s2) / torch.sum(ignore_mask_cutmixed2 != 255).item()

# loss_u_w_fp = criterion_u(pred_u_w_fp, mask_u_w)
# loss_u_w_fp = loss_u_w_fp * ((conf_u_w >= cfg['conf_thresh']) & (ignore_mask != 255))
# loss_u_w_fp = torch.sum(loss_u_w_fp) / torch.sum(ignore_mask != 255).item()

loss = (loss_x + loss_u_s1 * 0.25) / 2.0

This modification is not standard, because the drop-channel operation is performed in the forward pass but its output is never used in the loss. However, under this "wrong" setting the model reached 73.9, while the "correct" setting only achieved 63.3.

In my understanding, the only difference between these two settings is whether the drop-channel branch participates in the forward pass. Its impact on the network should be limited to BatchNorm, and should at least not lead to any performance improvement. I find this phenomenon counter-intuitive.
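To make the BatchNorm point concrete, here is a minimal sketch (illustrative, not repository code): BN layers update their running statistics in every training-mode forward pass, even when the outputs never contribute to any loss or backward pass.

import torch
import torch.nn as nn

bn = nn.BatchNorm2d(4)                      # running_mean starts at 0, running_var at 1
bn.train()

feat = torch.randn(8, 4, 16, 16) * 3 + 5    # features with non-trivial statistics
dropped = nn.Dropout2d(p=0.5)(feat)         # a channel-dropped copy, like the FP branch

before = bn.running_mean.clone()
with torch.no_grad():                       # no loss, no backward
    _ = bn(dropped)
print(torch.allclose(before, bn.running_mean))  # False: the running stats moved anyway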

LiheYoung commented 1 year ago

Thanks. I get the point now.

First, there is a difference in the loss weights between our UniPerb & FixMatch and yours. Specifically, to fairly maintain the same learning rate across different methods, we always keep the total weight of the unsupervised losses at 1. Therefore, in UniPerb the loss is loss = (loss_x + loss_u_s1 * 0.5 + loss_u_w_fp * 0.5) / 2.0, while in FixMatch the loss is loss = (loss_x + loss_u_s1) / 2.0.

Second, I agree that the forward Dropout only affects the BatchNorm statistics of the decoder. But intuitively, the dropped features should not affect the BN in a positive way. Anyway, since the setting with 92 labeled images is relatively unstable and may be affected by many factors, could you give results on other splits, such as 366 and 1464?

BBBBchan commented 1 year ago

Thanks. As for the first point, I do not think the loss weight would have such a huge impact. In fact, the loss coefficients of our UniPerb and FixMatch are consistent (loss = (loss_x + loss_u_s1 * 0.25) / 2.0). That said, we will use the suggested loss weights in future experiments.

For the second point, I will run experiments on the 366 and 1464 splits to verify whether this phenomenon persists.

Besides, we conducted a "radical" experiment, changing the drop ratio from 0.5 to 1.0. That means concatenating all-zero feature maps with the original features. This setting gives 66 mIoU. Although not as good as the 73.9 obtained with 0.5, it still improves over FixMatch (63.3).
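For clarity, here is a rough sketch of this perturbation (a hypothetical helper, not the repository code; it assumes the FP branch concatenates a perturbed copy of the features with the clean copy along the batch dimension before the decoder):

import torch
import torch.nn as nn

def fp_forward(decoder, feats, drop_p=0.5):
    # Hypothetical helper: run the decoder on the clean features and on a
    # perturbed copy within one batch.
    if drop_p >= 1.0:
        perturbed = torch.zeros_like(feats)        # the "radical" setting: every channel zeroed
    else:
        perturbed = nn.Dropout2d(p=drop_p)(feats)  # the default setting: ~half the channels zeroed
    both = torch.cat((feats, perturbed))           # concatenated along the batch dimension
    pred_clean, pred_fp = decoder(both).chunk(2)
    return pred_clean, pred_fp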

LiheYoung commented 1 year ago

Hi,

Sorry for the late reply. The radical experiment is quite counter-intuitive. Are there any updates on the results for the other splits?

BBBBchan commented 1 year ago

Hi. I have tried the different loss weights. Although performance improves somewhat (63.3 -> 64.0, 73.3 -> 73.8), the same phenomenon is still observed (73.6). It seems the forward pass of the drop channel is what is critical; performance similar to UniPerb can be obtained without any backward pass through it.

As for the other splits, the phenomenon is still present on the 183 split but is not noticeable on the 1464 split. It seems to weaken as the amount of labeled data increases.

LiheYoung commented 1 year ago

Thank you very much for your results.

Given your results, I can now offer a potential explanation for this counter-intuitive phenomenon.

The forward pass of Dropout (whether the drop rate is 50% or 100%) only affects the BN statistics, and in a negative way; the dropped feature maps therefore serve as a strong perturbation of those statistics. In extremely low-data regimes, such as only 92 or 183 labeled images, this perturbation is helpful because it alleviates overfitting. However, as you observed, in high-data regimes such as 1464 labeled images, this BN perturbation is too strong and disturbs the normal learning process.
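If this explanation holds, the mechanism could in principle be isolated from the FP loss entirely. A hedged sketch (purely illustrative, not part of UniMatch; the decoder and feature names are assumptions): perturb the decoder's BN statistics with an extra training-mode forward pass over dropped features each iteration, with no loss attached.

import torch
import torch.nn as nn

def perturb_bn_stats(decoder, feats, drop_p=0.5):
    # Hypothetical regularizer based on the explanation above: forward a
    # channel-dropped copy of the features through the decoder in train mode,
    # solely so the BatchNorm running statistics see the perturbed activations.
    # No loss is computed and no gradients flow.
    with torch.no_grad():
        decoder(nn.Dropout2d(p=drop_p)(feats))

# Sketch of usage inside a FixMatch-style loop (names are assumptions):
#   pred_u_s1 = model(img_u_s1)
#   perturb_bn_stats(model.decoder, encoder_feats)   # only jitters the BN statistics
#   loss = (loss_x + loss_u_s1) / 2.0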