Britefury / cutmix-semisup-seg

Semi-supervised semantic segmentation needs strong, varied perturbations
MIT License

Small concerns about the experiments in Table 4 #5

Open PkuRainBow opened 3 years ago

PkuRainBow commented 3 years ago

Really nice work and super impressive results in the low-data regime (shown in Table 4, also pasted below)!

[Screenshot of Table 4: mIoU of the baselines and the proposed method across label subsets]

We have some small concerns about the results marked with red circles:

  1. In the 1/100 subset column, the baselines of DeepLabv3+/PSPNet reach only 37.95%/36.69%, while your method achieves 59.52%/67.20% respectively. We would really appreciate it if you could provide a detailed explanation of why your method achieves such huge gains! One of our small concerns is that applying the CutMix scheme + Mean-Teacher scheme to the supervised baseline method, without using unlabeled images, might be a more reasonable baseline setting (see the sketch after this list). It would be great if you could share some results for such a setting.

  2. In the 1/8 subset column, the baseline performance of DeepLabv3+ is slightly worse than that of PSPNet. In our experience, DeepLabv3+ should perform much better. Could you share an explanation for this?

  3. In the full-set column, we observe that the performance of the proposed method is slightly worse than the baseline. Could you share your comments on the possible reasons?

  4. According to your code, all the experiments fix the BN statistics and use a 321x321 crop size on a single GPU. Have you trained, or do you plan to train, your method in a stronger setting such as a 512x512 crop size + SyncBN + 8x V100 GPUs? MMSegmentation or openseg.pytorch might be a good candidate codebase.
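For concreteness, here is a minimal sketch of the supervised-only CutMix baseline we have in mind, in plain PyTorch; `rand_cutmix_box` and `supervised_cutmix` are hypothetical names for illustration, not functions from this repository:

```python
import torch


def rand_cutmix_box(h, w, area_ratio=0.5):
    # Hypothetical helper: sample a random rectangle covering
    # roughly `area_ratio` of an h x w image.
    cut_h = int(h * area_ratio ** 0.5)
    cut_w = int(w * area_ratio ** 0.5)
    cy = torch.randint(0, h - cut_h + 1, (1,)).item()
    cx = torch.randint(0, w - cut_w + 1, (1,)).item()
    return cy, cx, cut_h, cut_w


def supervised_cutmix(images, labels):
    """Mix pairs of *labelled* images and their ground-truth masks.

    images: (B, C, H, W) float tensor; labels: (B, H, W) long tensor.
    Each sample is mixed with another sample from the same batch, so
    no unlabeled data is involved -- the baseline setting we propose.
    """
    b, _, h, w = images.shape
    perm = torch.randperm(b)
    cy, cx, ch, cw = rand_cutmix_box(h, w)
    mixed_images = images.clone()
    mixed_labels = labels.clone()
    mixed_images[:, :, cy:cy + ch, cx:cx + cw] = \
        images[perm][:, :, cy:cy + ch, cx:cx + cw]
    mixed_labels[:, cy:cy + ch, cx:cx + cw] = \
        labels[perm][:, cy:cy + ch, cx:cx + cw]
    return mixed_images, mixed_labels
```

Training on `supervised_cutmix(images, labels)` with the usual cross-entropy would isolate the contribution of the augmentation itself from that of the unlabeled data.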

Great thanks for your valuable time; we look forward to your explanation!

Britefury commented 3 years ago

Hi, thanks for your interest in our project! I'll do my best to answer your questions.

  1. a. I speculate that we achieve such large gains in low-data regimes because the baseline experiments overfit significantly with so little training data. In fact, we often note that the maximum score is achieved within the first 1000 iterations and drops after that, indicating overfitting. We use the same total number of iterations for all of our experiments for consistency; perhaps that's not the best way of doing things.

     b. Our supervised baseline experiments do in fact use mean teacher; we just skip the unsupervised loss, so the setup is as close as possible to the semi-supervised experiments (see the sketch after this list). Perhaps not using mean teacher could be better.

  2. Unfortunately, I don't know.

  3. On occasion, we get a slightly lower result when applying semi-supervised learning in a fully supervised setting. If ground-truth labels are available for all samples, it would seem that the unsupervised loss term does not help at all.

  4. a. I seem to recall that letting BN update its statistics reduced performance a bit, hence fixing the BN stats (see the second sketch after this list). As for the crop size, we decided to stick with the scheme of Hung18 and Mittal19 in order to provide an apples-to-apples comparison. Other crop sizes could indeed improve performance.

     b. We don't have any plans to do any multi-GPU training; we would need to modify our code a fair bit for that.
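For reference, here is a rough sketch of what the supervised baseline step amounts to, assuming a standard EMA mean teacher; the names are illustrative rather than our exact code:

```python
import copy

import torch
import torch.nn.functional as F


def make_teacher(student):
    # The teacher starts as a frozen copy of the student and is only
    # ever updated via the EMA below, never by gradients.
    teacher = copy.deepcopy(student)
    for p in teacher.parameters():
        p.requires_grad_(False)
    return teacher


@torch.no_grad()
def ema_update(teacher, student, alpha=0.99):
    # Exponential moving average of the student weights.
    for t, s in zip(teacher.parameters(), student.parameters()):
        t.mul_(alpha).add_(s, alpha=1.0 - alpha)


def supervised_baseline_step(student, teacher, optimizer, images, labels):
    # The mean-teacher machinery is kept, but the consistency /
    # unsupervised loss term is simply omitted: only the supervised
    # cross-entropy on labelled pixels is used.
    logits = student(images)
    loss = F.cross_entropy(logits, labels, ignore_index=255)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    ema_update(teacher, student)  # teacher still tracks the student
    return loss.item()
```

In the semi-supervised runs, a consistency term between student and teacher predictions on the unlabeled inputs is added on top of this, and everything else stays the same.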
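Regarding the fixed BN statistics, the usual PyTorch pattern is to switch the BatchNorm layers back to eval mode after every `model.train()` call; a minimal sketch (not our exact code):

```python
import torch.nn as nn


def freeze_bn_stats(model):
    # Keep the BatchNorm running mean/variance fixed (e.g. at their
    # ImageNet-pretrained values) while the rest of the network trains.
    for m in model.modules():
        if isinstance(m, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
            m.eval()


# model.train() flips BN back to training mode, so re-apply it
# at the start of every iteration:
#   model.train()
#   freeze_bn_stats(model)
```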

I hope that helps.

PkuRainBow commented 3 years ago

Great thanks for your detailed explanation!