Britefury / cutmix-semisup-seg

Semi-supervised semantic segmentation needs strong, varied perturbations
MIT License

Question about the cons weight. #10

Closed CuberrChen closed 2 years ago

CuberrChen commented 2 years ago

Hi. In mean teacher the consistency weight is 100, but in this work the consistency weight is 1. Isn't this value too small? Can you tell me the details of how you set this parameter? I ask because I have seen other work (such as CPS, CVPR 2021) that also uses a consistency weight of around 100 when reproducing the mean-teacher method.

Looking forward to your help.

best,

Britefury commented 2 years ago

Hi,

There are a few factors that I can think of. For a start, the original mean teacher averages the MSE consistency loss over the class dimension:

https://github.com/CuriousAI/mean-teacher/blob/546348ff863c998c26be4339021425df973b4a36/pytorch/mean_teacher/losses.py#L27

Note that size_average=False makes F.mse_loss compute the sum of the squared errors, which is then divided by num_classes. Then:

https://github.com/CuriousAI/mean-teacher/blob/546348ff863c998c26be4339021425df973b4a36/pytorch/main.py#L263

in which the consistency loss is divided by the mini-batch size, so overall they compute the mean of the MSE loss over all dimensions.

Given that the mean teacher paper stated (as I recall) that you have to scale the consistency loss weight with the number of classes, we figured we would sum over the class dimension instead and use the same consistency weight throughout. To compare their weight with ours on a 10-class dataset, you therefore divide their loss weight by 10.
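To make the difference concrete, here is a minimal sketch of the two normalizations (my own illustration, not code from either repository; function names are mine and logits are assumed to have shape (N, C) as in the CIFAR example):

```python
import torch
import torch.nn.functional as F

def mse_mean_over_classes(student_logits, teacher_logits):
    # Mean-teacher style: sum of squared errors divided by num_classes
    # (losses.py), then divided by the mini-batch size (main.py),
    # i.e. a mean over every dimension.
    num_classes = student_logits.shape[1]
    batch_size = student_logits.shape[0]
    s = F.softmax(student_logits, dim=1)
    t = F.softmax(teacher_logits, dim=1)
    return F.mse_loss(s, t, reduction='sum') / num_classes / batch_size

def mse_sum_over_classes(student_logits, teacher_logits):
    # Variant described above: sum over the class dimension, average over
    # samples, so the same weight can be used regardless of num_classes.
    s = F.softmax(student_logits, dim=1)
    t = F.softmax(teacher_logits, dim=1)
    return ((s - t) ** 2).sum(dim=1).mean()
```

With the summed variant the loss is num_classes times larger for the same predictions, which is why mean teacher's weight of 100 corresponds to roughly 10 in the summed formulation on a 10-class dataset.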

Now take a look at:

https://github.com/CuriousAI/mean-teacher/blob/546348ff863c998c26be4339021425df973b4a36/pytorch/experiments/cifar10_test.py#L35

They draw a batch of 128 unsupervised samples for each batch of 31 supervised samples, so a ratio of roughly 4:1. We, on the other hand, use a ratio of 1:1. Consider that when training ImageNet networks with large batch sizes (e.g. 1024 or more), people tend to scale the learning rate linearly with the batch size; by the same logic, you would need a roughly 4x higher consistency loss weight when using 4x as many unsupervised samples per batch. This accounts for a further 4x difference.

Accounting for these differences, their consistency loss is 'in effect' about 2.5x higher than ours. From the results of my parameter sweeps, I don't recall this making a huge difference.
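For what it's worth, the back-of-the-envelope bookkeeping behind that 2.5x figure (my own arithmetic, under the assumptions above):

```python
# Rough bookkeeping of the 'effective' weight difference described above.
mt_weight = 100.0      # consistency weight used by mean teacher
class_factor = 10.0    # mean teacher averages over 10 classes; we sum
batch_factor = 4.0     # they use ~4x more unsupervised samples per batch
our_weight = 1.0       # weight used in this repository

effective_mt_weight = mt_weight / class_factor / batch_factor
print(effective_mt_weight / our_weight)  # -> 2.5, i.e. ~2.5x higher 'in effect'
```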

I hope this helps.

Kind regards

Geoff


CuberrChen commented 2 years ago

Thank you very much for your detailed reply!

I ask because when I used a large consistency loss weight, mean teacher did not work properly and instead degraded segmentation performance.

Your answer is very helpful for me to understand it.

Best,

Britefury commented 2 years ago

Glad I could help. It's often worth doing a manual sweep over these hyper-parameters to find the best value. For loss weights, I sweep on an exponential scale, trying for example: 0.01, 0.03, 0.1, 0.3, 1.0, 3.0, 10.0, 30.0, 100.0. The steps are close enough not to miss an optimal value while still covering a wide range. That's pretty much what I did to find the values we use here.
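If it's useful, here is one way to generate that kind of exponential sweep programmatically (just a sketch; the variable names are mine):

```python
# Generate loss-weight candidates on a roughly half-log-decade grid,
# matching the sweep values listed above.
sweep = []
for exponent in range(-2, 3):          # 10^-2 .. 10^2
    for multiplier in (1.0, 3.0):
        value = multiplier * 10.0 ** exponent
        if value <= 100.0:
            sweep.append(round(value, 2))

print(sweep)  # [0.01, 0.03, 0.1, 0.3, 1.0, 3.0, 10.0, 30.0, 100.0]
```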

CuberrChen commented 2 years ago

Thanks.