bermanmaxim / LovaszSoftmax

Code for the Lovász-Softmax loss (CVPR 2018)
http://bmax.im/LovaszSoftmax
MIT License

Failing to improve the performance when training the model from scratch #5

Closed PkuRainBow closed 6 years ago

PkuRainBow commented 6 years ago

Really interesting work. I have a DeepLabv3 baseline trained with the softmax cross-entropy loss that achieves mIoU=76.7 on Cityscapes.

I simply replaced the cross-entropy loss with your proposed loss and trained the model with the same learning rate and weight decay, but I only achieve mIoU=64.7.

Could you give me some hints?

I also notice that you do not train ENet from scratch either; you just fine-tune the models.

Besides, I also ran a small experiment training the model with both the cross-entropy loss and your proposed loss, which achieves good performance: mIoU=78.4.

It would be great if you could share your advice!

bermanmaxim commented 6 years ago

Hi @PkuRainBow, thanks a lot for your interest! I am happy to hear about your result when combining the two losses.

I believe what you observe is mainly due to optimization. Our loss is particularly well adapted to fine-tuning, as detailed in the FAQ in the main README. Note that our VOC-DeepLab experiments are also a form of fine-tuning, since they are initialized from the authors' MS-COCO weights.

Fine-tuning also has a computational advantage, since our loss is slower to compute (O(p log p) complexity), although a dedicated CUDA kernel would likely speed up our current implementation significantly.

Combining the two losses can also steer the learning process. Besides simply adding the losses, you could also take a weighted sum of the two and decrease the weight of the cross-entropy loss throughout the optimization.
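
For reference, a rough sketch of such a combination in PyTorch (not an official recipe; it assumes the `lovasz_softmax` function from this repository's PyTorch code, and the default weights and the ignore index of 255 are only illustrative):

```python
import torch.nn.functional as F

from lovasz_losses import lovasz_softmax  # this repository's PyTorch implementation


def combined_loss(logits, labels, ce_weight=1.0, lovasz_weight=1.0, ignore=255):
    """Weighted sum of cross-entropy and Lovasz-Softmax.

    logits: [B, C, H, W] raw network outputs; labels: [B, H, W] integer class ids.
    """
    ce = F.cross_entropy(logits, labels, ignore_index=ignore)
    lovasz = lovasz_softmax(F.softmax(logits, dim=1), labels, ignore=ignore)
    return ce_weight * ce + lovasz_weight * lovasz
```

With both weights at 1 this reduces to the simple sum; decreasing the cross-entropy weight over training gives the "soft fine-tuning" effect mentioned above.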

Another optimization-related aspect: while we found that keeping the same learning rate was generally good enough, I assume that doing more hyperparameter tuning would be beneficial.

Besides optimization, one other possible negative effect is that, for smaller batches, our loss optimizes something closer to image-IoU than to dataset-IoU, as we discuss in Section 3.1, and this can lead to a decrease of the dataset-IoU in the end. Combining with cross-entropy can also help here, preventing the model from specializing too closely to image-IoU.
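
As an illustration only (a sketch using the `per_image` flag of the PyTorch `lovasz_softmax` in this repository; the dummy tensors are placeholders): `per_image=False` builds a single surrogate over all pixels of the batch, while `per_image=True` averages one surrogate per image.

```python
import torch
import torch.nn.functional as F

from lovasz_losses import lovasz_softmax  # this repository's PyTorch implementation

# Dummy batch: 4 images, 19 classes (e.g. Cityscapes), 32x32 crops.
logits = torch.randn(4, 19, 32, 32, requires_grad=True)
labels = torch.randint(0, 19, (4, 32, 32))

probas = F.softmax(logits, dim=1)
loss_batch = lovasz_softmax(probas, labels, per_image=False)  # one surrogate over the whole batch
loss_image = lovasz_softmax(probas, labels, per_image=True)   # mean of per-image surrogates
```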

Other approaches to tackling these limitations and the optimization are left to future work.

PkuRainBow commented 6 years ago

@bermanmaxim Thanks for your help.

I will first try to fine-tune the models with the Lovász-Softmax loss. As for the other approaches, searching for better hyperparameters is a little expensive.

As for the combination of the two loss functions, I simply chose a 1:1 weighting. Maybe other combinations can further improve the performance.

Thanks as well for your other advice.

bermanmaxim commented 6 years ago

You're welcome, my pleasure! For now I'll keep this issue open for visibility.

alexander-rakhlin commented 6 years ago

Hi Maxim,

I too noticed that Lovász-Softmax works better in combination with cross-entropy. I used weights of 0.5 and 0.9 for Lovász-Softmax in the weighted sum, and as long as the Lovász-Softmax weight was < 1.0, the difference between 0.5 and 0.9 was not evident.

You say this is due to more local minima in Lovász-Softmax. Is this an intuition, or is there a theory behind it?

Thank you.

PkuRainBow commented 6 years ago

@alexander-rakhlin Could you share your results with the different weight combinations? I only tried 1:1.

PkuRainBow commented 6 years ago

Here I share the single-crop results on the validation set of Cityscapes.

baseline with softmax loss:

 {'IU_array': array([0.98016613, 0.84412102, 0.9267689 , 0.62228906, 0.61643986,
        0.63782245, 0.69795884, 0.78631591, 0.92559306, 0.66367732,
        0.94683498, 0.82639853, 0.65797452, 0.95216736, 0.8073407 ,
        0.85389642, 0.63814378, 0.68186497, 0.77889023])}

baseline with both softmax loss and Lovász-Softmax loss (1:1):

{ 'IU_array': array([0.97955327, 0.84097007, 0.92471465, 0.52840852, 0.62658249,
       0.65854956, 0.71770889, 0.8115076 , 0.92476787, 0.65601253,
       0.94814198, 0.83470563, 0.6706206 , 0.95355899, 0.8312481 ,
       0.88661456, 0.71990126, 0.70689474, 0.78928316])}

We find that the improvements on some classes are very obvious, but the performance on some other classes also drops by a large margin. Very interesting.

So I will try to fine-tune the baseline model with only the Lovász loss and report the results later.

PkuRainBow commented 6 years ago

@bermanmaxim Could you explain what you mean by "Another optimization-related aspect: while we found that keeping the same learning rate was generally good enough, I assume that doing more hyperparameter tuning would be beneficial"?

I checked your paper, and I guess it means that we simply replace the softmax loss with the Lovász-Softmax loss and train the models with the same settings as before.

alexander-rakhlin commented 6 years ago

@PkuRainBow I tried 0.5 and 0.9 (1:1 and 9:1), and in my task the results seemed insensitive to the combination weights.

bermanmaxim commented 6 years ago

Hi @PkuRainBow, @alexander-rakhlin,

Thanks again for your interest; I'm looking forward to seeing this kind of contribution on finding good ways to train and combine our loss. At this point these questions are mostly experimental and based on intuition rather than theoretically founded.

@PkuRainBow your interpretation of that sentence is correct: I mean that we mostly kept the training parameters, but there could be more gains to be made with more hyperparameter exploration.

When I mentioned combining with weights, I was referring to the possibility of a dynamic weighting, for instance lambda * cross-entropy + (1 - lambda) * Lovász-Softmax; by decreasing lambda from 1 to 0 across the epochs of the optimization, you could likely benefit from both the combination and the "soft fine-tuning" aspects of the loss.
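
A minimal sketch of that schedule (the linear decay, the helper name, and the ignore index of 255 are only illustrative assumptions; it again uses `lovasz_softmax` from this repository's PyTorch code):

```python
import torch.nn.functional as F

from lovasz_losses import lovasz_softmax  # this repository's PyTorch implementation


def scheduled_loss(logits, labels, epoch, total_epochs, ignore=255):
    """lambda * cross-entropy + (1 - lambda) * Lovasz-Softmax, with lambda decayed from 1 to 0."""
    lam = max(0.0, 1.0 - epoch / float(total_epochs))  # 1 at the start, 0 at the end
    ce = F.cross_entropy(logits, labels, ignore_index=ignore)
    lovasz = lovasz_softmax(F.softmax(logits, dim=1), labels, ignore=ignore)
    return lam * ce + (1.0 - lam) * lovasz
```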

PkuRainBow commented 6 years ago

@bermanmaxim Sorry to inform you that fine-tuning the cross-entropy-based model harms the performance in my current experiments.

bermanmaxim commented 6 years ago

@PkuRainBow it can be due to a combination of various factors, for instance optimization questions or smaller batch sizes, as underlined in the last paragraph of my first comment:

Besides optimization, one other possible negative effect is that, for smaller batches, our loss optimizes something closer to image-IoU than to dataset-IoU, as we discuss in Section 3.1, and this can lead to a decrease of the dataset-IoU in the end. Combining with cross-entropy can also help here, preventing the model from specializing too closely to image-IoU.

I will close this thread as it is no longer an issue but rather an extended scientific discussion. Happy to see the loss being used and bringing improvements, at least in combination for easier optimization.

CoinCheung commented 3 years ago

@bermanmaxim

Hi, is the key to using lovasz_softmax to first fully train the model with the normal cross-entropy loss and then fine-tune it with lovasz_softmax? Or should we roughly train the model with normal cross-entropy and mainly rely on lovasz_softmax to make the model converge better?