Hi @PkuRainBow, thanks a lot for your interest! I am happy to hear about your result when combining the two losses.
I believe what you observe is mainly due to optimization. Our loss is particularly well suited to fine-tuning, as detailed in the FAQ in the main readme. Note that our VOC-DeepLab experiments are also a form of fine-tuning since they are initialized from the MS-COCO weights of the authors.
Fine-tuning also has a computational advantage, since our loss is slower to compute (O(p log p) complexity) - although a dedicated CUDA kernel would likely speed up our current implementation significantly.
Combining the two losses can also steer the learning process. Besides simply adding the losses, you could take a weighted sum of the two and decrease the weight of the cross-entropy loss over the course of the optimization.
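For illustration, here is a minimal sketch of such a weighted sum, assuming the lovasz_softmax(probas, labels, ignore=...) helper from the PyTorch code in this repository; the wrapper name, the default weights, and the ignore_index=255 convention below are only illustrative:

```python
import torch.nn.functional as F
from lovasz_losses import lovasz_softmax  # assumed: PyTorch helper shipped with this repository

def combined_loss(logits, labels, ce_weight=1.0, lovasz_weight=1.0, ignore_index=255):
    """Weighted sum of cross-entropy and Lovasz-Softmax computed on the same logits."""
    ce = F.cross_entropy(logits, labels, ignore_index=ignore_index)
    # lovasz_softmax expects class probabilities, hence the explicit softmax over the class dim
    lv = lovasz_softmax(F.softmax(logits, dim=1), labels, ignore=ignore_index)
    return ce_weight * ce + lovasz_weight * lv
```

Decreasing ce_weight (or increasing lovasz_weight) as training progresses would give the annealing behaviour mentioned above.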
Another optimization-related aspect: while we found that keeping the same learning rate was generally good enough, I assume that more hyperparameter tuning would be beneficial.
Besides optimization, another possible negative effect is that for smaller batches our loss optimizes something closer to image-IoU than dataset-IoU, as we discuss in section 3.1, and this can lead to a decrease of the dataset-IoU in the end. Combining with cross-entropy can also help here, preventing the model from specializing too closely to image-IoU.
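To make the image-IoU / dataset-IoU distinction concrete, here is a small sketch using the standard definitions (written for this discussion, not code from the repository): image-IoU averages per-image scores, while dataset-IoU aggregates intersections and unions over the whole dataset before dividing; a batch-level surrogate computed on small batches behaves more like the former.

```python
import numpy as np

def per_class_counts(pred, label, num_classes):
    """Per-class intersection and union pixel counts for a single image."""
    inter = np.zeros(num_classes)
    union = np.zeros(num_classes)
    for c in range(num_classes):
        p, l = (pred == c), (label == c)
        inter[c] = np.logical_and(p, l).sum()
        union[c] = np.logical_or(p, l).sum()
    return inter, union

def image_iou(preds, labels, num_classes):
    """image-IoU: IoU computed per image (over classes present), then averaged over images."""
    scores = []
    for pred, label in zip(preds, labels):
        inter, union = per_class_counts(pred, label, num_classes)
        valid = union > 0
        scores.append((inter[valid] / union[valid]).mean())
    return float(np.mean(scores))

def dataset_iou(preds, labels, num_classes):
    """dataset-IoU: intersections and unions summed over all images before dividing."""
    inter, union = np.zeros(num_classes), np.zeros(num_classes)
    for pred, label in zip(preds, labels):
        i, u = per_class_counts(pred, label, num_classes)
        inter += i
        union += u
    valid = union > 0
    return float((inter[valid] / union[valid]).mean())
```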
Other approaches to tackle these limitations and the optimization are left to future work.
@bermanmaxim Thanks for your help.
I will first try to fine-tune the models with the Lovasz-Softmax loss. As for the other approaches, it is a bit expensive to search for better hyperparameters.
As for the combination of the two loss functions, I simply chose a 1:1 weighting. Other combinations may further improve the performance.
Thanks also for your other advice.
You're welcome, my pleasure! For now I'll keep this issue open for visibility.
Hi Maxim,
I too noticed that Lovasz Softmax works better in combination with cross-entropy. I used 0.5 and 0.9 weights for Lovasz Softmax in the weighted sum, and as long as the Lovasz Softmax weight was <1.0 the difference between 0.5 and 0.9 was not evident.
You say it's due to more local minima in Lovasz Softmax. Is this intuition, or is there a theory behind it?
Thank you.
@alexander-rakhlin Could you share your results with different weight combinations? I only tried 1:1.
Here I share the single-crop results on the validation set of Cityscapes.
baseline with softmax loss:
{'IU_array': array([0.98016613, 0.84412102, 0.9267689 , 0.62228906, 0.61643986,
0.63782245, 0.69795884, 0.78631591, 0.92559306, 0.66367732,
0.94683498, 0.82639853, 0.65797452, 0.95216736, 0.8073407 ,
0.85389642, 0.63814378, 0.68186497, 0.77889023])}
baseline with both softmax loss and lovasz softmax loss (1:1):
{'IU_array': array([0.97955327, 0.84097007, 0.92471465, 0.52840852, 0.62658249,
0.65854956, 0.71770889, 0.8115076 , 0.92476787, 0.65601253,
0.94814198, 0.83470563, 0.6706206 , 0.95355899, 0.8312481 ,
0.88661456, 0.71990126, 0.70689474, 0.78928316])}
We find that the improvements on some classes are very obvious, but the performance on some other classes drops considerably. Very interesting.
So I will try to fine-tune the baseline model with only the Lovasz loss and report the results later.
@bermanmaxim Could you explain what you meant by "Another optimization-related aspect: while we found that keeping the same learning rate was generally good enough, I assume that more hyperparameter tuning would be beneficial."?
I checked your paper and guessed that it means we just replace the softmax loss with the lovasz-softmax loss and train the models with the same settings as before.
@PkuRainBow I tried 0.5 and 0.9 (1:1 and 9:1), and in my task the results seemed insensitive to the combination weights.
Hi @PkuRainBow, @alexander-rakhlin,
Thanks again for your interest, I'm looking forward to seeing these kinds of contributions on finding good ways to train with and combine our loss. At this point these questions are mostly experimental and based on intuition rather than theoretically founded.
@PkuRainBow your interpretation of that sentence is correct: we mostly kept the training parameters, but there could be more gains to be made with further hyperparameter exploration.
When I mentioned combining weights, I was referring to the possibility of a dynamic weighting, for instance lambda * cross-entropy + (1 - lambda) * lovasz-softmax; by changing lambda from 1 to 0 across the epochs of the optimization you could likely benefit from both the combination and the "soft fine-tuning" aspects of the loss.
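For instance, a linear schedule could look like the sketch below (again assuming the lovasz_softmax helper from the PyTorch code in this repository; the linear, epoch-based decay is just one possible choice):

```python
import torch.nn.functional as F
from lovasz_losses import lovasz_softmax  # assumed: PyTorch helper shipped with this repository

def scheduled_loss(logits, labels, epoch, num_epochs, ignore_index=255):
    """lambda * cross-entropy + (1 - lambda) * Lovasz-Softmax,
    with lambda decayed linearly from 1 to 0 across the epochs."""
    lam = 1.0 - epoch / float(max(1, num_epochs - 1))
    lam = min(1.0, max(0.0, lam))  # clamp in case epoch runs past num_epochs
    ce = F.cross_entropy(logits, labels, ignore_index=ignore_index)
    lv = lovasz_softmax(F.softmax(logits, dim=1), labels, ignore=ignore_index)
    return lam * ce + (1.0 - lam) * lv
```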
@bermanmaxim Sorry to inform you that, according to my current experiments, fine-tuning the cross-entropy-based model harms the performance.
@PkuRainBow it can be due to a combination of various factors, for instance optimization issues or smaller batch sizes, as underlined in the last paragraph of my first comment:
Besides optimization, another possible negative effect is that for smaller batches our loss optimizes something closer to image-IoU than dataset-IoU, as we discuss in section 3.1, and this can lead to a decrease of the dataset-IoU in the end. Combining with cross-entropy can also help here, preventing the model from specializing too closely to image-IoU.
I will close this thread as it is no longer an issue but rather an extended scientific discussion. Happy to see the loss being used and bringing improvements - at least in combination for easier optimization.
@bermanmaxim
Hi, is the key to using lovasz_softmax to first fully train the model with the normal cross-entropy loss and then fine-tune it with lovasz_softmax? Or should we roughly train the model with normal cross-entropy and mainly rely on lovasz_softmax to make the model converge better?
Really interesting work. I have a DeepLabv3 baseline with softmax loss that achieves mIoU=76.7 on Cityscapes.
I simply replaced the cross-entropy loss with your proposed loss and trained the models with the same learning rate and weight decay, but I only achieve mIoU=64.7.
Could you give me some hints?
I also notice that you do not train ENet from scratch either; you just fine-tune the models.
Besides, I also conducted a small experiment training the models with both the cross-entropy loss and your proposed loss, which achieves good performance: mIoU=78.4.
It would be great if you could share your advice!