Alibaba-MIIL / ASL

Official PyTorch Implementation of: "Asymmetric Loss For Multi-Label Classification" (ICCV, 2021) paper
MIT License
732 stars · 102 forks

In my dataset, the loss of ASL is very large, but it is normal with other loss functions #22

Closed ghost closed 3 years ago

ghost commented 3 years ago

Hello, thank you very much to you and your team for your contribution in this area. I intend to apply this loss function to my multi-label image classification model (labels only, no bounding-box labels):

loss_function = AsymmetricLoss()
logits = net(images.to(device))
loss = loss_function(logits, labels.to(device))

I haven't changed your ASL loss function at all. At first the loss was 156, and in the end it dropped to 4, with ACC = 0. What's going on? Why does the loss start at more than 100, still sit around 4 after training, and the accuracy stay at zero? When I use BCELoss, everything is perfectly normal:

train loss: 100%[**->]4.9414 [epoch 1] train_loss: 21.409 test_accuracy: 0.000
train loss: 100%[**->]5.7753

mrT23 commented 3 years ago

Our default params for ASL are for highly imbalanced multi-label datasets.

I suggest you gradually try variants of ASL, and make sure the results are logical and consistent (see the sketch after this list):

(1) Start with simple CE, and make sure you reproduce your BCELoss results:
loss_function = AsymmetricLoss(gamma_neg=0, gamma_pos=0, clip=0)

(2) Then try simple focal loss:
loss_function = AsymmetricLoss(gamma_neg=2, gamma_pos=2, clip=0)

(3) Now try ASL:
loss_function = AsymmetricLoss(gamma_neg=2, gamma_pos=1, clip=0)
loss_function = AsymmetricLoss(gamma_neg=4, gamma_pos=1, clip=0.05)

(4) Also try the 'disable_torch_grad_focal_loss' mode; it can stabilize results:
loss_function = AsymmetricLoss(gamma_neg=4, gamma_pos=1, clip=0.05, disable_torch_grad_focal_loss=True)
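
For example, a rough sketch of how you might run this comparison (assuming the repo's losses.py is importable, as in the snippets later in this thread; `train_from_scratch` and `evaluate` are placeholders for your own training and evaluation code, not part of this repo):

```python
from losses import AsymmetricLoss  # import path as used later in this thread

# Candidate loss configurations, from plain CE up to the default ASL setting
configs = {
    "ce":          AsymmetricLoss(gamma_neg=0, gamma_pos=0, clip=0),
    "focal":       AsymmetricLoss(gamma_neg=2, gamma_pos=2, clip=0),
    "asl_mild":    AsymmetricLoss(gamma_neg=2, gamma_pos=1, clip=0),
    "asl_default": AsymmetricLoss(gamma_neg=4, gamma_pos=1, clip=0.05,
                                  disable_torch_grad_focal_loss=True),
}

for name, loss_function in configs.items():
    model = train_from_scratch(loss_function)  # your own training loop (placeholder)
    print(name, evaluate(model))               # compare mAP / accuracy step by step
```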

ghost commented 3 years ago

Hello, thank you for your reply.

I tested with a simple example and found that the BCELoss result can't be reproduced. What's the problem?

import numpy as np
import torch
import torch.nn.functional as F
from losses import AsymmetricLossOptimized, AsymmetricLoss

pred = np.array([[-0.4089, -1.2471, 0.5907],
                 [-0.4897, -0.8267, -0.7349],
                 [0.5241, -0.1246, -0.4751]])
label = np.array([[0, 1, 1],
                  [0, 0, 1],
                  [1, 0, 1]])

pred = torch.from_numpy(pred).float()
label = torch.from_numpy(label).float()

crition1 = torch.nn.BCEWithLogitsLoss()
loss1 = crition1(pred, label)
print(loss1)

crition2 = AsymmetricLoss(gamma_neg=0, gamma_pos=0, clip=0, disable_torch_grad_focal_loss=True)
loss2 = crition2(pred, label)
print(loss2)

crition3 = AsymmetricLossOptimized(gamma_neg=0, gamma_pos=0, clip=0)
loss3 = crition3(pred, label)
print(loss3)

Output:

tensor(0.7193)
tensor(6.4739)
tensor(6.4739)

mrT23 commented 3 years ago

ASL performs sigmoid

BCEWithLogitsLoss does not perform sigmoid

ghost commented 3 years ago

Thank you sincerely for your help; I have solved this problem. In addition, I would like to ask about my multi-label image task (there are nine kinds of tags in total; each picture may have one, two, three, or four kinds of tags, and there is no dependency between these tags). Is this the imbalance described in your paper? Can I use your loss function for this task?

ghost commented 3 years ago

Sincerely, thank you for taking time out of your busy work to answer this question. I am a deep-learning beginner. In your article you write: "In typical multi label datasets, each picture contains only a few positive labels, and many negative ones." In my multi-label classification dataset there are ten kinds of tags in total, and each picture may have one, two, three, or four kinds of tags. Even though this may not be that extreme, does it still belong to the situation mentioned in your article, and can I use ASL?

mrT23 commented 3 years ago

I am not sure. My best advice would be "try and see".

The datasets that we used in the article are probably larger than yours. However, the loss function is one of the critical components in deep learning, and you would do wisely to try and find the best one for your problem.

This is an integral part of the way experienced deep learning practitioners reach top results - they test many things and look for the "big money". A proper loss can be one of those things, although your specific problem might indeed not be the best candidate for ASL.

ghost commented 3 years ago

OK, thank you for your help

davidas1 commented 3 years ago

ASL performs sigmoid

BCEWithLogitsLoss does not perform sigmoid

Thought it would be good to clarify something, as this issue is linked in the repo's README: both loss functions mentioned above perform sigmoid internally. The difference between the results is due to a different reduction - BCEWithLogitsLoss does mean reduction by default, while ASL always returns the sum.
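
For example, on the numbers posted above, the values line up once the reduction is matched (a quick self-contained sketch, using the same `losses` import as earlier in this thread):

```python
import numpy as np
import torch
from losses import AsymmetricLoss

pred = torch.from_numpy(np.array([[-0.4089, -1.2471,  0.5907],
                                  [-0.4897, -0.8267, -0.7349],
                                  [ 0.5241, -0.1246, -0.4751]])).float()
label = torch.from_numpy(np.array([[0., 1., 1.],
                                   [0., 0., 1.],
                                   [1., 0., 1.]])).float()

# Default mean reduction: tensor(0.7193)
print(torch.nn.BCEWithLogitsLoss()(pred, label))
# Sum reduction: tensor(6.4739), i.e. 0.7193 * 9 elements
print(torch.nn.BCEWithLogitsLoss(reduction='sum')(pred, label))
# ASL with gamma_neg=gamma_pos=0 and clip=0 also sums: tensor(6.4739)
print(AsymmetricLoss(gamma_neg=0, gamma_pos=0, clip=0)(pred, label))
```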

@mrT23 - Do you have any intuition about why you sum the loss instead of averaging? This should make the loss (and other hyperparameters like the learning rate) dependent on the batch size and the number of classes. I'm trying ASL on a multi-task multi-label problem (training multiple heads, each with its own loss) and thinking about the best way to reduce the losses from the different heads.

mrT23 commented 3 years ago

@davidas1 I was bothered by this question for ~1 year (for other losses as well) until I realized the following truth - in the Adam optimizer, it does not matter whether we sum or average!

You can understand this just by looking at the Adam update rule: https://towardsdatascience.com/adam-latest-trends-in-deep-learning-optimization-6be9a291375c
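
For reference, here is the standard Adam update written out (my paraphrase of the usual formula, with $g_t$ the gradient):

$$m_t = \beta_1 m_{t-1} + (1-\beta_1)\,g_t, \qquad v_t = \beta_2 v_{t-1} + (1-\beta_2)\,g_t^2$$

$$\theta_{t+1} = \theta_t - \eta\,\frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$

where $\hat{m}_t$ and $\hat{v}_t$ are the bias-corrected moments. Scaling the loss by a constant $c$ scales $g_t$, $\hat{m}_t$ and $\sqrt{\hat{v}_t}$ all by $c$, so the $c$ cancels in the step (up to $\epsilon$).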

If you are still pondering it, I can explain further.

mrT23 commented 3 years ago

Since in the Adam optimizer you divide the gradient by its standard deviation, the actual update does not change if you multiply or divide the loss by a constant factor (sum vs. avg).
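
If you want to convince yourself numerically, here is a quick standalone check (a minimal sketch, not part of this repo): for a fixed batch size, sum vs. mean is just a constant factor on the loss, and with Adam the resulting parameter update is essentially unchanged.

```python
import torch

torch.manual_seed(0)
x = torch.randn(64, 10)
y = (torch.rand(64, 3) > 0.5).float()

def one_adam_step(scale):
    torch.manual_seed(0)                       # identical init for both runs
    head = torch.nn.Linear(10, 3)
    opt = torch.optim.Adam(head.parameters(), lr=1e-3)
    # Scaling the (mean-reduced) loss by a constant mimics switching to sum reduction
    loss = scale * torch.nn.functional.binary_cross_entropy_with_logits(head(x), y)
    loss.backward()
    opt.step()
    return head.weight.detach().clone()

w_mean = one_adam_step(scale=1.0)     # "mean"-style loss
w_sum  = one_adam_step(scale=192.0)   # "sum"-style loss (64 samples x 3 classes)
# The updated weights are numerically the same; only Adam's eps breaks exact equality
print(torch.allclose(w_mean, w_sum, atol=1e-6))  # True
```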

On Mon, Mar 8, 2021 at 2:15 PM bendanzzc notifications@github.com wrote:

Could you further explain why it does not matter if we do sum or average? Thanks a lot

Chen-Song commented 3 years ago

@davidas1 Hi, I see you say "I'm trying ASL on a multi-task multi-label problem (training multiple heads, each with its own loss), and thinking about what is the best way to reduce the losses from the different heads". Is this strategy effective for the multi-task multi-label problem?

csEylLee commented 1 year ago

ASL performs sigmoid

BCEWithLogitsLoss does not perform sigmoid

I think you were wrong - BCEWithLogitsLoss also performs sigmoid. When I set reduction='sum', the output loss of BCEWithLogitsLoss is equal to ASL's.

YUNIyx commented 8 months ago

@mrT23 I have tried gamma_neg=2, gamma_pos=1 and gamma_neg=4, gamma_pos=1. The latter is better, but it is still not as good as the cross-entropy loss function. If I change it to gamma_neg=5 and gamma_pos=1, will it have a better effect?