ModelTC / United-Perception

Where does the AlphAction model use the HOOK mechanism; softmax classification #59

Closed yan-ctrl closed 1 year ago

yan-ctrl commented 1 year ago

Hello, I have also run into a problem with my own EFL implementation, namely how to calculate the gradient manually. After reading your answers to several other issues, where you suggest using a HOOK to collect the gradient, I would like to ask you a few questions:

  1. Generally, where is the gradient collected in the model? Is it at the last classification layer, for both object detection and classification?
  2. For example, where does the AlphAction model use the HOOK mechanism?
  3. Can the gradients of positive and negative samples be calculated with a formula, as in EQLv2 and the paper you mentioned?
  4. For softmax classification, do we only need to pay attention to the gradient at the one-hot positions equal to 1, and can the gradient at the 0 positions be ignored?
waveboo commented 1 year ago

Hi @yan-ctrl, thanks for your questions. For questions 1 & 2: the answer is yes. The gradient is collected at the input of the loss function (equivalently, at the output of the last layer of the model). All models follow this rule, including AlphAction. For question 3, since many issues have asked me for a manual formula for the gradient calculation, I can provide some pseudocode here. Note that the manual calculation and the hook-based calculation are mutually exclusive; you can only choose one of them.

def collect_grad_manual(self, input, target, focus_factor, normalizer):
    prob = torch.sigmoid(input)
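    # pos_grad / neg_grad below are the analytic derivatives of the focal term
    # w.r.t. the logits, evaluated for positive and negative labels respectively;
    # focus_factor plays the role of gamma, and focus_factor / self.focal_gamma
    # is the weighting factor (the alpha weighting is applied afterwards)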
    pos_grad = target * torch.pow((1 - prob), focus_factor) * (focus_factor * prob * torch.log(prob) + prob - 1) * focus_factor / self.focal_gamma
    neg_grad = (target - 1) * torch.pow(prob, focus_factor) * (focus_factor * (1 - prob) * torch.log(1 - prob) - prob) * focus_factor / self.focal_gamma
    if self.focal_alpha >= 0:
        pos_grad = pos_grad * self.focal_alpha
        neg_grad = neg_grad * (1 - self.focal_alpha)
    grad = pos_grad + neg_grad
    grad = torch.abs(grad / normalizer)
    pos_grad = torch.sum(grad * target, dim=0)
    neg_grad = torch.sum(grad * (1 - target), dim=0)
    allreduce(pos_grad)
    allreduce(neg_grad)
    self.pos_grad += pos_grad
    self.neg_grad += neg_grad
    self.pos_neg = torch.clamp(self.pos_grad / (self.neg_grad + 1e-10), min=0, max=1)
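
For questions 1 & 2, here is also a rough sketch of the hook-based alternative (illustrative only; the helper name and buffer arguments are hypothetical, not the repository's actual API). The hook fires during loss.backward(), so the collected statistics come from the true autograd gradients, which is also why the hook and the manual formula above must not be combined.

import torch

def attach_grad_hook(logits, targets, pos_grad_buf, neg_grad_buf):
    # logits: (N, C) output of the last layer (the input of the loss),
    # targets: (N, C) one-hot labels, *_buf: per-class accumulators of shape (C,)
    # call once per iteration, before loss.backward()
    def hook(grad):
        g = torch.abs(grad.detach())
        pos_grad_buf += torch.sum(g * targets, dim=0)
        neg_grad_buf += torch.sum(g * (1 - targets), dim=0)
        return grad  # do not modify the gradient itself
    logits.register_hook(hook)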

For question 4, the softmax loss function is only evaluated at the one-hot position equal to 1, but the gradient is propagated back to all classes because the different categories compete in the softmax. Therefore, the definition of positive and negative gradients still holds. The softmax version of the equalization loss can be found here.
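
As a small standalone illustration of this point (not code from this repository): the gradient of softmax cross-entropy with respect to the logits is softmax(z) - y, so every non-target class receives a negative-sample gradient even though the loss itself only reads the one-hot position.

import torch
import torch.nn.functional as F

# check that d(CE)/d(logits) = softmax(logits) - one_hot(target)
z = torch.randn(1, 5, requires_grad=True)
y = torch.tensor([2])
F.cross_entropy(z, y).backward()
print(z.grad)
print(F.softmax(z.detach(), dim=1) - F.one_hot(y, num_classes=5).float())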

yan-ctrl commented 1 year ago

Thank you very much for your patient answer; it is very helpful to me. The pseudocode is very clear, and I will try to apply it to my own tasks. I have also read the paper you recommended, so I have another question: for the focusing factor of Softmax-EFL, does Gj only use the cumulative gradient of positive samples of each category, and is it necessary to substitute the cumulative positive-sample gradient into the softmax formula to recalculate the prediction probability pj, as in Softmax-EQL?

waveboo commented 1 year ago

@yan-ctrl, EFL (Equalized Focal Loss) is a generalized version of the sigmoid focal loss. We have not yet proposed a softmax version of EFL, but you could try it yourself. Meanwhile, a small tip I can give you: if the rank between categories is crucial in your task (e.g., an image classification task must choose the argmax prediction for each instance), a softmax-like loss function with the cumulative gradient of positive samples or negative samples as the indicator is a better fit. Otherwise (e.g., an object detection task calculates the mAP of each category independently), you could choose a sigmoid-like loss function and use the gradient ratio as the long-tail indicator.
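
A rough sketch of the two indicator choices described above (purely illustrative, not code from the repository; the tensor shapes and the normalization are assumptions):

import torch

num_classes = 80
# accumulated per-class gradient statistics, shape (num_classes,)
pos_grad = torch.rand(num_classes)
neg_grad = torch.rand(num_classes) * 10

# sigmoid-like losses: positive/negative gradient ratio in [0, 1]
pos_neg_ratio = torch.clamp(pos_grad / (neg_grad + 1e-10), min=0, max=1)

# softmax-like losses: the cumulative positive gradient itself, here
# normalized by its maximum so that rare classes map to small values
pos_indicator = pos_grad / (pos_grad.max() + 1e-10)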

yan-ctrl commented 1 year ago

Thank you for your patience. I see.

yan-ctrl commented 1 year ago

Hello, when I apply sigmoid-EFL to yolov5, I replace the classification loss of yolov5 with sigmoid-EFL. The training batch size is 32, lr is 0.01, single-card training on an RTX 3080, with hyperparameters fl_gamma=1.5 and fl_alpha=0.25. My code is shown below. However, when training reaches the second epoch, the classification loss becomes NaN, as shown in the log below.

Training log:

 Epoch   gpu_mem       box       obj       cls    labels  img_size
 0/299     6.45G   0.05136   0.01066  0.004802       430       640: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:03<00:00,  1.02it/s]
val: Scanning '../datasets/coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupted: 100%|██████████████████████████████████████████████████████████████████████| 128/128 [00:10<?, ?it/s]
           Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:04<00:00,  2.43s/it]
             all        128        929      0.662      0.157      0.189     0.0993

 Epoch   gpu_mem       box       obj       cls    labels  img_size
 1/299     8.08G   0.04997  0.009346  0.003714       468       640: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:01<00:00,  3.17it/s]
           Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:12<00:00,  6.18s/it]
             all        128        929      0.454      0.151      0.141     0.0736

 Epoch   gpu_mem       box       obj       cls    labels  img_size
 2/299     8.08G   0.04928  0.008183  0.003707       404       640: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:01<00:00,  3.14it/s]
           Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:12<00:00,  6.29s/it]
             all        128        929      0.498      0.143      0.155     0.0756

 Epoch   gpu_mem       box       obj       cls    labels  img_size
 3/299     8.08G   0.04829  0.009446       nan       440       640: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:01<00:00,  3.26it/s]
           Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:11<00:00,  5.54s/it]
             all        128        929      0.472      0.145      0.157     0.0783

 Epoch   gpu_mem       box       obj       cls    labels  img_size
 4/299     8.08G   0.04913  0.009771       nan       541       640: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:01<00:00,  3.30it/s]
           Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95: 100%|███████████████████████████████

I have tried turning down the batch size and the learning rate, but it has no effect. I also tried adding a small value of 1e-10 inside the torch.log() and torch.pow() operations, but that does not help either. I then enabled torch.autograd.set_detect_anomaly(True) at the beginning of the training code and wrapped the backward pass in with torch.autograd.detect_anomaly():

When scaler.scale(loss).backward() is called, the following errors are reported respectively:

  1. [W python_anomaly_mode.cpp:104] Warning: Error detected in MulBackward0. Traceback of forward call that caused the error: File "/root/yolov5-6.0/utils/loss.py", line 157, in forward: efl = ce_loss * torch.pow((1 - pred_t), ff.detach()) * wf.detach() (function _print_stack) 3/299 8.08G 0.04854 0.01028 0.003171 448 640: 25%|██

  2. File "/root/miniconda3/lib/python3.8/site-packages/torch/autograd/__init__.py", line 145, in backward: Variable._execution_engine.run_backward( RuntimeError: Function 'MulBackward0' returned nan values in its 0th output.

I haven't been able to solve it. Could you take a look at what problems might exist in my code, or give me some suggestions?

class sigmoidEFL(nn.Module):

def __init__(self, loss_fcn, focal_gamma_b=1.5, focal_alpha=0.25, loss_weight=1.0,
             ignore_index=-1, num_classes=80, scale_factor=8.0, indicator='pos_and_neg'):
    super(sigmoidEFL, self).__init__()
    self.loss_fcn = loss_fcn
    self.reduction = loss_fcn.reduction
    self.loss_fcn.reduction = 'none'
    self.num_classes = num_classes
    self.loss_weight = loss_weight
    self.ignore_index = ignore_index

    # cfg for focal loss
    self.focal_gamma = focal_gamma_b
    self.focal_alpha = focal_alpha

    # # ignore bg class and ignore idx
    # self.num_classes = num_classes - 1

    # cfg for efl loss
    self.scale_factor = scale_factor

    assert indicator in ['pos', 'neg', 'pos_and_neg'], 'Wrong indicator type!'
    self.indicator = indicator

    # initial variables
    self.register_buffer('pos_grad', torch.zeros(self.num_classes).to(device_efl))
    self.register_buffer('neg_grad', torch.zeros(self.num_classes).to(device_efl))
    self.register_buffer('pos_neg', torch.ones(self.num_classes).to(device_efl))

    # grad collect
    # self.grad_buffer = [] # don't understand

def forward(self, logits, label): #normalizer=None
    if self.indicator == 'pos':
        indicator = self.pos_grad.detach()
        # indicator = self.grad_buffer[0]
    elif self.indicator == 'neg':
        indicator = self.neg_grad.detach()
    elif self.indicator == 'pos_and_neg':
        indicator = self.pos_neg.detach()
        # indicator = self.pos_neg.detach() + self.neg_grad.detach()
    else:
        raise NotImplementedError
    self.n_c = logits.shape[-1]
    self.logits = logits.reshape(-1, self.n_c)
    self.n_i, _ = self.logits.size()

    #one-hot
    def expand_label(pred, gt_classes):
        target = pred.new_zeros(self.n_i, self.n_c)
        target[torch.arange(self.n_i), gt_classes] = 1
        return target
    if label.dim() == 1:
        expand_target = expand_label(self.logits,label)
    else:
        expand_target = label.clone()

    self.targets = expand_target
    pred = torch.sigmoid(logits)
    pred_t = pred * self.targets + (1 - pred) * (1 - self.targets)
    alpha = torch.tensor(self.focal_alpha)

    # indicator = torch.clamp(indicator,min=0,max=1) # if indicator == 'pos' or neg
    map_val = 1 - indicator
    dy_gamma = self.focal_gamma + self.scale_factor * map_val
    # focusing factor
    ff = dy_gamma.view(1, -1).expand(self.n_i, self.n_c)
    # weighting factor
    wf = ff / self.focal_gamma

    # ce_loss = F.binary_cross_entropy_with_logits(self.logits,self.targets,reduction='none') #
    ce_loss = self.loss_fcn(self.logits, self.targets)
    efl = ce_loss * torch.pow((1 - pred_t), ff.detach()) * wf.detach()

    # to avoid an OOM error
    # torch.cuda.empty_cache()

    if self.focal_alpha >= 0:
        alpha_t = self.focal_alpha * self.targets + (1 - self.focal_alpha) * (1 - self.targets)
        efl = alpha_t * efl

    self.collect_grad(self.logits.detach(), self.targets.detach(),ff.detach(),wf.detach(),alpha.detach())

    if self.reduction == 'mean':
        return efl.mean()
    elif self.reduction == 'sum':
        return efl.sum()
    else:  # 'none'
        return efl
    # self.collect_grad(self.inputs, targets, self.outputs)

def collect_grad(self, logits, targets,ff,wf,alpha):
    pred = torch.sigmoid(logits)
    pos_grad = targets * torch.pow((1-pred),ff) * (ff * pred * torch.log(pred+1e-10) + (pred-1)) * wf
    neg_grad = (1-targets) * torch.pow(pred,ff) * (ff * (1-pred) * torch.log(1-pred+1e-10) - pred) * wf
    if alpha >= 0:
        pos_grad = pos_grad * alpha
        neg_grad = neg_grad * (1 - alpha)
    grad = pos_grad + neg_grad
    grad = torch.abs(grad)
    pos_grad = torch.sum(grad * targets,dim=0)
    neg_grad = torch.sum(grad * (1-targets),dim=0)

    # allreduce(pos_grad)
    # allreduce(neg_grad)

    self.pos_grad += pos_grad
    self.neg_grad += neg_grad
    self.pos_neg = torch.clamp(self.pos_grad / (self.neg_grad + 1e-10), min=0, max=1)
waveboo commented 1 year ago

Hi @yan-ctrl, the NaN problem may be caused by many things. You could:

  1. Try the stabilized setting mentioned in our paper.
  2. Set the scale factor to 0 so that EFL degenerates into focal loss, and check whether focal loss is applicable to your task (a quick check of this degeneration is sketched below).
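
A standalone illustration of suggestion 2 (not code from the repository): with scale_factor = 0 the focusing factor no longer depends on the gradient indicator, so EFL reduces to plain focal loss with gamma equal to focal_gamma and a weighting factor of 1.

import torch

# illustrative check: scale_factor = 0 makes the dynamic gamma constant
focal_gamma = 1.5
scale_factor = 0.0
indicator = torch.rand(80)                      # stand-in for pos_neg
dy_gamma = focal_gamma + scale_factor * (1 - indicator)
wf = dy_gamma / focal_gamma
assert torch.all(dy_gamma == focal_gamma)
assert torch.all(wf == 1.0)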
yan-ctrl commented 1 year ago

Thank you for your suggestion. There are some problems with focal loss.

xiaoche-24 commented 6 months ago

Hey, did applying Softmax-EFL to yolov5 work for you? I have also run into the NaN problem; have you managed to solve it?