ModelTC / United-Perception

Low accuracy after migrating EFL to mmdetection #27

Open EthanChen1234 opened 2 years ago

EthanChen1234 commented 2 years ago

Hi!

When migrating EFL to mmdetection on a private dataset, I only replaced the native FocalLoss with EqualizedFocalLoss, and the mAP dropped to 0.032, far below the native 0.425. I have some questions about the migration:

1. Model differences: the RetinaNet model in UP uses RFS, an iou_branch_loss, hand-crafted anchor generation, and ATSS positive/negative sample assignment. In mmdet, bbox regression uses L1 Loss, 9 anchors are generated per location, and MaxIoUAssigner does the positive/negative assignment.

2. Observation of the positive/negative gradient ratio: self.pos_neg = torch.clamp(self.pos_grad / (self.neg_grad + 1e-10), min=0, max=1); print(self.pos_neg)


log.txt.tar.gz During training, the positive/negative gradient ratios of the different classes barely change, which seems to indicate that EFL is not taking effect?

I have checked the EFL implementation, hyperparameters, etc. several times. Could you help me analyze what the problem is?

EthanChen1234 commented 2 years ago

Follow-up experiment: I changed self.pos_neg = torch.clamp(self.pos_grad / (self.neg_grad + 1e-10), min=0, max=1) to torch.ones(self.num_classes), so that EqualizedFocalLoss degenerates to FocalLoss, and the mAP is a normal 0.442.

So, is it that EFL does not generalize well enough?
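
For reference, a minimal sketch of the per-class factors as formulated in the EFL paper (here `gamma_b` is the base focusing parameter and `scale_factor` the scale hyper-parameter s; the function name and values are illustrative, not the UP code), showing why forcing the gradient ratio to 1 collapses EFL back to plain Focal Loss:

```python
import torch

def efl_factors(pos_neg, gamma_b=2.0, scale_factor=8.0):
    """Per-class focusing/weighting factors following the EFL paper.

    pos_neg: accumulated pos/neg gradient ratio per class, clamped to [0, 1].
    """
    gamma_v = scale_factor * (1.0 - pos_neg)   # class-specific part of the focusing factor
    gamma_j = gamma_b + gamma_v                # focusing factor per class
    weight_j = gamma_j / gamma_b               # re-weighting factor per class
    return gamma_j, weight_j

# With pos_neg forced to all ones, gamma_j == gamma_b and weight_j == 1,
# i.e. EFL degenerates to ordinary Focal Loss -- consistent with the
# 0.442 mAP observation above.
num_classes = 80
gamma_j, weight_j = efl_factors(torch.ones(num_classes))
assert torch.allclose(gamma_j, torch.full((num_classes,), 2.0))
assert torch.allclose(weight_j, torch.ones(num_classes))
```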

waveboo commented 2 years ago

Hi @EthanChen1234, we have demonstrated the generalizability of EFL in our paper. It works well with RetinaNet even without the ATSS strategy.

Actually, the unbearably low performance (3.2 mAP) of your code on mmdet most likely comes from details that were lost during the migration. So I highly recommend checking the gradient collection function. If you use the hook mechanism to collect the gradient, you need to put the hook on the output of the last layer (the classifier). If you compute the gradient manually, please check your derivative formula carefully.

If you still have difficulties, feel free to ask questions here; it would also help to provide your gradient collection function so we can locate the problem. Additionally, the log file you provided is not in UTF-8 format, and we could not open it.
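
For illustration, a minimal, self-contained sketch of the tensor-hook approach (not the UP implementation; `GradCollector` and the toy shapes are made up for the example). The hook is registered on the classification logits so that the backward pass delivers d(loss)/d(logits) to the collector:

```python
import torch

class GradCollector:
    """Accumulates per-class positive/negative gradients from the cls logits."""

    def __init__(self, num_classes):
        self.pos_grad = torch.zeros(num_classes)
        self.neg_grad = torch.zeros(num_classes)

    def collect_grad(self, grad, one_hot_targets):
        # grad: d(loss)/d(logits), shape (num_anchors, num_classes)
        grad = grad.detach().abs()
        self.pos_grad += (grad * one_hot_targets).sum(dim=0)
        self.neg_grad += (grad * (1.0 - one_hot_targets)).sum(dim=0)

    def pos_neg_ratio(self):
        return torch.clamp(self.pos_grad / (self.neg_grad + 1e-10), min=0, max=1)


# Toy usage: hook the classification logits so the backward pass feeds the collector.
num_anchors, num_classes = 16, 4
collector = GradCollector(num_classes)
logits = torch.randn(num_anchors, num_classes, requires_grad=True)
targets = torch.zeros(num_anchors, num_classes)
targets[torch.arange(num_anchors), torch.randint(num_classes, (num_anchors,))] = 1.0

logits.register_hook(lambda g: collector.collect_grad(g, targets))
loss = torch.nn.functional.binary_cross_entropy_with_logits(logits, targets)
loss.backward()
print(collector.pos_neg_ratio())
```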

EthanChen1234 commented 2 years ago

@waveboo Thanks for the quick reply!!!

1. Gradient collection hook. Hook position: (retina_cls): Conv2d(256, 369, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)). The attachment lists the model structure of bbox_head; the gradient collection hook follows the implementation in UP (see the sketch at the end of this comment). gradient_collector_hook.py.zip

2. EFL implementation (gradient collection): based on the original efl implementation, with the interface and parameters adapted. efl.py.zip

3. Training log: includes environment info, the complete model, accuracy metrics, and self.pos_neg at every iteration. log.tar.gz

4. Loss observation: with focal loss, loss_cls stays relatively large.


With efl, loss_cls quickly drops to 0.02 (see the training log for details).
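
As a cross-check of the hook position above, a hypothetical sketch (toy channel counts, not the attached code) of a full backward hook on a classifier conv like retina_cls; grad_output[0] is the gradient of the loss w.r.t. the classification logits, which is what the per-class pos/neg statistics need:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy sizes (not the real head): 1 anchor per location, 8 classes.
num_anchors_per_loc, num_classes = 1, 8
retina_cls = nn.Conv2d(256, num_anchors_per_loc * num_classes, kernel_size=3, padding=1)

collected = []

def grad_hook(module, grad_input, grad_output):
    # grad_output[0] is d(loss)/d(cls logits), shape (N, A*C, H, W)
    g = grad_output[0].detach()
    collected.append(g.permute(0, 2, 3, 1).reshape(-1, num_classes))

retina_cls.register_full_backward_hook(grad_hook)

feat = torch.randn(2, 256, 8, 8)
logits = retina_cls(feat)
loss = F.binary_cross_entropy_with_logits(logits, torch.zeros_like(logits))
loss.backward()
print(collected[0].shape)   # torch.Size([128, 8]) == (N*H*W*A, num_classes)
```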

waveboo commented 2 years ago

Hi @EthanChen1234, I checked your hook position and it looks correct. However, under normal conditions the final gradient ratios should have some categories close to 1 (the frequent classes). So I suspect there is a problem in how pos_grad/neg_grad are matched with the gt targets. I can't help you debug further because you use your own code and dataset, so I suggest you do either of these two things:

  1. Train LVIS v1.0 with your mmdet code, to check whether the mmdet implementation is correct.
  2. Train your dataset with the UP code, to check whether EFL truly has a problem. Any questions are welcome to discuss here, thanks~

EthanChen1234 commented 2 years ago

@waveboo @Joker-co I trained on UP (the previous code base, EOD), using efl_improved_baseline_r50_2x_rfs.yaml as the template. During training I printed the positive gradient, the negative gradient, and their ratio: self.pos_neg = torch.clamp(self.pos_grad / (self.neg_grad + 1e-10), min=0, max=1); print(self.pos_grad); print(self.neg_grad); print(self.pos_neg)  # stays at 1 throughout training

During training, the positive gradient is larger than the negative gradient and self.pos_neg stays at 1 (so EFL effectively becomes FL). train.log.zip

That does not match expectations, does it?

waveboo commented 2 years ago

Hi @EthanChen1234, I have checked your training log. The positive gradient is indeed greater than the negative gradient most of the time, which means EFL is equivalent to FL. Your experiment demonstrates two things:

  1. Your mmdet implementation has some problem in the gradient collection mechanism, which you need to check carefully.
  2. The training status of your dataset (with RFS) is relatively balanced, because the gradient ratios are almost equal to 1 for all classes. In this situation, as we claim in our paper, it makes little difference whether EFL or FL is used. In fact, even the COCO dataset is not absolutely balanced, since some categories have rather few training instances (like hair drier).

EthanChen1234 commented 2 years ago

@waveboo This is the training log on EOD: train.log.zip. In it, self.pos_neg stays at 1.

This is the training log on mmdet: log.tar.gz. In it, self.pos_neg stays far below 1 throughout training.

waveboo commented 2 years ago

@EthanChen1234 The UP result is more credible because it comes from the original code base. If your mmdet code has no bugs, it should give a result similar to UP's (with the same detector and RFS), because the precision of UP should differ little from that of mmdet.

FL77N commented 2 years ago

> @waveboo This is the training log on EOD: train.log.zip. In it, self.pos_neg stays at 1.
>
> This is the training log on mmdet: log.tar.gz. In it, self.pos_neg stays far below 1 throughout training.

I have also used EFL on mmdet before, and as I recall self.pos_neg there was also basically 1.

EthanChen1234 commented 2 years ago

@FL77N Hello, I still have some problems reproducing this on mmdet. Could you help me review it?

Hahahdamowang commented 2 years ago

> Hi!
>
> When migrating EFL to mmdetection on a private dataset, I only replaced the native FocalLoss with EqualizedFocalLoss, and the mAP dropped to 0.032, far below the native 0.425. I have some questions about the migration:
>
> 1. Model differences: the RetinaNet model in UP uses RFS, an iou_branch_loss, hand-crafted anchor generation, and ATSS positive/negative sample assignment. In mmdet, bbox regression uses L1 Loss, 9 anchors are generated per location, and MaxIoUAssigner does the positive/negative assignment.
>
> 2. Observation of the positive/negative gradient ratio: self.pos_neg = torch.clamp(self.pos_grad / (self.neg_grad + 1e-10), min=0, max=1); print(self.pos_neg) log.txt.tar.gz During training, the positive/negative gradient ratios of the different classes barely change, which seems to indicate that EFL is not taking effect?
>
> I have checked the EFL implementation, hyperparameters, etc. several times. Could you help me analyze what the problem is?

Hello, I used your reproduced loss code in mmdet. During training, the memory keeps growing until it runs out of memory. How can this be solved?

wudizuixiaosa commented 2 years ago

> Hi!
>
> When migrating EFL to mmdetection on a private dataset, I only replaced the native FocalLoss with EqualizedFocalLoss, and the mAP dropped to 0.032, far below the native 0.425. I have some questions about the migration:
>
> 1. Model differences: the RetinaNet model in UP uses RFS, an iou_branch_loss, hand-crafted anchor generation, and ATSS positive/negative sample assignment. In mmdet, bbox regression uses L1 Loss, 9 anchors are generated per location, and MaxIoUAssigner does the positive/negative assignment.
>
> 2. Observation of the positive/negative gradient ratio: self.pos_neg = torch.clamp(self.pos_grad / (self.neg_grad + 1e-10), min=0, max=1); print(self.pos_neg) log.txt.tar.gz During training, the positive/negative gradient ratios of the different classes barely change, which seems to indicate that EFL is not taking effect?
>
> I have checked the EFL implementation, hyperparameters, etc. several times. Could you help me analyze what the problem is?

Hi, have you solved this problem?

shiyuanyu123 commented 1 year ago

@EthanChen1234 Hello, I looked at your reproduction of the EFL code. There is a reverse-order operation in collect_grad, and I have the same question you had. If you now understand the purpose of this reverse-order operation, could you explain it? Thanks.

waveboo commented 1 year ago

@shiyuanyu123 The reverse order is designed for the subnet. For example, ATSS has five levels of output. During the forward pass, EFL concatenates the gt labels corresponding to these outputs from front to back and records them with cache_target. During the backward pass, the five outputs correspond to five hooks, all of which call EFL's collect_grad; they are invoked in the reverse order of the five levels (similar to popping a stack). So once EFL has collected all five levels, it reverses the order of grad_buffer so that it lines up one-to-one with the gt recorded in cache_target.
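
A minimal sketch of this bookkeeping (illustrative names, not the UP code): targets are cached level by level during the forward pass, the backward hooks typically fire in the reverse level order, and the buffer is flipped before concatenation so it lines up with the cached targets:

```python
import torch
import torch.nn.functional as F

cached_targets = []   # filled during the forward pass, levels 0..4
grad_buffer = []      # (level, grad) pairs appended as the backward hooks fire


def forward_level(level, logits, targets):
    cached_targets.append(targets)
    logits.register_hook(lambda g, lvl=level: grad_buffer.append((lvl, g.detach())))
    return logits


num_levels, anchors_per_level, num_classes = 5, 8, 4
losses = []
for lvl in range(num_levels):
    logits = torch.randn(anchors_per_level, num_classes, requires_grad=True)
    targets = torch.randint(0, 2, (anchors_per_level, num_classes)).float()
    out = forward_level(lvl, logits, targets)
    losses.append(F.binary_cross_entropy_with_logits(out, targets))

sum(losses).backward()

print([lvl for lvl, _ in grad_buffer])            # typically [4, 3, 2, 1, 0]: stack-like order
grads = torch.cat([g for _, g in grad_buffer[::-1]], dim=0)
targets_all = torch.cat(cached_targets, dim=0)
assert grads.shape == targets_all.shape           # reversed buffer matches cached targets
```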

shiyuanyu123 commented 1 year ago

@waveboo Thanks! I have another question. On the first iteration, no gradient has been back-propagated yet when EFL is computed. Does that mean the Focal Factor and Weight Factor are not computed on the first iteration, and that in later iterations these two factors are always computed from the positive/negative gradients accumulated over previous iterations?

waveboo commented 1 year ago

@shiyuanyu123 Yes. On the first iteration the accumulated gradient ratio is initialized to 1, so at that point EFL is equivalent to ordinary FL. As training proceeds, EFL keeps recomputing the gradient ratio from the accumulated positive and negative gradients and reallocates the learning focus accordingly.
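
A minimal sketch of that iteration-to-iteration bookkeeping (illustrative, not the UP code): the ratio starts at 1, so step 1 behaves like plain FL, and each later step recomputes it from the running gradient sums:

```python
import torch

num_classes = 4
pos_grad = torch.zeros(num_classes)
neg_grad = torch.zeros(num_classes)
pos_neg = torch.ones(num_classes)          # iteration 1: ratio initialized to 1, EFL == FL

for step in range(3):
    # ... forward/backward would run here using factors derived from `pos_neg`;
    # the backward hook then adds this step's per-class gradient magnitudes
    # (faked with random numbers below).
    pos_grad += torch.rand(num_classes) * 0.1
    neg_grad += torch.rand(num_classes)

    # The factors used at the next iteration come from the accumulated sums.
    pos_neg = torch.clamp(pos_grad / (neg_grad + 1e-10), min=0, max=1)
    print(step, pos_neg)
```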

shiyuanyu123 commented 1 year ago

@waveboo Thank you very much!