chenjun2hao / CenterFace.pytorch

Unofficial version of CenterFace, which achieves the best balance between speed and accuracy in face detection

Why does the loss suddenly become all NaN during training? #34

Open wuxiaolianggit opened 4 years ago

wuxiaolianggit commented 4 years ago

Hello! This problem suddenly appeared during training. How do I fix it?

```
multi_pose/dla |################# | train: [4][882/1583]|Tot: 0:02:16 |ETA: 0:01:49 |loss 3.2818 |hm_loss 2.8268 |lm_loss 3.1342 |wh_loss 0.5359 |off_loss 0.0880 |Data 0.001s(0.002s) |Net 0.
multi_pose/dla |################# | train: [4][883/1583]|Tot: 0:02:16 |ETA: 0:01:49 |loss 3.2804 |hm_loss 2.8256 |lm_loss 3.1322 |wh_loss 0.5357 |off_loss 0.0880 |Data 0.001s(0.002s) |Net 0.
multi_pose/dla |################# | train: [4][884/1583]|Tot: 0:02:16 |ETA: 0:01:48 |loss 3.2805 |hm_loss 2.8258 |lm_loss 3.1317 |wh_loss 0.5357 |off_loss 0.0880 |Data 0.001s(0.002s) |Net 0.
multi_pose/dla |################# | train: [4][885/1583]|Tot: 0:02:16 |ETA: 0:01:47 |loss 3.2797 |hm_loss 2.8251 |lm_loss 3.1297 |wh_loss 0.5357 |off_loss 0.0880 |Data 0.002s(0.002s) |Net 0.
multi_pose/dla |################################| train: [4][1582/1583]|Tot: 0:04:02 |ETA: 0:00:01 |loss nan |hm_loss nan |lm_loss nan |wh_loss nan |off_loss nan |Data 0.001s(0.001s) |Net 0.153s
WARNING:root:NaN or Inf found in input tensor.
WARNING:root:NaN or Inf found in input tensor.
WARNING:root:NaN or Inf found in input tensor.
WARNING:root:NaN or Inf found in input tensor.
WARNING:root:NaN or Inf found in input tensor.
```

@chenjun2hao

bendanzzc commented 4 years ago

The sigmoid applied to output['hm'] is not clipped, so if it produces exactly 0 or 1 the focal loss returns NaN. Add `output['hm'] = torch.clamp(output['hm'], min=1e-4, max=1-1e-4)` before computing the loss and the problem goes away.
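To make the failure mode concrete, here is a small self-contained sketch (simplified from the CenterNet-style focal loss, not the repo's exact code) showing how a prediction of exactly 1.0 turns the loss into NaN, and how the clamp prevents it:

```python
import torch

def focal_loss(pred, gt):
    # Simplified CenterNet-style focal loss: log(pred) at positive
    # locations, log(1 - pred) at negative ones.
    pos = gt.eq(1).float()
    neg = 1.0 - pos
    pos_loss = torch.log(pred) * (1 - pred) ** 2 * pos
    neg_loss = torch.log(1 - pred) * pred ** 2 * neg
    return -(pos_loss + neg_loss).sum()

gt = torch.tensor([1.0, 0.0])
pred = torch.tensor([1.0, 0.5])             # heatmap saturated at exactly 1.0
print(focal_loss(pred, gt))                 # tensor(nan): log(1 - 1) = -inf, and -inf * 0 = nan

pred = torch.clamp(pred, min=1e-4, max=1 - 1e-4)
print(focal_loss(pred, gt))                 # tensor(0.1733): finite again
```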

Q-Wang7 commented 4 years ago

> The sigmoid applied to output['hm'] is not clipped, so if it produces exactly 0 or 1 the focal loss returns NaN. Add `output['hm'] = torch.clamp(output['hm'], min=1e-4, max=1-1e-4)` before computing the loss and the problem goes away.

```python
output['hm'] = torch.clamp(output['hm'], min=1e-4, max=1 - 1e-4)
hm_loss += self.crit(output['hm'], batch['hm']) / opt.num_stacks  # 1. focal loss for the object-center heatmap
```

Is this what you meant? The loss still goes NaN after adding it.

bendanzzc commented 4 years ago

> ```python
> output['hm'] = torch.clamp(output['hm'], min=1e-4, max=1 - 1e-4)
> hm_loss += self.crit(output['hm'], batch['hm']) / opt.num_stacks  # 1. focal loss for the object-center heatmap
> ```
>
> Is this what you meant? The loss still goes NaN after adding it.

I trained from scratch on my own dataset, and after this change the NaN is gone. Check for similar danger spots elsewhere; a log() should generally get a small epsilon added. If all else fails, you can initialize from his pretrained model and lower the learning rate, which also avoids the NaN.
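As a sketch of the "add an epsilon to every log" advice (the helper name here is made up for illustration):

```python
import torch

def safe_log(x, eps=1e-12):
    # Guard against log(0) = -inf by flooring the input with a small epsilon.
    return torch.log(x.clamp(min=eps))

p = torch.tensor([0.0, 0.5, 1.0])
print(torch.log(p))   # tensor([   -inf, -0.6931,  0.0000])
print(safe_log(p))    # tensor([-27.6310, -0.6931,  0.0000])
```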

Q-Wang7 commented 4 years ago

> I trained from scratch on my own dataset, and after this change the NaN is gone. Check for similar danger spots elsewhere; a log() should generally get a small epsilon added. If all else fails, you can initialize from his pretrained model and lower the learning rate, which also avoids the NaN.

OK, thanks!

ucashyq commented 3 years ago

> The sigmoid applied to output['hm'] is not clipped, so if it produces exactly 0 or 1 the focal loss returns NaN. Add `output['hm'] = torch.clamp(output['hm'], min=1e-4, max=1-1e-4)` before computing the loss and the problem goes away.

Nonsense. The sigmoid function's output range is (0, 1), so it never reaches 0 or 1. The real cause of the NaN is that the landmarks are not normalized: faces of different sizes produce wildly different target values.

bendanzzc commented 3 years ago

> Nonsense. The sigmoid function's output range is (0, 1), so it never reaches 0 or 1. The real cause of the NaN is that the landmarks are not normalized: faces of different sizes produce wildly different target values.

Try it in PyTorch yourself, or check the docs for sigmoid's actual value range, before calling it nonsense; otherwise the face-slap really hurts. As for normalization: without it, the loss can diverge early in training and easily step into extreme values, but that only becomes NaN when there is no clamp protection. You can train without the normalization too.
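The disputed point is easy to check: mathematically sigmoid maps into the open interval (0, 1), but in float32 it rounds to exactly 1.0 (or underflows to exactly 0.0) once the logit magnitude is large enough. A quick check:

```python
import torch

x = torch.tensor([20.0, -110.0])   # large-magnitude logits (float32 by default)
p = torch.sigmoid(x)
print(p[0].item() == 1.0)          # True: rounds up to exactly 1.0 in float32
print(p[1].item() == 0.0)          # True: underflows to exactly 0.0 in float32
print(torch.log(1 - p[0]))         # tensor(-inf), which becomes NaN inside the focal loss
```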

kaijieshi7 commented 3 years ago

> Nonsense. The sigmoid function's output range is (0, 1), so it never reaches 0 or 1. The real cause of the NaN is that the landmarks are not normalized: faces of different sizes produce wildly different target values.

How should the landmarks be normalized, then?
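For what it's worth, one common scheme in center-based detectors is to regress each landmark as an offset from the box center divided by the box width and height, so targets stay in a comparable range across face scales. A minimal sketch under that assumption; the names below are illustrative, not the repo's actual code:

```python
import torch

def normalize_landmarks(lms, ctr, wh):
    # lms: (K, 2) landmark coords; ctr: (2,) box center; wh: (2,) box width/height.
    # Dividing by the box size keeps regression targets in a comparable
    # range for big and small faces, which tames the lm_loss.
    return (lms - ctr) / wh

# A 100px face and a 10px face yield targets of the same magnitude:
big   = normalize_landmarks(torch.tensor([[130., 140.]]), torch.tensor([100., 100.]), torch.tensor([100., 100.]))
small = normalize_landmarks(torch.tensor([[103., 104.]]), torch.tensor([100., 100.]), torch.tensor([10., 10.]))
print(big, small)   # both tensor([[0.3000, 0.4000]])
```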