becauseofAI / lffd-pytorch

A light and fast one-class detection framework for edge devices. We provide a face detector, head detector, pedestrian detector, vehicle detector, and more.
MIT License

train loss change from normal to NAN #2

Open dtiny opened 4 years ago

dtiny commented 4 years ago

Provided code: python configuration_10_320_20L_5scales_v2.py
Provided data: widerface_train_data_gt_8.pkl
At the beginning, the training loss converged normally. Around iteration 3400, the loss diverged to NaN.
How can this problem be solved?

coderhss commented 4 years ago

I ran into the same problem.

xinyikb commented 4 years ago

same problem +1

Brain-Lee commented 4 years ago

Have you found the inference code?

120276215 commented 4 years ago

The code has bugs:

  1. The loss is written incorrectly, in the hard example mining part.
  2. The gray regions are also never used in the loss.

Fix 1 first; if that is not enough, also lower the initial learning rate. Fixing 2 is optional.

suyue6 commented 4 years ago

Same problem here +1

suyue6 commented 4 years ago

> The code has bugs:
>
> 1. The loss is written incorrectly, in the hard example mining part.
> 2. The gray regions are also never used in the loss.
>
> Fix 1 first; if that is not enough, also lower the initial learning rate. Fixing 2 is optional.

Hi, could you explain how to make these changes in detail? Thanks!

Jialeen commented 4 years ago

Has anyone solved this problem?

120276215 commented 4 years ago

> The code has bugs:
>
> 1. The loss is written incorrectly, in the hard example mining part.
> 2. The gray regions are also never used in the loss.
>
> Fix 1 first; if that is not enough, also lower the initial learning rate. Fixing 2 is optional.

> Hi, could you explain how to make these changes in detail? Thanks!

https://github.com/becauseofAI/lffd-pytorch/blob/f7da857f7ea939665b81d7bfedb98d02f4147723/ChasingTrainFramework_GeneralOneClassDetection/loss_layer_farm/loss.py#L112

Change it to: torch.ones_like(pred_score_softmax[:, 1, :, :]).add(1))
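For context, a minimal sketch of a hard-negative-mining classification loss of the kind being discussed. This is not the repo's actual implementation; `hnm_classification_loss` and its mask arguments are hypothetical names, only `pred_score_softmax` follows the snippet above. The key point is clamping probabilities before `log()` so a zero score cannot produce NaN, and leaving gray regions out of both masks:

```python
import torch

def hnm_classification_loss(pred_score_softmax, pos_mask, neg_mask,
                            neg_ratio=10, eps=1e-10):
    """Cross-entropy with online hard negative mining (illustrative sketch).

    pred_score_softmax: (N, 2, H, W) softmax scores, channel 1 = object.
    pos_mask / neg_mask: (N, H, W) float masks; gray regions are simply
    excluded from both masks, so they contribute no gradient.
    """
    # Clamp probabilities away from 0 so log() cannot produce NaN/Inf.
    pos_prob = pred_score_softmax[:, 1, :, :].clamp(min=eps)
    neg_prob = pred_score_softmax[:, 0, :, :].clamp(min=eps)

    pos_loss = -torch.log(pos_prob) * pos_mask
    neg_loss_all = (-torch.log(neg_prob) * neg_mask).flatten()

    # Hard negative mining: keep only the hardest negatives,
    # at most neg_ratio times the number of positives.
    num_pos = int(pos_mask.sum().item())
    num_neg = min(max(num_pos * neg_ratio, 1), neg_loss_all.numel())
    hard_neg_loss, _ = neg_loss_all.topk(num_neg)

    return (pos_loss.sum() + hard_neg_loss.sum()) / max(num_pos, 1)
```

Even with a degenerate prediction (object probability exactly 0 at a positive location), the clamp keeps the loss finite instead of NaN.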

Jialeen commented 4 years ago

> The code has bugs:
>
> 1. The loss is written incorrectly, in the hard example mining part.
> 2. The gray regions are also never used in the loss.
>
> Fix 1 first; if that is not enough, also lower the initial learning rate. Fixing 2 is optional.
>
> https://github.com/becauseofAI/lffd-pytorch/blob/f7da857f7ea939665b81d7bfedb98d02f4147723/ChasingTrainFramework_GeneralOneClassDetection/loss_layer_farm/loss.py#L112
>
> Change it to: torch.ones_like(pred_score_softmax[:, 1, :, :]).add(1))

Even after making this change, I still get NaN.

chenjun2hao commented 4 years ago

The same problem occurs during training.

deep-practice commented 4 years ago

@becauseofAI Any suggestions?

Manideep08 commented 4 years ago

Did anyone find any solution?

junaiddk commented 3 years ago

Anyone found the solution to this problem?

afterimagex commented 3 years ago

I'm at a loss for words; this code seems to have been released just to trip people up.

CodexForster commented 3 years ago

Try reducing the learning rate (variable name: param_learning_rate) to 0.01 in the configuration file. If you are using V2, that is configuration_10_320_20L_5scales_v2.py. This let me train for 2,000,000 training loops. EDIT: I see that user 120276215 already advised the same, so credit to them.
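Beyond lowering the initial learning rate up front, a generic safeguard is to watch the loss during training and back off as soon as it stops being finite. A minimal sketch (`guard_step` is a hypothetical helper, not part of this repo):

```python
import math

def guard_step(loss_value, optimizer, shrink=0.1):
    """Skip the update and shrink the learning rate when the loss blows up.

    loss_value: the scalar loss as a Python float (e.g. loss.item()).
    Returns True if it is safe to call optimizer.step().
    """
    if not math.isfinite(loss_value):
        # Cut the learning rate on every parameter group and skip this step,
        # so a single NaN/Inf batch does not corrupt the weights.
        for group in optimizer.param_groups:
            group['lr'] *= shrink
        return False
    return True
```

In the training loop this would be called between `loss.backward()` and `optimizer.step()`, stepping only when it returns True.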