Loss suddenly becomes NaN and accuracy drops to 0 during training - Githubissues

KaihuaTang / Long-Tailed-Recognition.pytorch

[NeurIPS 2020] This project provides a strong single-stage baseline for Long-Tailed Classification, Detection, and Instance Segmentation (LVIS). It is also a PyTorch implementation of the NeurIPS 2020 paper 'Long-Tailed Classification by Keeping the Good and Removing the Bad Momentum Causal Effect'.

GNU General Public License v3.0

560 stars 68 forks

Loss suddenly becomes NaN and accuracy drops to 0 during training #17

Open isunLt opened 3 years ago

isunLt commented 3 years ago

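Hello, and thank you for sharing the code. While training the model, the loss suddenly became NaN and the accuracy dropped to 0. I trained from scratch twice and hit the same problem both times. My training environment is:

Nvidia RTX 2080Ti, CUDA 10.1 + cuDNN 7.6.3
Python 3.8.5 + PyTorch 1.5.1
Dataset: ILSVRC2015
The GPU has only 11 GB of memory, so I changed batch_size to 64 in the config
Everything else is unchanged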

Do you know what the cause might be? Was the ImageNet-LT you used extracted from ILSVRC2015?

[Screenshot: loss2nan]

[Screenshot: loss2nan2]

isunLt commented 3 years ago

Sorry to bother you. After switching to Python 3.7 and PyTorch 1.6, the problem went away.

isunLt commented 3 years ago

Sorry, I'm back again. After switching to Python 3.7 and PyTorch 1.6, the same problem still appeared near the end of training.

[Screenshot: 微信截图_20201026093554]

KaihuaTang commented 3 years ago

Sorry, I have not run into a similar problem, and I don't know why this happens.

KaihuaTang commented 3 years ago

Perhaps it is because the batch size was changed, so the learning rate needs to be adjusted accordingly?
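
One common way to adjust the learning rate when the batch size changes is the linear scaling heuristic, which keeps the ratio of learning rate to batch size constant. The sketch below is only an illustration of that heuristic, not something confirmed in this thread: the function name `scale_learning_rate` is hypothetical, and the base values (learning rate 0.1, batch size 256) are placeholders rather than the repository's actual defaults.

```python
def scale_learning_rate(base_lr: float, base_batch_size: int, new_batch_size: int) -> float:
    """Linear scaling heuristic: keep lr / batch_size constant.

    This is a rule of thumb, not a guarantee; the scaled value may still
    need tuning for a given model and dataset.
    """
    if base_batch_size <= 0 or new_batch_size <= 0:
        raise ValueError("batch sizes must be positive")
    return base_lr * new_batch_size / base_batch_size


# Placeholder values: the config's real defaults are not stated in this thread.
new_lr = scale_learning_rate(base_lr=0.1, base_batch_size=256, new_batch_size=64)
print(new_lr)
```

With a smaller batch, the heuristic gives a proportionally smaller learning rate, which would then replace the learning rate in the training config.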

KaihuaTang commented 3 years ago

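Another possibility is that you need to add a small constant, 1e-9 or 1e-12, to the denominator of every normalize operation. For some reason the norm in the denominator may have become too small during training, although I have not run into this myself.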
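
To illustrate the idea, here is a minimal pure-Python sketch of L2 normalization with a small constant added to the denominator. The helper name `l2_normalize` is hypothetical and this is not the repository's code. In PyTorch the analogous change would presumably be to write `x / (x.norm(dim=1, keepdim=True) + eps)` wherever a tensor is divided by its norm.

```python
import math


def l2_normalize(vec, eps=1e-12):
    """Divide vec by its L2 norm, with eps added to the denominator.

    Without eps, a zero (or underflowed) norm makes the division fail
    or, in tensor libraries, produce NaN/inf values.
    """
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / (norm + eps) for v in vec]


print(l2_normalize([3.0, 4.0]))  # close to [0.6, 0.8]
print(l2_normalize([0.0, 0.0]))  # stays finite: all zeros
```

The constant is small enough that it barely changes the result for vectors of ordinary magnitude, and it only matters when the norm itself approaches zero.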

isunLt commented 3 years ago

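> Another possibility is that you need to add a small constant, 1e-9 or 1e-12, to the denominator of every normalize operation. For some reason the norm in the denominator may have become too small during training, although I have not run into this myself.

Thank you, I'll give it a try.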

deepkun commented 3 years ago

Has your problem been solved? I changed the norm and still get NaN. My loss drops very quickly and becomes NaN within a single epoch.

isunLt commented 3 years ago

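> Has your problem been solved? I changed the norm and still get NaN. My loss drops very quickly and becomes NaN within a single epoch.

It was too long ago and I have forgotten, sorry.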

yufu commented 2 years ago

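> Another possibility is that you need to add a small constant, 1e-9 or 1e-12, to the denominator of every normalize operation. For some reason the norm in the denominator may have become too small during training, although I have not run into this myself.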

I had the same problem and I fixed it by following Tang's advice. That's really helpful, thx.