loss = nan problem - Githubissues

MaybeShewill-CV / lanenet-lane-detection

Unofficial implementation of lanenet model for real time lane detection

Apache License 2.0

2.36k stars 886 forks

loss = nan problem #33

Closed sunmiaozju closed 6 years ago

sunmiaozju commented 6 years ago

Hello, when I train on a dataset I made myself, I get loss = nan. Why does this happen, and where does it usually go wrong? The error output is as follows: [image: 2018-07-25 11-13-06]

yiyichun commented 6 years ago

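@ding-hai-tao, I trained with about 3000 images and validated with about 700. train_accuracy is around 0.99. When I test on video, the lane information is sometimes not detected. However, adding more training data and retraining should improve the results further.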

stubbornstubborn commented 6 years ago

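@yiyichun I used 70% of the images as the training set and 30% as the validation set. Accuracy on the validation set is above 0.98 but below 0.99. When I tested on individual images, the results were not very good. How did you set your parameters? I also feel this amount of data is a bit small. Did you use traditional data augmentation to increase the training data?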

globalmaster commented 6 years ago

@MaybeShewill-CV It trains now, thank you.

Hello, I am using the TuSimple dataset and I also get nan. How did you solve it? Thanks.

phjhuang commented 6 years ago

@yiyichun

How well does training on CULane work?

leonfrank commented 6 years ago

@sunmiaozju Hello, how did you set your hyperparameters? As soon as I reach the second batch, the loss becomes nan.

131404060321 commented 6 years ago

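@MaybeShewill-CV After I removed the intersection samples (which have no annotations in gt_label) from the training set, the program ran normally and loss = nan did not appear immediately. After more than ten thousand training iterations, however, loss = nan appeared, as shown: [image: loss nan_10245]. This should not be a data annotation problem, right? Do I need to adjust the parameters?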

MaybeShewill-CV commented 6 years ago

@131404060321 This is probably not a data annotation problem. You most likely need to adjust the parameters.

MaybeShewill-CV commented 6 years ago

@All The recently updated code should make the training process more stable. You are welcome to test the new code.

fayechou commented 5 years ago

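@ding-hai-tao, your approach is the same as mine: after lowering the learning rate I was able to train, but once the epoch count reaches a certain number, nan still appears. Also, accuracy stays only between 0.5 and 0.6. Have you run into this? Thanks!!

Hello, my problem is exactly the same as yours. Have you solved it? How did you solve it?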

Sand0001 commented 5 years ago

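It trains now, thank you.

How should the instance and binary labels be made for a dataset that contains discontinuous dashed lines like this? Is each class one instance, or is each segment one instance? And how should the corresponding binary label be made? Any guidance would be greatly appreciated.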

zacario-li commented 5 years ago

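I am using the TuSimple dataset, and after 50 epochs of training the accuracy dropped straight to 0. I see that many people in this issue ran into the same problem, but nobody said what the cause is. @MaybeShewill-CV, do you have a rough idea of what is wrong?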

ASONG0506 commented 5 years ago

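I am using the TuSimple dataset, and after 50 epochs of training the accuracy dropped straight to 0. I see that many people in this issue ran into the same problem, but nobody said what the cause is. @MaybeShewill-CV, do you have a rough idea of what is wrong?

Have you solved this problem?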

MaybeShewill-CV commented 5 years ago

@ASONG0506 You can check whether your TensorFlow version and your copy of the code are up to date. I have never run into this problem locally :)

niskov commented 5 years ago

I found what causes the nan. It comes from tf.norm in the discriminative_loss_single function. When the input to tf.norm is 0 (or very close to it), the derivative of tf.norm goes to inf. I replaced tf.norm with tf.reduce_sum(tf.square()) and training now works well. Since tf.reduce_sum(tf.square()) is the squared norm and is not identical to tf.norm, the cutoff variance distance and cutoff cluster distance parameters should also be changed. The right values depend on the distribution of the dataset.

Could you please post your code after the changes?
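The failure mode described above can be reproduced without TensorFlow. The following is a minimal pure-Python sketch, not the code from the repository or from the commenter. The function names (`sqrt_grad`, `norm_grad`, `squared_norm_grad`) and the `eps` value are illustrative assumptions. It applies the chain rule by hand to sqrt(sum(x_i^2)), the way an autodiff framework would, and compares it with the squared norm and with a small epsilon added under the square root.

```python
import math

def sqrt_grad(s):
    # d/ds sqrt(s) = 1 / (2 * sqrt(s)), which is unbounded as s -> 0.
    # Python raises on float division by zero, so the IEEE result (inf)
    # is returned explicitly here.
    return math.inf if s == 0.0 else 0.5 / math.sqrt(s)

def norm_grad(x, eps=0.0):
    """Chain-rule gradient of sqrt(sum(x_i^2) + eps) with respect to x."""
    s = sum(v * v for v in x) + eps
    outer = sqrt_grad(s)                  # inf when s == 0
    return [outer * 2.0 * v for v in x]   # inf * 0.0 evaluates to nan

def squared_norm_grad(x):
    """Gradient of sum(x_i^2), which is 2 * x and finite everywhere."""
    return [2.0 * v for v in x]

zero = [0.0, 0.0]
print(norm_grad(zero))              # [nan, nan]
print(norm_grad(zero, eps=1e-12))   # [0.0, 0.0]
print(squared_norm_grad(zero))      # [0.0, 0.0]
```

Either workaround keeps the gradient finite at a zero vector. With the squared norm the distances are on a different scale, so the cutoff thresholds would need re-tuning, as noted above. Adding a small epsilon keeps the original scale, but whether it suits a given dataset is something to verify by experiment.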