loss = nan problem - Githubissues

sunmiaozju commented 6 years ago

Hello, when I train on a dataset I built myself, I get loss=nan. Why does this happen, and where does the error usually come from? The error output is in this screenshot: 2018-07-25 11-13-06

sunmiaozju commented 6 years ago

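Also, it does not seem to be one particular batch of images that triggers the nan. I extracted the batch that had just produced nan (4 images) and looped over those images as a new dataset. Training on that new dataset ran normally, so these images do not seem to cause the nan on their own.

Another thing I noticed: when the nan appears, every pixel of the predicted binary map and of the instance segmentation map is 0.

In addition, the nan appears at a different epoch on each training run, and with a different image batch.

Do you have any suggestions? Where do you think the problem is? @MaybeShewill-CV Thanks.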

MaybeShewill-CV commented 6 years ago

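@sunmiaozju First make sure your labels are correct. Beyond that, you may need to tune the hyperparameters yourself for a different dataset. Most nan problems are caused by the labels.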

sunmiaozju commented 6 years ago

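I found that before loss=nan appears, the pixels of the forward output binary_logits are all 0. So I changed the training procedure: when the forward output binary_logits is 0, I skip the batch instead of running the parameter update. But every batch after that shows the same problem, i.e. the forward output is all 0. So I now think the cause is not the dataset labels but exploding model parameters. I am still looking for the exact reason.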

MaybeShewill-CV commented 6 years ago

@sunmiaozju You need to tune the hyperparameters.

yiyichun commented 6 years ago

Hi, @sunmiaozju: Which parts of the code did you modify for training? My problem is that the accuracy during training is always very close to 0. Thanks for your help :)

sunmiaozju commented 6 years ago

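@yiyichun Are you using a dataset you made yourself, or the TuSimple dataset? I barely changed the model structure for training. Could it be a problem with your dataset?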

yiyichun commented 6 years ago

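Hi, @sunmiaozju: I just tried the dataset the author provides, i.e. the 6 images in the three folders data\training_data_example\gt_image_binary, data\training_data_example\gt_image_instance, and data\training_data_example\image. I duplicated them to get 30 images and set BATCH_SIZE=2 (with the original 4 I run out of memory), but the result is still the same ><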

MaybeShewill-CV commented 6 years ago

@yiyichun Wait a few more epochs and check again.

sunmiaozju commented 6 years ago

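@MaybeShewill-CV Hello, I generated a dataset from the tusimple .json files and the model trains on it. With my own dataset, however, I get the nan problem. So I would like to ask: for clustering or for binary segmentation, do the lane labels have to be continuous lane lines? In my dataset some lanes are a single instance class but consist of broken dashed segments, as in this screenshot: 2018-07-26 14-34-04

Is this form a problem? Thanks.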

MaybeShewill-CV commented 6 years ago

@sunmiaozju Binary segmentation classifies each pixel individually.

sunmiaozju commented 6 years ago

@MaybeShewill-CV It trains now, thanks!

yiyichun commented 6 years ago

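Hi, @sunmiaozju, @MaybeShewill-CV: Thanks for the explanations. I found that if I pass the "--weights_path" argument during training, accuracy is 0. Without "--weights_path", accuracy is nonzero. Is that expected? Also, at test time the colors of the output image look very strange. What went wrong?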

MaybeShewill-CV commented 6 years ago

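@yiyichun The information you have given is not enough to pinpoint the cause. For example, you can train with the VGG pretrained parameters as the network's initial values, without using the --weights_path argument. Judging from this image, your pixel embedding has failed, which makes the later mean shift clustering fail.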

yiyichun commented 6 years ago

Hi, @MaybeShewill-CV: Thanks for the explanation. I may not have been clear, so let me spell it out.

With this command, accuracy is not 0:
python3 tools/train_lanenet.py --net vgg --dataset_dir data/training_data_example/

With this command, accuracy is 0:
python3 tools/train_lanenet.py --net vgg --dataset_dir data/training_data_example/ --weights_path model/tusimple_lanenet/tusimple_lanenet_vgg_2018-05-21-11-11-03.ckpt-94000

Was "tusimple_lanenet_vgg_2018-05-21-11-11-03.ckpt-94000" trained with the VGG network? Could you provide a .pb file? Thank you.

south-ocean commented 6 years ago

Hello, some epochs produce NaN during my training. How did you solve this? Thanks!

0727 01:57:11.438254 8959 train_lanenet.py:290] Epoch: 80 total_loss= 1.356155 binary_seg_loss= 0.239498 instance_seg_loss= 3.961687 accuracy= 0.144323 mean_cost_time= 1.340399s
I0727 01:57:13.380253 8959 train_lanenet.py:290] Epoch: 81 total_loss= nan binary_seg_loss= 0.229159 instance_seg_loss= nan accuracy= 0.091045 mean_cost_time= 1.296569s
I0727 01:57:15.414632 8959 train_lanenet.py:290] Epoch: 82 total_loss= 1.247565 binary_seg_loss= 0.254058 instance_seg_loss= 3.565749 accuracy= 0.105159 mean_cost_time= 1.392229s
I0727 01:57:17.448236 8959 train_lanenet.py:290] Epoch: 83 total_loss= 1.521073 binary_seg_loss= 0.211007 instance_seg_loss= 4.577894 accuracy= 0.063831 mean_cost_time= 1.382824s
I0727 01:57:19.482539 8959 train_lanenet.py:290] Epoch: 84 total_loss= 1.441414 binary_seg_loss= 0.211770 instance_seg_loss= 4.310584 accuracy= 0.028835 mean_cost_time= 1.390127s
I0727 01:57:21.517747 8959 train_lanenet.py:290] Epoch: 85 total_loss= 1.413747 binary_seg_loss= 0.237343 instance_seg_loss= 4.158689 accuracy= 0.072766 mean_cost_time= 1.388580s
I0727 01:57:23.480197 8959 train_lanenet.py:290] Epoch: 86 total_loss= 1.381895 binary_seg_loss= 0.257384 instance_seg_loss= 4.005753 accuracy= 0.020796 mean_cost_time= 1.394654s
I0727 01:57:25.489401 8959 train_lanenet.py:290] Epoch: 87 total_loss= nan binary_seg_loss= 0.184804 instance_seg_loss= nan accuracy= 0.108197 mean_cost_time= 1.359407s

@sunmiaozju @MaybeShewill-CV
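To see which loss term goes bad in a log like the one above, a small parser can list the nan fields per epoch. This is a minimal sketch that assumes the "Epoch: N name= value ..." layout shown in this comment; the function name is made up for illustration and is not part of the repository.

```python
import re

# One log entry runs from "Epoch: N" up to the next "Epoch:" (or end of text),
# so this works whether entries are on separate lines or pasted as one line.
ENTRY_RE = re.compile(r"Epoch:\s*(\d+)(.*?)(?=Epoch:|\Z)", re.S)
# "name= value" pairs, where the value is either nan or a plain decimal number.
FIELD_RE = re.compile(r"(\w+)=\s*(nan|[-+]?\d+(?:\.\d+)?)")


def nan_fields_by_epoch(log_text):
    """Return {epoch: [field names whose value is nan]} for affected epochs only."""
    result = {}
    for entry in ENTRY_RE.finditer(log_text):
        epoch = int(entry.group(1))
        bad = [name for name, value in FIELD_RE.findall(entry.group(2))
               if value == "nan"]
        if bad:
            result[epoch] = bad
    return result
```

In the log quoted here, binary_seg_loss stays finite in epochs 81 and 87 while instance_seg_loss and total_loss are nan, which suggests looking at the instance (embedding) loss first.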

MaybeShewill-CV commented 6 years ago

@yiyichun You can generate the pb file yourself with the frozen graph script.

MaybeShewill-CV commented 6 years ago

@south-ocean The most likely cause is that your label pixel values are wrong. Check whether the label pixel values are correct first.

MaybeShewill-CV commented 6 years ago

@south-ocean Is the background 20?

MaybeShewill-CV commented 6 years ago

@south-ocean Yes, that is how mine is set up.

south-ocean commented 6 years ago

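My binary images are currently single-channel, with background 0 and lane lines 255. The instance segmentation images are also single-channel, with background 0 and the lanes set to 20, 70, 120, 170, 220. A dataset like this should be fine, right? Thanks @MaybeShewill-CV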

MaybeShewill-CV commented 6 years ago

@south-ocean Yes, my dataset is set up this way.
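The label layout described in this exchange (binary: 0 and 255; instance: 0, 20, 70, 120, 170, 220) can be checked before training. The sketch below is an illustration only: it assumes each label is already decoded into a single-channel 2-D grid of integers (rows of pixel values), and the helper names are hypothetical.

```python
# Allowed pixel values, taken from the layout described in this thread.
BINARY_VALUES = {0, 255}
INSTANCE_VALUES = {0, 20, 70, 120, 170, 220}


def unexpected_values(label, allowed):
    """Return the sorted pixel values found in a 2-D label that are not allowed."""
    seen = {int(value) for row in label for value in row}
    return sorted(seen - set(allowed))


def check_label_pair(binary_label, instance_label):
    """Return a list of problem descriptions; an empty list means no issue found."""
    problems = []
    bad = unexpected_values(binary_label, BINARY_VALUES)
    if bad:
        problems.append("binary label has unexpected values: %s" % bad)
    bad = unexpected_values(instance_label, INSTANCE_VALUES)
    if bad:
        problems.append("instance label has unexpected values: %s" % bad)
    return problems
```

Stray values such as 1 or 254 may come from lossy compression or from resizing with interpolation, so a check like this is worth running on the files as they are actually loaded.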

south-ocean commented 6 years ago

OK, I'll take another look. Thanks! @MaybeShewill-CV

MaybeShewill-CV commented 6 years ago

@south-ocean OK, feel free to share your results.

liuyangly25 commented 6 years ago

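@MaybeShewill-CV My data also produced nan. The cause is that an empty label (intersections, parking lots, dense traffic, etc.) produces nan. So does the training set have to contain lane lines? If an image has no lane lines, how should the network be modified?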

MaybeShewill-CV commented 6 years ago

@liuyangly25 Check the label before feeding it to the network, and skip the image if it has no lane lines.
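The check suggested here could look roughly like the following. It is only a sketch: it assumes the binary label is a 2-D grid of integers with non-zero lane pixels, and the `samples` structure (pairs of path and label) is a hypothetical stand-in for however the data is actually loaded.

```python
def has_lane_pixels(binary_label):
    """True if at least one pixel of the 2-D binary label is non-zero."""
    return any(value != 0 for row in binary_label for value in row)


def drop_empty_samples(samples):
    """Split (path, binary_label) pairs into kept paths and skipped paths.

    A sample is skipped when its binary label contains no lane pixels.
    """
    kept, skipped = [], []
    for path, binary_label in samples:
        if has_lane_pixels(binary_label):
            kept.append(path)
        else:
            skipped.append(path)
    return kept, skipped
```

Returning the skipped paths as well makes it easy to see how many intersection or parking-lot frames were dropped from the training list.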

liuyangly25 commented 6 years ago

@MaybeShewill-CV In that case, a test image without any lines would still produce lines. Can the network tell by itself whether lines are present?

MaybeShewill-CV commented 6 years ago

@liuyangly25 I haven't tested that.

mukaman84 commented 6 years ago

I found what causes the nan. It comes from tf.norm in the discriminative_loss_single function. When an input at or near 0 is passed to tf.norm, the derivative of tf.norm goes to inf. So I replaced tf.norm with tf.reduce_sum(tf.square()), and training now works. Since tf.reduce_sum(tf.square()) is not identical to tf.norm (it is the squared norm), the cutoff variance distance and cutoff cluster distance parameters should also be changed. How much depends on the distribution of the dataset.
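The effect described here can be reproduced with plain Python, without TensorFlow. The gradient of the L2 norm is x / ||x||, which is 0/0 at x = 0, while the gradient of the sum of squares is 2x and stays finite. The sketch below is illustrative only; the epsilon variant is an alternative that is not mentioned in this thread, and all function names are made up.

```python
import math


def norm_grad(x):
    """Analytic gradient of ||x||_2, i.e. x / ||x||.

    At x = 0 this is 0/0; nan is returned to mirror what floating point
    arithmetic gives for that form.
    """
    n = math.sqrt(sum(v * v for v in x))
    if n == 0.0:
        return [float("nan")] * len(x)
    return [v / n for v in x]


def squared_norm_grad(x):
    """Gradient of sum(x_i ** 2), which is 2 * x and finite everywhere."""
    return [2.0 * v for v in x]


def eps_norm_grad(x, eps=1e-12):
    """Gradient of sqrt(sum(x_i ** 2) + eps), finite at x = 0 (assumed alternative)."""
    n = math.sqrt(sum(v * v for v in x) + eps)
    return [v / n for v in x]
```

Note that the squared form changes the scale of the distance: a distance of 0.5 becomes 0.25 and a distance of 3 becomes 9, which is why thresholds tuned for the norm do not carry over unchanged.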

stubbornstubborn commented 6 years ago

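I also hit the nan problem during training, and training stopped. After I lowered the learning rate by another factor of 10, training ran normally!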

MaybeShewill-CV commented 6 years ago

@mukaman84 I'll check whether this is what causes the problem. Thanks for sharing it with us.

yiyichun commented 6 years ago

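@ding-hai-tao, I did the same as you: after lowering the learning rate I could train. But after a certain number of epochs nan still shows up, and accuracy stays only between 0.5 and 0.6. Do you see this too? Thanks!!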

stubbornstubborn commented 6 years ago

@yiyichun So far I have only trained for 100,000 iterations, with no nan, and accuracy on the validation set is 0.98.

yiyichun commented 6 years ago

@ding-hai-tao, how many images are in one epoch for you? Did you change the author's code? At test time I get the result below. What is the problem? Thanks!

stubbornstubborn commented 6 years ago

@yiyichun I did not change the author's code. Could the dataset be different?

yiyichun commented 6 years ago

Hi, @ding-hai-tao: Does your output look like the image above? My training set is the TuSimple dataset, using the images listed in label_data_0313.json, label_data_0531.json, and label_data_0601.json. Thanks!

stubbornstubborn commented 6 years ago

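@yiyichun I also use the tusimple dataset. After lowering the learning rate I got no nan and training is normal. I checked the loss, and val accuracy holds at a bit over 0.98. Does your hnet network work?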

yiyichun commented 6 years ago

@ding-hai-tao, thanks for your reply :) What does your output look like? I suspect something is wrong with how I generate gt_image_instance and gt_image_binary. How do you generate these two files?

stubbornstubborn commented 6 years ago

@yiyichun I generate those two files with OpenCV.

yiyichun commented 6 years ago

@MaybeShewill-CV, @ding-hai-tao, I got it working, thank you :) Training is still running, but I checked the result after 10000 iterations and the earlier problems are gone. Are there other training datasets besides tusimple and culane?

MaybeShewill-CV commented 6 years ago

@yiyichun You're welcome. What was preventing you from training before?

yiyichun commented 6 years ago

@MaybeShewill-CV, reading gt_image_instance raised a dimension error. I fixed the dimensions so that training would run, but the trained results looked wrong.

cardwing commented 6 years ago

@yiyichun, you can train your model on the BDD100K dataset. It provides 100k video frames for training a lane detection model. However, it does not come with basic training and testing code, and the number of lanes is not fixed.

cardwing commented 6 years ago

@yiyichun, also, since you mentioned that you have trained your model on the CULane dataset, I wonder how well it performs (e.g., F1 measure per road category and in total). I would like to know the model's performance so that I can decide whether to use this code. Thanks a lot! ^_^

stubbornstubborn commented 6 years ago

@yiyichun @MaybeShewill-CV After lowering the learning rate by a factor of ten, I found that the trained results are not good.

Is this caused by overfitting?

cardwing commented 6 years ago

@ding-hai-tao , you should use data augmentation like random cropping to prevent over-fitting.
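For segmentation data, a random crop has to use the same window for the image and for both label maps, otherwise the labels no longer line up with the pixels. A minimal sketch under that assumption follows; it treats each input as a list of rows of equal size, and the function name and signature are hypothetical rather than taken from the repository.

```python
import random


def paired_random_crop(image, binary_label, instance_label, crop_h, crop_w, rng=random):
    """Cut the same random crop_h x crop_w window out of all three inputs.

    Each input is a sequence of rows; all three must share height and width.
    `rng` is anything with a randint(a, b) method, e.g. random.Random(seed).
    """
    height, width = len(image), len(image[0])
    if crop_h > height or crop_w > width:
        raise ValueError("crop size exceeds image size")
    # randint is inclusive on both ends, so the window always fits.
    top = rng.randint(0, height - crop_h)
    left = rng.randint(0, width - crop_w)

    def crop(rows):
        return [row[left:left + crop_w] for row in rows[top:top + crop_h]]

    return crop(image), crop(binary_label), crop(instance_label)
```

A crop can end up containing no lane pixels at all, which earlier comments in this thread link to nan, so it may be worth checking the cropped binary label before using the sample.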

cardwing commented 6 years ago

@ding-hai-tao, also, I find the decoder part questionable: it uses a 16 x 16 deconvolution layer, and such a large kernel is hard to learn. You may need to modify the decoder part.

stubbornstubborn commented 6 years ago

@yiyichun Are your training results good?

stubbornstubborn commented 6 years ago

@cardwing How are the training results after your modification?

stubbornstubborn commented 6 years ago

@MaybeShewill-CV What kind of images cause exploding gradients during training?

The images from when my training hit exploding gradients: nan_binary_label, nan_embedding, nan_image, nan_instance_label

What would cause this?

cardwing commented 6 years ago

@ding-hai-tao, do you run into the situation where cost, instance cost, and binary cost are nan?

MaybeShewill-CV / lanenet-lane-detection

loss = nan problem #33