ERROR - cost is :nan - Githubissues

743341 commented 6 years ago

Hi, thanks for sharing your code. I ran into the error below while training the net on the CULane dataset, starting from your pre-trained model. Could you tell me what I might have done wrong? Thanks. This is the command I used to start training:
python tools/train_lanenet.py --net vgg --dataset_dir data/training_data/ --weights_path model/tusimple_lanenet/tusimple_lanenet_vgg_2018-05-21-11-11-03.ckpt-94000
I am using Miniconda, and the environment is Python 3.6.2.

Thank you. Best regards, Wzx

743341 commented 6 years ago

The details of the mistake:
0606 10:56:44.363587 33657 train_lanenet.py:155] Global configuration is as follows:
I0606 10:56:44.363887 33657 train_lanenet.py:156] {'TRAIN': {'EPOCHS': 2000, 'DISPLAY_STEP': 1, 'TEST_DISPLAY_STEP': 1000, 'MOMENTUM': 0.9, 'LEARNING_RATE': 1e-05, 'GPU_MEMORY_FRACTION': 0.85, 'TF_ALLOW_GROWTH': True, 'BATCH_SIZE': 10, 'VAL_BATCH_SIZE': 10, 'LR_DECAY_STEPS': 210000, 'LR_DECAY_RATE': 0.1, 'CLASSES_NUMS': 2, 'IMG_HEIGHT': 256, 'IMG_WIDTH': 512}, 'TEST': {'GPU_MEMORY_FRACTION': 0.8, 'TF_ALLOW_GROWTH': True, 'BATCH_SIZE': 32}}
I0606 10:56:44.799321 33657 train_lanenet.py:168] Restore model from last model checkpoint model/tusimple_lanenet/tusimple_lanenet_vgg_2018-05-21-11-11-03.ckpt-94000
INFO:tensorflow:Restoring parameters from model/tusimple_lanenet/tusimple_lanenet_vgg_2018-05-21-11-11-03.ckpt-94000
I0606 10:56:44.799610 33657 tf_logging.py:82] Restoring parameters from model/tusimple_lanenet/tusimple_lanenet_vgg_2018-05-21-11-11-03.ckpt-94000
E0606 10:57:37.360645 33657 train_lanenet.py:226] cost is: nan
E0606 10:57:37.360998 33657 train_lanenet.py:227] binary cost is: 1.76500
E0606 10:57:37.361078 33657 train_lanenet.py:228] instance cost is: nan

MaybeShewill-CV commented 6 years ago

@743341 Maybe you should check whether your dataset was prepared correctly. Then you could try decreasing the learning rate or training the model from scratch. You are welcome to share your solution here if you solve the problem.

ygren commented 6 years ago

@MaybeShewill-CV I ran into the same problem: the instance cost is nan. I also found that your training files do not make good use of GPU resources.

MaybeShewill-CV commented 6 years ago

@YgRen You are welcome to contribute your data feeding pipeline if it is more efficient. Thanks a lot.

jess2422 commented 6 years ago

I've also found this problem; it arises after running a different number of batches each time I try, but it still works on some batches, which makes me think I did set up the training data correctly. I tried using both the pre-trained model and training the model from scratch. Has anyone found a solution?

edit: The batches that cause this error are the ones that do not contain any lane markings. Would it be possible to make the cost function more robust to accommodate these cases?
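To illustrate why a batch without lane markings can produce nan, here is a minimal sketch of the variance ("pull") term of a discriminative instance loss. It is a simplified stand-in and not the repository's actual loss: the function name `pull_loss`, the 1-D embeddings, the `delta_v` margin, and the convention that label 0 means background are all assumptions made for the example. When no lane instance is present, the average over lanes is 0/0, which tensor libraries evaluate as nan; a guard that returns 0 for that case avoids it.

```python
import math


def pull_loss(embeddings, instance_ids, delta_v=0.5, guard=True):
    """Simplified variance ('pull') term of a discriminative loss on 1-D embeddings."""
    # Group embedding values by lane id; id 0 is assumed to mean background.
    lanes = {}
    for value, lane_id in zip(embeddings, instance_ids):
        if lane_id != 0:
            lanes.setdefault(lane_id, []).append(value)

    if not lanes:
        # Tensor libraries evaluate 0.0 / 0 as NaN. Plain Python would raise
        # instead, so the unguarded branch returns NaN explicitly to mimic that.
        return 0.0 if guard else math.nan

    per_lane = []
    for values in lanes.values():
        center = sum(values) / len(values)
        # Penalise only pixels farther than delta_v from their lane centre.
        hinged = [max(0.0, abs(v - center) - delta_v) ** 2 for v in values]
        per_lane.append(sum(hinged) / len(hinged))
    return sum(per_lane) / len(per_lane)


with_lanes = pull_loss([0.0, 2.0, 5.0, 5.0], [1, 1, 2, 2])
empty_naive = pull_loss([0.1, 0.2], [0, 0], guard=False)
empty_guarded = pull_loss([0.1, 0.2], [0, 0])
print(with_lanes, empty_naive, empty_guarded)
```

An alternative to guarding the loss is to filter out samples whose instance label is entirely background before they reach the training loop.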

yiyichun commented 6 years ago

Hello, how can I solve the nan problem? I ran into the same issue. Thank you :)

MaybeShewill-CV commented 6 years ago

@yiyichun In my opinion, the main cause of the problem is the training labels. You'd better check your training labels.

yiyichun commented 6 years ago

Hello @jess2422, @YgRen, @743341: were you able to solve the "nan" problem? @MaybeShewill-CV: Is the json file only used when generating the binary segmentation image and the instance segmentation image? Is it also used during training?

thank you:)

MaybeShewill-CV commented 6 years ago

@yiyichun Once you have generated your training dataset, the json file is no longer needed.

jess2422 commented 6 years ago

@yiyichun No, I have not been able to solve it, sorry. I would also appreciate knowing if anyone did.

wangqingyang105 commented 6 years ago

I just made sure that the label files are in PNG format, and that solved the problem.
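A quick way to verify this is to check the file contents rather than the extension. The sketch below is only an illustration: the helper names `is_png` and `find_non_png_labels` and the throwaway file names are hypothetical, and the thread does not say why PNG labels fix the error (one plausible reason is that lossy formats such as JPEG alter label pixel values). It relies only on the fact that every PNG file starts with the same 8-byte signature.

```python
import os
import tempfile

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # first 8 bytes of every PNG file


def is_png(path):
    """Return True if the file content starts with the PNG signature."""
    with open(path, "rb") as handle:
        return handle.read(len(PNG_SIGNATURE)) == PNG_SIGNATURE


def find_non_png_labels(label_dir):
    """List files under label_dir whose content is not PNG, whatever their extension."""
    bad = []
    for root, _dirs, files in os.walk(label_dir):
        for name in files:
            path = os.path.join(root, name)
            if not is_png(path):
                bad.append(os.path.relpath(path, label_dir))
    return sorted(bad)


# Small self-check with throwaway files (hypothetical names, not real labels).
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "0001.png"), "wb") as f:
        f.write(PNG_SIGNATURE + b"rest-of-file")
    with open(os.path.join(tmp, "0002.png"), "wb") as f:
        f.write(b"\xff\xd8\xff\xe0 JPEG data saved with a .png name")
    non_png = find_non_png_labels(tmp)

print(non_png)
```

To use it on real data, point `find_non_png_labels` at the directories holding the binary and instance label images and re-export any file it reports.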

yiyichun commented 6 years ago

Hi @wangqingyang105, thank you for your reply. Now I can train :) But I still have a problem: the accuracy value is always 0 or very close to 0. Do you have this problem too?

SmileRay commented 6 years ago

@yiyichun
I also ran into this problem: the accuracy value is always 0 or very close to 0. I don't know how to solve it. How are you doing now?

wangqingyang105 commented 6 years ago

@yiyichun @SmileRay I also hit this problem when I tried to train the model with dense or vgg without any weights. When I trained the model with vgg and the pretrained weights, it worked. However, while the LaneNet output is quite good, the H-Net output is not: it cannot detect the lanes from the binary output, so I tried outputting the instance image and using segmentation to detect the lane markers.

yiyichun commented 6 years ago

Hi @wangqingyang105, @SmileRay: thank you for your reply. When I train without the "--weights_path" parameter, the accuracy has a nonzero value. When I use the "--weights_path" parameter, the accuracy is 0. Moreover, the results have become very strange. Have you seen anything like this?

wangqingyang105 commented 6 years ago

@yiyichun I don't have that problem. My problem is that after training on my dataset, it still cannot detect the lanes correctly; sometimes it detects only one lane.

MaybeShewill-CV / lanenet-lane-detection

ERROR - cost is :nan #4