dscha09 closed this issue 6 years ago.
@chaine09 I recently updated the code. You can pull the new code and test the training process again to see if these problems still exist.
Hi @MaybeShewill-CV, will this work even if I only have 5 images as my training data?
@chaine09 The training process will work, but you have to reset your batch size to something smaller than 5.
Hi @MaybeShewill-CV, I got this error upon retraining the model:
RecursionError: maximum recursion depth exceeded in comparison
@chaine09 That may be caused by improper dataset preparation. Could you please show me how you prepared your dataset, including your dataset folder structure and your train.txt file?
These are the contents of my train.txt file:
/Users/cvsanbuenaventura/Documents/lanenet-lane-detection-master-retrain/data/training_data_example/image/0000.png /Users/cvsanbuenaventura/Documents/lanenet-lane-detection-master-retrain/data/training_data_example/gt_image_binary/0000.png /Users/cvsanbuenaventura/Documents/lanenet-lane-detection-master-retrain/data/training_data_example/gt_image_instance/0000.png
/Users/cvsanbuenaventura/Documents/lanenet-lane-detection-master-retrain/data/training_data_example/image/0001.png /Users/cvsanbuenaventura/Documents/lanenet-lane-detection-master-retrain/data/training_data_example/gt_image_binary/0001.png /Users/cvsanbuenaventura/Documents/lanenet-lane-detection-master-retrain/data/training_data_example/gt_image_instancee/0001.png
/Users/cvsanbuenaventura/Documents/lanenet-lane-detection-master-retrain/data/training_data_example/image/0002.png /Users/cvsanbuenaventura/Documents/lanenet-lane-detection-master-retrain/data/training_data_example/gt_image_binary/0002.png /Users/cvsanbuenaventura/Documents/lanenet-lane-detection-master-retrain/data/training_data_example/gt_image_instance/0002.png
/Users/cvsanbuenaventura/Documents/lanenet-lane-detection-master-retrain/data/training_data_example/image/0003.png /Users/cvsanbuenaventura/Documents/lanenet-lane-detection-master-retrain/data/training_data_example/gt_image_binary/0003.png /Users/cvsanbuenaventura/Documents/lanenet-lane-detection-master-retrain/data/training_data_example/gt_image_instance/0003.png
I'm on macOS.
@MaybeShewill-CV How would you generate the images in the gt_image_instance and gt_image_binary folders?
For now, I just used the existing 5 images you have for training and validation and didn't change the training folder structure. But I modified val.txt and train.txt accordingly.
@MaybeShewill-CV I used the existing 5 images in your repo. I just tested whether I can retrain the model.
@chaine09 You should change the batch size in the config file, because the batch size is larger than the total number of your training images. Another way to test is to copy the training examples several times and change the file names.
Hi @MaybeShewill-CV, the training batch size is 32 and the test batch size is 4 in the global_config.py file. Which one should I change to 5?
@chaine09 Did you pull the new code? The train batch size is 8 according to the new config file.
@MaybeShewill-CV Oh yes, in the new code both train and test batch sizes are 8. Which batch size should I change to 5, the train or the test one?
@chaine09 The train batch size, and in my opinion the best way to test is to copy the example images several times. Good luck :)
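Something along these lines should do the trick (a minimal sketch only; the folder layout and the 0000.png to 0003.png names are assumptions taken from the examples quoted in this thread, so adjust paths and names to your own setup):
# Sketch: duplicate the example images several times and append the new
# examples to train.txt, so the training set becomes larger than the batch size.
import os
import shutil

root = 'data/training_data_example'
folders = ['image', 'gt_image_binary', 'gt_image_instance']
originals = ['0000.png', '0001.png', '0002.png', '0003.png']

# Note: make sure train.txt already ends with a newline before appending.
with open(os.path.join(root, 'train.txt'), 'a') as txt:
    for copy_idx in range(1, 4):  # three extra copies of every example
        for name in originals:
            new_name = '{:04d}.png'.format(copy_idx * 100 + int(name[:4]))
            for folder in folders:
                shutil.copy(os.path.join(root, folder, name),
                            os.path.join(root, folder, new_name))
            # one line per example: image path, binary label path, instance label path
            txt.write(' '.join(os.path.abspath(os.path.join(root, f, new_name))
                               for f in folders) + '\n')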
Hi @MaybeShewill-CV, I already changed the train batch size to 5 without changing the number of training images (the original 4 images), but I still get this error:
cv2.error: OpenCV(3.4.2) /Users/travis/build/skvark/opencv-python/opencv/modules/imgproc/src/resize.cpp:4044: error: (-215:Assertion failed) !ssize.empty() in function 'resize'
@chaine09 First, check if the image paths are correct. Second, like I said before, make sure your batch size is smaller than the total number of your training examples. You can read the data provider code for details; it is quite simple.
Hi @MaybeShewill-CV, I checked the three Python files in the data_provider folder, namely data_process.py, lanenet_data_processor.py and lanenet_hnet_data_processor.py.
For data_processor.py, I found this line of code:
val = DataSet('/home/baidu/DataBase/Semantic_Segmentation/TUSimple_Lane_Detection/training/train.txt')
Similarly, for lanenet_data_processor.py:
val = DataSet('/home/baidu/DataBase/Semantic_Segmentation/Kitti_Vision/data_road/lanenet_training/train.txt')
And lastly, for lanenet_hnet_data_processor.py:
json_file_list = glob.glob('{:s}/*.json'.format('/media/baidu/Data/Semantic_Segmentation' '/TUSimple_Lane_Detection/training'))
These are all file paths that need to be modified accordingly. They are three different file paths on your local machine, but they all refer to the same file name, train.txt. Are all of these referring to the train.txt file inside the /data/training_data_example folder?
@chaine09 The data processor for the model is lanenet_data_processor.py, and you can see that I only import that file in my training script. When you train your model, all you need to do is pass the folder path that contains the train.txt file to the trainer. You can use python tools/train_lanenet.py --help for help.
Hi @MaybeShewill-CV, I wanted to test whether the code works for other training datasets (I used the Cityscapes instance dataset and extracted only one class from it, car in my case). However, it turns out that both the binary_loss and instance_loss reach nan, so training is terminated. I investigated and found that all the neural layers reach nan after several steps (including the embedding layer, the decoding layers, etc.). I checked the nan image and the label, and they look fine to me. Another interesting phenomenon is that the error always shows up during the validation part of the training process (I use tf.Print to check real-time values, including tensors like the input of the decoder, mu, and all the losses; it turned out that when the error shows up, they all become nan). Have you encountered a similar case before? Thanks a lot.
@HanqingXu Sorry, I have not tested that model on another dataset set up for a different task ==!
Hi @MaybeShewill-CV, I already changed this line in lanenet_data_processor.py accordingly:
val = DataSet('/home/baidu/DataBase/Semantic_Segmentation/Kitti_Vision/data_road/lanenet_training/train.txt')
Then I changed the train batch_size to 2 and the test batch_size to 1, since I have 4 training images and only 1 for validation (is validation in this case the same as test, or did you split the 4 images further?).
However, I still get this error:
RecursionError: maximum recursion depth exceeded in comparison
Did I do it correctly?
@chaine09 You did not change it in the right way. Do not change the lanenet_data_processor.py file; only change the batch size in the config.py file and make sure the batch size is smaller than the total number of your training examples. (By the way, the total number of your training examples equals the number of lines in your train.txt file.) :)
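As a quick sanity check (a minimal sketch; adjust the train.txt path and the batch size to your own values):
# The train batch size must not exceed the number of training examples,
# i.e. the number of non-empty lines in train.txt.
train_txt = 'data/training_data_example/train.txt'  # adjust to your own path
batch_size = 2                                       # the value you set in config/global_config.py

with open(train_txt, 'r') as f:
    num_examples = sum(1 for line in f if line.strip())

print('training examples:', num_examples, '| batch size:', batch_size)
assert batch_size < num_examples, 'batch size is not smaller than the training set'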
@MaybeShewill-CV Should I not change val in lanenet_data_processor.py?
from
val = DataSet('/home/baidu/DataBase/Semantic_Segmentation/Kitti_Vision/data_road/lanenet_training/train.txt')
to
val = DataSet('/Users/cvsanbuenaventura/Documents/lanenet-lane-detection-master/data/training_data_example/train.txt')
Should I not do this?
Then here are the contents of my train.txt:
/Users/cvsanbuenaventura/Documents/lanenet-lane-detection-master/data/training_data_example/image/0000.png /Users/cvsanbuenaventura/Documents/lanenet-lane-detection-master/data/training_data_example/gt_image_binary/0000.png /Users/cvsanbuenaventura/Documents/lanenet-lane-detection-master/data/training_data_example/gt_image_instance/0000.png
/Users/cvsanbuenaventura/Documents/lanenet-lane-detection-master/data/training_data_example/image/0001.png /Users/cvsanbuenaventura/Documents/lanenet-lane-detection-master/data/training_data_example/gt_image_binary/0001.png /Users/cvsanbuenaventura/Documents/lanenet-lane-detection-master/data/training_data_example/gt_image_instancee/0001.png
/Users/cvsanbuenaventura/Documents/lanenet-lane-detection-master/data/training_data_example/image/0002.png /Users/cvsanbuenaventura/Documents/lanenet-lane-detection-master/data/training_data_example/gt_image_binary/0002.png /Users/cvsanbuenaventura/Documents/lanenet-lane-detection-master/data/training_data_example/gt_image_instance/0002.png
/Users/cvsanbuenaventura/Documents/lanenet-lane-detection-master/data/training_data_example/image/0003.png /Users/cvsanbuenaventura/Documents/lanenet-lane-detection-master/data/training_data_example/gt_image_binary/0003.png /Users/cvsanbuenaventura/Documents/lanenet-lane-detection-master/data/training_data_example/gt_image_instance/0003.png
I modified these paths accordingly.
@chaine09 You are not supposed to change the code in the lanenet_data_processor.py file.
@MaybeShewill-CV What about train.txt? So I have 4 training examples in this case?
@chaine09 The train.txt seems to be correct; make sure the three file paths on the same line are separated by a blank space. Adjust your batch size to 2 or 3, and then you can start training the model.
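A quick check like the one below (a minimal sketch, assuming OpenCV is installed and the paths in train.txt are absolute) will catch missing or unreadable images before they reach cv2.resize and trigger the !ssize.empty() assertion:
import cv2

train_txt = 'data/training_data_example/train.txt'  # adjust to your own path

with open(train_txt, 'r') as f:
    for line_no, line in enumerate(f, start=1):
        paths = line.split()
        # every line should hold exactly three blank-separated paths:
        # source image, binary ground truth, instance ground truth
        if len(paths) != 3:
            print('line {:d}: expected 3 paths, got {:d}'.format(line_no, len(paths)))
            continue
        for path in paths:
            # cv2.imread returns None for a missing or unreadable file
            if cv2.imread(path) is None:
                print('line {:d}: cannot read {:s}'.format(line_no, path))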
@MaybeShewill-CV I also downloaded vgg16.npy and placed it inside the data folder by issuing this command:
wget ftp://mi.eng.cam.ac.uk/pub/mttt2/models/vgg16.npy
Is it correct that the config file you are referring to is global_config.py inside the config folder?
Here are the contents of my global_config.py file:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# @Time : 18-1-31 11:21 AM
# @Author : Luo Yao
# @Site : http://icode.baidu.com/repos/baidu/personal-code/Luoyao
# @File : global_config.py
# @IDE: PyCharm Community Edition
"""
设置全局变量
"""
from easydict import EasyDict as edict
__C = edict()
# Consumers can get config by: from config import cfg
cfg = __C
# Train options
__C.TRAIN = edict()
# Set the shadownet training epochs
__C.TRAIN.EPOCHS = 200010
# Set the display step
__C.TRAIN.DISPLAY_STEP = 1
# Set the test display step during training process
__C.TRAIN.TEST_DISPLAY_STEP = 1000
# Set the momentum parameter of the optimizer
__C.TRAIN.MOMENTUM = 0.9
# Set the initial learning rate
__C.TRAIN.LEARNING_RATE = 0.0005
# Set the GPU resource used during training process
__C.TRAIN.GPU_MEMORY_FRACTION = 0.85
# Set the GPU allow growth parameter during tensorflow training process
__C.TRAIN.TF_ALLOW_GROWTH = True
# Set the shadownet training batch size
__C.TRAIN.BATCH_SIZE = 2 # changed 8 to 2
# Set the shadownet validation batch size
__C.TRAIN.VAL_BATCH_SIZE = 8
# Set the learning rate decay steps
__C.TRAIN.LR_DECAY_STEPS = 410000
# Set the learning rate decay rate
__C.TRAIN.LR_DECAY_RATE = 0.1
# Set the class numbers
__C.TRAIN.CLASSES_NUMS = 2
# Set the image height
__C.TRAIN.IMG_HEIGHT = 256
# Set the image width
__C.TRAIN.IMG_WIDTH = 512
# Test options
__C.TEST = edict()
# Set the GPU resource used during testing process
__C.TEST.GPU_MEMORY_FRACTION = 0.8
# Set the GPU allow growth parameter during tensorflow testing process
__C.TEST.TF_ALLOW_GROWTH = True
# Set the test batch size
__C.TEST.BATCH_SIZE = 1
Here is the error I'm getting:
File "/Users/cvsanbuenaventura/Documents/lanenet-lane-detection-master/data_provider/lanenet_data_processor.py", line 93, in next_batch
    self._random_dataset()
File "/Users/cvsanbuenaventura/Documents/lanenet-lane-detection-master/data_provider/lanenet_data_processor.py", line 66, in _random_dataset
    random_idx = np.random.permutation(len(self._gt_img_list))
File "mtrand.pyx", line 4907, in mtrand.RandomState.permutation
File "mtrand.pyx", line 4824, in mtrand.RandomState.shuffle
File "/Users/cvsanbuenaventura/miniconda3/envs/tensorflow_orig/lib/python3.5/site-packages/numpy/core/_internal.py", line 254, in init
    if self._arr.ndim == 0:
RecursionError: maximum recursion depth exceeded in comparison
@chaine09 Add a breakpoint in the train script and see if the batch size was correctly passed.
@MaybeShewill-CV How do I add a breakpoint and what do you mean by "train script"?
@chaine09 About breakpoints, you can google how to use an IDE to debug. The train script means the train_lanenet.py file.
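If you don't want to set up an IDE, Python's built-in pdb works too (a minimal sketch, not part of the repository's code; where exactly you place the break inside tools/train_lanenet.py is up to you):
# Drop this line into tools/train_lanenet.py just before the training loop,
# then inspect the value at the (Pdb) prompt, e.g. with: p CFG.TRAIN.BATCH_SIZE
import pdb; pdb.set_trace()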
@MaybeShewill-CV Isn't the value for the training batch size specified in the global_config.py file?
You mean I need to run and debug the train_lanenet.py script line by line? How would I know whether the batch size was correctly passed?
@chaine09 Since I do not know how you use the model, you need to check with the debugger whether the batch size parameter value is correct.
2018-10-31 23:12:10.135633: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
I1031 23:12:12.052288 16525 train_lanenet.py:163] Global configuration is as follows:
I1031 23:12:12.053390 16525 train_lanenet.py:164] {'TEST': {'TF_ALLOW_GROWTH': True, 'GPU_MEMORY_FRACTION': 0.8, 'BATCH_SIZE': 1}, 'TRAIN': {'TEST_DISPLAY_STEP': 1000, 'CLASSES_NUMS': 2, 'EPOCHS': 200010, 'VAL_BATCH_SIZE': 8, 'LR_DECAY_STEPS': 410000, 'IMG_WIDTH': 512, 'DISPLAY_STEP': 1, 'GPU_MEMORY_FRACTION': 0.85, 'LEARNING_RATE': 0.0005, 'MOMENTUM': 0.9, 'LR_DECAY_RATE': 0.1, 'IMG_HEIGHT': 256, 'TF_ALLOW_GROWTH': True, 'BATCH_SIZE': 2}}
From this, I think the train batch size passed is 2.
@chaine09 It seems you have the right batch size, so I have no idea about this. Maybe you should recheck your training procedure, or you can test the data provider alone. The implementation of the data provider is quite simple; I think you can figure it out by yourself. Everything you described was tested and worked correctly yesterday on my machine :)
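For example, something like this (a minimal sketch, run from the repository root; it assumes the data_provider folder is importable as a package and that the DataSet class and its next_batch() method behave as in the snippets quoted earlier in this thread):
# Exercise the data provider on its own, without building the network.
from data_provider import lanenet_data_processor

dataset = lanenet_data_processor.DataSet('data/training_data_example/train.txt')
gt_imgs, binary_gt_labels, instance_gt_labels = dataset.next_batch(2)
print(len(gt_imgs), len(binary_gt_labels), len(instance_gt_labels))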
Hi @MaybeShewill-CV, now I'm getting an error when testing the model, which I did not get with your previous code.
I used this:
# from /lanenet-lane-detection-master
python tools/test_lanenet.py --is_batch False --batch_size 1 \
--weights_path /Users/cvsanbuenavts/lanenet-lane-detection-master/model/tusimple_lanenet/tusimple_lanenet_vgg_2018-10-19-13-33-56.ckpt-200000.data-00000-of-00001 \
--image_path data/tusimple_test_image/0.jpg
Then I get this error:
DataLossError (see above for traceback): Unable to open table file /Users/cvsanbuenaventura/Documents/lanenet-lane-detection-master/model/tusimple_lanenet/tusimple_lanenet_vgg_2018-10-19-13-33-56.ckpt-200000.data-00000-of-00001: Data loss: not an sstable (bad magic number): perhaps your file is in a different file format and you need to use a different restore operator? [[Node: save/RestoreV2 = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, ..., DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]
Does this have something to do with the weights_path I specified? I copied the saved model from https://www.dropbox.com/sh/tnsf0lw6psszvy4/AAA81r53jpUI3wLsRW6TiPCya?dl=0 and put it inside /model/tusimple_lanenet.
When I used your old code, I was able to successfully generate the 3 output images.
@chaine09 It seems that you are not familiar with tensorflow ==! The weights_path should be --weights_path /Users/cvsanbuenavts/lanenet-lane-detection-master/model/tusimple_lanenet/tusimple_lanenet_vgg_2018-10-19-13-33-56.ckpt-200000 (a TensorFlow checkpoint is stored as several files with .index, .meta and .data-00000-of-00001 suffixes, and the saver restores from the common prefix, not from the .data file itself).
Hello @MaybeShewill-CV! Please disregard the last issue I asked about (testing the model). It was caused by a problem in bash: sometimes when I edit or paste commands in bash, parts of them overlap or get pasted twice. I already fixed it. Thanks! :)
For the training, I have to double-check the entire process again.
@chaine09 Yep, you'd better check the process again. I have tested it several times on my computer and nothing went wrong :)
Hi @MaybeShewill-CV, I did what you suggested before, which is to make multiple copies of the 5 training images in the image, gt_image_instance, and gt_image_binary folders. I also updated the train.txt file and added the copies. I now have a total of 18 images for training and still 1 image for validation.
I entered this in bash inside the /lanenet-lane-detection-master folder:
python tools/train_lanenet.py --net vgg --dataset_dir data/training_data_example/
Then here are the details of the training:
2018-11-02 15:27:49.233191: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
I1102 15:27:49.842538 2106 train_lanenet.py:163] Global configuration is as follows:
I1102 15:27:49.843230 2106 train_lanenet.py:164] {'TRAIN': {'TEST_DISPLAY_STEP': 1000, 'EPOCHS': 200010, 'IMG_WIDTH': 512, 'IMG_HEIGHT': 256, 'LEARNING_RATE': 0.0005, 'DISPLAY_STEP': 1, 'LR_DECAY_RATE': 0.1, 'CLASSES_NUMS': 2, 'GPU_MEMORY_FRACTION': 0.85, 'VAL_BATCH_SIZE': 1, 'TF_ALLOW_GROWTH': True, 'BATCH_SIZE': 2, 'LR_DECAY_STEPS': 410000, 'MOMENTUM': 0.9}, 'TEST': {'TF_ALLOW_GROWTH': True, 'BATCH_SIZE': 1, 'GPU_MEMORY_FRACTION': 0.8}}
I1102 15:27:52.105909 2106 train_lanenet.py:172] Training from scratch
But after this, I get a new error:
ValueError: Cannot feed value of shape (1, 256, 512, 3) for Tensor 'input_tensor:0', which has shape '(2, 256, 512, 3)'
Any ideas on how to solve this?
@chaine09 Your input tensor has shape (2, 256, 512, 3) but you feed it a tensor with shape (1, 256, 512, 3). You must have changed my code, so the input pipeline got stuck. Next time I hope you paste your code here :)
@MaybeShewill-CV No, I haven't changed your code. Yeah, from the looks of it the input tensor is receiving a tensor with a different shape than the one specified.
Oh, I just realized that the input tensor shape should be (2, 256, 512, 3). Why? Is it not (1, 256, 512, 3)?
Am I correct that the shape of the image is 256 x 512? And the 3 is because of the RGB channels of the image, right?
@MaybeShewill-CV Hmm, the tensor you feed to input_tensor is gt_imgs, right? I checked the shape of gt_imgs and its shape is really (2, 256, 512, 3).
gt_imgs, binary_gt_labels, instance_gt_labels = train_dataset.next_batch(CFG.TRAIN.BATCH_SIZE)
gt_imgs = [cv2.resize(tmp,
dsize=(CFG.TRAIN.IMG_WIDTH, CFG.TRAIN.IMG_HEIGHT),
dst=tmp,
interpolation=cv2.INTER_LINEAR)
for tmp in gt_imgs]
gt_imgs = [tmp - VGG_MEAN for tmp in gt_imgs]
You assign gt_imgs three times in train_lanenet.py, and for each of these the shape is (2, 256, 512, 3). I checked it with np.array(gt_imgs).shape.
@chaine09 Make sure the feed value and the placeholder have the same shape :)
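To see why the two shapes have to agree, here is an illustration only (not the repository's actual graph): a placeholder whose batch dimension is fixed at 2 rejects any feed whose first dimension differs, which is exactly the ValueError above.
import numpy as np
import tensorflow as tf

# Placeholder with a fixed batch dimension of 2, as in the error message.
input_tensor = tf.placeholder(dtype=tf.float32,
                              shape=[2, 256, 512, 3],
                              name='input_tensor')
output = tf.reduce_mean(input_tensor)

with tf.Session() as sess:
    # works: first dimension matches the placeholder's batch dimension
    sess.run(output, feed_dict={input_tensor: np.zeros((2, 256, 512, 3), np.float32)})
    # raises ValueError: Cannot feed value of shape (1, 256, 512, 3) ...
    sess.run(output, feed_dict={input_tensor: np.zeros((1, 256, 512, 3), np.float32)})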
Hi @MaybeShewill-CV, I'm not training on a GPU, just a CPU. I found this line of code. Should I change it?
with tf.device('/gpu:1'):
What does this do?
@chaine09 If you want to use the CPU instead, then you should change it to ('/cpu:0').
Hi @MaybeShewill-CV, since I'm using only the CPU for training, I made the following changes as you suggested:
I changed
with tf.device('/gpu:1'):
to
with tf.device('/cpu:0'):
Then I found this block of code:
sess_config = tf.ConfigProto(allow_soft_placement=True)
sess_config.gpu_options.per_process_gpu_memory_fraction = CFG.TRAIN.GPU_MEMORY_FRACTION
sess_config.gpu_options.allow_growth = CFG.TRAIN.TF_ALLOW_GROWTH
sess_config.gpu_options.allocator_type = 'BFC'
sess = tf.Session(config=sess_config)
So I changed the last line to just
sess = tf.Session()
And I retrained the model. However, I'm still getting this error:
ValueError: Cannot feed value of shape (1, 256, 512, 3) for Tensor 'input_tensor:0', which has shape '(2, 256, 512, 3)'
So I was thinking of 3 probable sources of this error:
1. Incorrect image type or structure in the three folders image, gt_image_instance, and gt_image_binary. However, I just copied the 5 existing files and replicated them with different file names (up to img0017), and I also updated the train.txt file by adding the copies.
2. Needing to modify some parts of the global_config.py file found in /config. So far, I have only changed the train batch size to 2 and the test batch size to 1. Should I still modify other parts of this file?
3. Needing to change the path in lanenet_data_processor.py to point to the train.txt file on my local computer. But you told me not to change this file.
Do I need to modify the train_lanenet.py code? But I have re-downloaded your updated code, and you told me that you retrained it on your computer with no errors.
Is your code meant for TensorFlow with GPU? But I already modified the train_lanenet.py code for this. Did I do it correctly? The error, however, is about the incorrect shape of the input_tensor.
Hmmm... which do you think is the most probable source of the error? Thanks in advance :)
@chaine09 There is only one placeholder as the input tensor, so the test batch size and the train batch size must be the same :)
@MaybeShewill-CV Oh, I changed the train and test batch size to 2 just now. The val batch size is still 1.
2018-11-02 19:09:49.783596: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
I1102 19:09:50.384888 2509 train_lanenet.py:162] Global configuration is as follows:
I1102 19:09:50.385061 2509 train_lanenet.py:163] {'TEST': {'TF_ALLOW_GROWTH': True, 'GPU_MEMORY_FRACTION': 0.8, 'BATCH_SIZE': 2}, 'TRAIN': {'DISPLAY_STEP': 1, 'LEARNING_RATE': 0.0005, 'LR_DECAY_RATE': 0.1, 'IMG_HEIGHT': 256, 'MOMENTUM': 0.9, 'IMG_WIDTH': 512, 'GPU_MEMORY_FRACTION': 0.85, 'VAL_BATCH_SIZE': 1, 'LR_DECAY_STEPS': 410000, 'TEST_DISPLAY_STEP': 1000, 'EPOCHS': 200010, 'CLASSES_NUMS': 2, 'TF_ALLOW_GROWTH': True, 'BATCH_SIZE': 2}}
Still the same error, though.
@chaine09 It seems you have read almost nothing of the code. CFG.TRAIN.VAL_BATCH_SIZE is also used in the code. Please read it first; the question you are asking should be easy for you to fix yourself.
I was successful in testing the trained model using the trained weights you uploaded to Dropbox. However, I want to retrain the model on new training data.
I added one new image to the existing training data of five images, following the instructions in the repo, and added the new images to the image, gt_image_instance, and gt_image_binary folders, but I get errors. I enter this line from your repo in bash:
python tools/train_lanenet.py --net vgg --dataset_dir data/training_data_example/
The errors I get are:
cv2.error: OpenCV(3.4.2) /Users/travis/build/skvark/opencv-python/opencv/modules/imgproc/src/resize.cpp:4044: error: (-215:Assertion failed) !ssize.empty() in function 'resize'
and sometimes I get another error.
I already modified train.txt and val.txt and changed the file paths to point to the images found locally on my machine.
How do I fix this?