Retraining model on new dataset

dscha09 commented 5 years ago

I was able to test the trained model using the trained weights you uploaded to Dropbox. However, I want to retrain the model on new training data.

I added one new image to the existing training data of five images, following the instructions in the repo, and placed the new files in the image, gt_image_instance, and gt_image_binary folders, but I get errors. I ran this command from your repo in bash:

python tools/train_lanenet.py --net vgg --dataset_dir data/training_data_example/

The errors I get are:

cv2.error: OpenCV(3.4.2) /Users/travis/build/skvark/opencv-python/opencv/modules/imgproc/src/resize.cpp:4044: error: (-215:Assertion failed) !ssize.empty() in function 'resize'

and sometimes I get this error:

ValueError: Variable lanenet_loss/inference/encode/conv1_1/conv/W already exists

I already modified the train.txt and val.txt and changed the file paths for the images found locally on my machine.

How can I fix this?
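An OpenCV `!ssize.empty()` assertion in `resize` generally means the input image was empty, which is what happens when `cv2.imread` returns `None` for a path it cannot read. A minimal sketch for checking the index files before training, assuming each line of train.txt or val.txt holds whitespace-separated image paths (that layout is an assumption here, and `find_missing_paths` is a hypothetical helper, not part of the repo):

```python
import os


def find_missing_paths(index_file):
    """Return (line_number, path) pairs for listed files that do not exist."""
    missing = []
    with open(index_file, "r") as handle:
        for line_number, line in enumerate(handle, start=1):
            # Assumed layout: one sample per line, paths separated by whitespace.
            # Paths that themselves contain spaces would need a different split.
            for path in line.split():
                if not os.path.isfile(path):
                    missing.append((line_number, path))
    return missing
```

Calling `find_missing_paths("data/training_data_example/train.txt")` (the file location is assumed) and getting back an empty list would suggest the paths are at least present on disk; any entries returned point at the lines to fix.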

dscha09 commented 5 years ago

@MaybeShewill-CV Hmm, you mean train_lanenet.py? I was trying to run it line by line, but I get a different error that does not appear when I run it from bash.

Update:

I set both the train and test sizes to 1, did the same for the val batch size, and training started successfully.

Okay. I will follow your suggestion and read the code again, through to the end, so I can get a sense of what is going on.

MaybeShewill-CV commented 5 years ago

@chaine09 Since the training process works I will close this issue:)

dscha09 commented 5 years ago

@MaybeShewill-CV Okay! But before you close this, please tell me whether you trained with only those 5 images?

MaybeShewill-CV commented 5 years ago

@chaine09 Yep, it works well with only five images.

dscha09 commented 5 years ago

Hi @MaybeShewill-CV, I'm now training the model and am currently at the 2000th epoch. But the /model/tusimple_lanenet folder contains only two checkpoints: tusimple_lanenet_vgg_2018-10-19-13-33-56.ckpt-200000, which is the saved model from your Dropbox, and tusimple_lanenet_vgg_2018-11-02-19-13-33.ckpt-0, which was created when I started retraining the model.

Is it correct that for tusimple_lanenet_vgg_2018-10-19-13-33-56.ckpt-200000, "200000" is the epoch number?

I'm already on the 2000th epoch and there is still no checkpoint other than tusimple_lanenet_vgg_2018-11-02-19-13-33.ckpt-0. I want to test the model as trained up to the current epoch while training is still ongoing.

Should I let the entire training finish first before expecting a more recent checkpoint? Or should I abort the current training?

I tried `tusimple_lanenet_vgg_2018-11-02-19-13-33.ckpt-2000` but it says that the file name doesn't exist:

ValueError: The passed save_path is not a valid checkpoint: /Users/cvsanbuenaventura/Documents/lanenet-lane-detection-master/model/tusimple_lanenet/tusimple_lanenet_vgg_2018-11-02-19-13-33.ckpt-2000
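TensorFlow's `Saver.save` appends `-<global_step>` to the checkpoint prefix when a `global_step` is passed, so the suffix names whatever counter the training script supplied, and a prefix such as `...ckpt-2000` can only be restored once files for that step have actually been written. A small sketch for listing the steps present in a directory, assuming the usual `<prefix>-<step>.index` files of the V2 checkpoint format (`list_checkpoint_steps` is a hypothetical helper, not part of the repo):

```python
import os
import re

# Matches names such as "model.ckpt-2000.index" and captures the step number.
_INDEX_PATTERN = re.compile(r"^.+-(?P<step>\d+)\.index$")


def list_checkpoint_steps(model_dir):
    """Return sorted (step, checkpoint_prefix) pairs found in model_dir."""
    found = []
    for name in os.listdir(model_dir):
        match = _INDEX_PATTERN.match(name)
        if match:
            # The restorable prefix is the file name without the ".index" suffix.
            prefix = os.path.join(model_dir, name[: -len(".index")])
            found.append((int(match.group("step")), prefix))
    return sorted(found)
```

Within TensorFlow itself, `tf.train.latest_checkpoint(model_dir)` returns the most recent prefix recorded in the directory's `checkpoint` file, which avoids guessing a step number by hand.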

MaybeShewill-CV commented 5 years ago

@chaine09 I wonder if you have used TensorFlow before? Finish your training process and have a little patience, OK?

dscha09 commented 5 years ago

@MaybeShewill-CV Oh, I asked because it's possible to save a checkpoint every epoch. I was wondering whether your code does that?

dscha09 commented 5 years ago

@MaybeShewill-CV Never mind my question, I see you save the checkpoint every 2000 epochs :)

dscha09 commented 5 years ago

Hi @MaybeShewill-CV, I have one last question. How did you generate the binary and instance images for the training data?

MaybeShewill-CV commented 5 years ago

@chaine09 You can follow the TuSimple dataset readme file. The training samples can be generated based on its guidance :)
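As a starting point for that, the TuSimple label files are JSON lines in which each record carries `raw_file`, `h_samples` (the y coordinates), and `lanes` (one list of x coordinates per lane, with `-2` marking rows where the lane is absent). A minimal sketch that turns one record into per-lane point lists; rasterizing those points into binary and instance images is left out, and `lanes_to_points` is a hypothetical helper name:

```python
import json


def lanes_to_points(label_line):
    """Parse one TuSimple label line into (raw_file, list of lane point lists)."""
    record = json.loads(label_line)
    h_samples = record["h_samples"]
    lanes = []
    for lane_xs in record["lanes"]:
        # Negative x values (-2 in TuSimple labels) mean "no lane at this row".
        points = [(x, y) for x, y in zip(lane_xs, h_samples) if x >= 0]
        if points:
            lanes.append(points)
    return record["raw_file"], lanes
```

Each returned lane is a list of `(x, y)` pixel coordinates, which could then be drawn as a polyline onto a blank image of the same size as the source frame.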

dscha09 commented 5 years ago

Hi @MaybeShewill-CV, I also want to ask about some odd behavior I've noticed while retraining your model (using the original training dataset you provided).

I noticed that the accuracy dropped to 0 on the 9th epoch, and started to slowly rise again on the 30th.

Then, beyond 2000 epochs, the accuracy is almost 100%. The accuracy reached a steady value of 100% for epochs greater than 4000.

I just replicated your original data of 5 images, then proceeded to train on 17 (replicated) images.

Although the training accuracy is 100%, the output generated for epochs greater than 2000 is not accurate.

For the 2000th epoch, I only got one lane line. Then for epochs greater than 4000, I only get a black image (even though the training accuracy is 100%).

Should I just continue with the training?
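The thread does not say how the training script computes its accuracy figure, so the following is only a general caveat rather than a diagnosis. When lane pixels are a small fraction of the image, plain pixel accuracy can stay high even for an all-black prediction, while an overlap measure such as IoU on the lane class drops to zero. A toy sketch with made-up numbers (1000 pixels, 20 of them lane):

```python
def pixel_accuracy(pred, truth):
    """Fraction of pixels whose predicted label equals the ground truth."""
    correct = sum(1 for p, t in zip(pred, truth) if p == t)
    return correct / len(truth)


def lane_iou(pred, truth):
    """Intersection over union for the lane class (label 1)."""
    intersection = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 1)
    union = sum(1 for p, t in zip(pred, truth) if p == 1 or t == 1)
    return intersection / union if union else 1.0


# Toy flattened mask: 1000 pixels, 20 of them lane pixels (illustrative only).
truth = [1] * 20 + [0] * 980
all_black = [0] * 1000

print(pixel_accuracy(all_black, truth))  # 0.98
print(lane_iou(all_black, truth))        # 0.0
```

Checking an overlap metric on held-out images, rather than training accuracy alone, gives a better sense of whether the predicted lanes are usable.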

MaybeShewill-CV commented 5 years ago

@chaine09 With only five images you will get nothing

dscha09 commented 5 years ago

@MaybeShewill-CV What do you mean by get nothing?

What I did was copy and paste the 5 images 4 times. That's why I have a total of almost 20 images. Is this sufficient?

MaybeShewill-CV commented 5 years ago

@chaine09 I suggest you train the model on the whole TuSimple dataset.

dscha09 commented 5 years ago

@MaybeShewill-CV May I ask, for the saved model you provided in Dropbox, how many images did you train it with?

MaybeShewill-CV commented 5 years ago

@chaine09 The model was trained on the whole TuSimple lane dataset.

dscha09 commented 5 years ago

@MaybeShewill-CV And you really trained it for 200010 epochs?

MaybeShewill-CV commented 5 years ago

@chaine09 ...... yes

dscha09 commented 5 years ago

@MaybeShewill-CV Can you tell me what format your code accepts for the binary and instance segmentation files?

MaybeShewill-CV / lanenet-lane-detection

Retraining model on new dataset #74