Warmup stage (Notebook vs. frontend.py)

letilessa commented 6 years ago

I have read several issues related to the warmup stage, but I am still confused. The code in the notebook also differs from frontend.py.

In the notebook there is only the WARM_UP_BATCHES variable, which people are setting to 3 (#237) and which seems to correspond to warmup_epochs in frontend.py. The warmup_batches value defined in frontend.py, shown below, doesn't exist in the notebook.

self.warmup_batches = warmup_epochs * (train_times*len(train_generator) + valid_times*len(valid_generator))
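For illustration, the formula above can be evaluated with plain numbers. This is a minimal sketch: the helper name `compute_warmup_batches` and the generator lengths (batches per epoch) are hypothetical, and only the arithmetic is taken from the line quoted from frontend.py.

```python
def compute_warmup_batches(warmup_epochs, train_times, n_train_batches,
                           valid_times, n_valid_batches):
    """Total number of batches counted as warmup, per the quoted formula."""
    batches_per_epoch = (train_times * n_train_batches
                         + valid_times * n_valid_batches)
    return warmup_epochs * batches_per_epoch


# Hypothetical sizes: 500 training batches and 50 validation batches per epoch.
warmup_batches = compute_warmup_batches(
    warmup_epochs=3, train_times=1, n_train_batches=500,
    valid_times=1, n_valid_batches=50)
print(warmup_batches)  # 3 * (500 + 50) = 1650

# With warmup_epochs=0 the warmup covers no batches at all.
print(compute_warmup_batches(0, 1, 500, 1, 50))  # 0
```

Whether the notebook's WARM_UP_BATCHES is counted in batches or in epochs is not settled in this thread. If it is a batch count, a value of 3 would cover far fewer batches than warmup_epochs=3 does in this sketch.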

The general advice for warmup is to first set it to 3 and then set it to 0 for the actual training (#46, #335). But I am not sure whether I am doing it right in the notebook, or how it works with that code.

What happens is that during training the recall goes to zero after a while, and the predictions have very small values with poorly fitting boxes.

rodrigo2019 commented 6 years ago

@letilessa I also don't understand very well how it works, but I found a problem in this repository: training always saves the weights with the best loss, but the best loss doesn't mean the best mAP. So I changed the code to save the best mAP as well, and my model's predictions improved after doing that. Here is my fork. Also, I don't use early stopping.
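The idea of tracking the best loss and the best mAP separately can be sketched in plain Python. This is not the code from the fork mentioned above: the class `BestMetricTracker` and the per-epoch numbers are hypothetical, and the places where weights would be saved are only marked with comments.

```python
import math


class BestMetricTracker:
    """Remembers the best value seen for one metric and reports improvements."""

    def __init__(self, mode):
        if mode not in ("min", "max"):
            raise ValueError("mode must be 'min' or 'max'")
        self.mode = mode
        self.best = math.inf if mode == "min" else -math.inf

    def update(self, value):
        """Return True (and store the value) when it beats the best so far."""
        if self.mode == "min":
            improved = value < self.best
        else:
            improved = value > self.best
        if improved:
            self.best = value
        return improved


# Hypothetical per-epoch results: (validation loss, mAP).
history = [(1.20, 0.10), (0.90, 0.35), (0.95, 0.42), (0.85, 0.40)]

loss_tracker = BestMetricTracker("min")
map_tracker = BestMetricTracker("max")
best_loss_epoch = best_map_epoch = None

for epoch, (val_loss, mean_ap) in enumerate(history):
    if loss_tracker.update(val_loss):
        best_loss_epoch = epoch  # a "best loss" checkpoint would be saved here
    if map_tracker.update(mean_ap):
        best_map_epoch = epoch   # a "best mAP" checkpoint would be saved here

print(best_loss_epoch, best_map_epoch)  # 3 2
```

In this made-up history the lowest loss and the highest mAP fall on different epochs, which is the situation described in the comment above.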

letilessa commented 6 years ago

@rodrigo2019 What values did you use for the 4 scales? And what values did you get for mAP and loss using Mobilenet backend?

rodrigo2019 commented 6 years ago

4 scales? Do you mean the anchor values? My work with this repository focuses on car detection with high FPS on hardware like a Raspberry. I didn't use these backends because they were too slow, even MobileNet and Tiny Darknet, so I designed my own network based on Tiny Darknet. I can't answer this question because I'm using a custom backend and a custom dataset, but I can say I'm getting really good predictions: my mAP is around 85-92, at 10-15 FPS on hardware like a Raspberry.

letilessa commented 6 years ago

4 scales?

I mean object_scale, no_object_scale, coord_scale and class_scale. The default values in this repository are 5, 1, 1, 1 respectively, but in the YOLO paper it seems the authors used no_object_scale=0.5 and coord_scale=5. I also saw @experiencor advising people to play with the 4 scales in other issues (#46). Did you change these values?
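To make the role of the four scales concrete, here is a minimal sketch of a loss that is a weighted sum of four terms. It assumes the scales act as simple multipliers on unweighted terms; the repository's real loss has more detail (masks, normalization) that is not reproduced here. The names `LossScales` and `weighted_loss` and the per-batch term values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class LossScales:
    object_scale: float
    no_object_scale: float
    coord_scale: float
    class_scale: float


def weighted_loss(scales, obj_loss, no_obj_loss, coord_loss, class_loss):
    """Combine four unweighted loss terms with their scale factors."""
    return (scales.object_scale * obj_loss
            + scales.no_object_scale * no_obj_loss
            + scales.coord_scale * coord_loss
            + scales.class_scale * class_loss)


# Defaults quoted in this thread for the repository: 5, 1, 1, 1.
repo_defaults = LossScales(object_scale=5.0, no_object_scale=1.0,
                           coord_scale=1.0, class_scale=1.0)
# Values as read from the YOLO paper in this thread; the two scales the
# thread does not mention are assumed to be 1.
paper_like = LossScales(object_scale=1.0, no_object_scale=0.5,
                        coord_scale=5.0, class_scale=1.0)

# Hypothetical unweighted terms for one batch.
terms = dict(obj_loss=0.2, no_obj_loss=0.4, coord_loss=0.1, class_loss=0.3)

print(weighted_loss(repo_defaults, **terms))  # 5*0.2 + 1*0.4 + 1*0.1 + 1*0.3
print(weighted_loss(paper_like, **terms))     # 1*0.2 + 0.5*0.4 + 5*0.1 + 1*0.3
```

With the same raw terms, the two settings emphasize different parts of the loss: the first weights objectness errors most heavily, the second weights box coordinates.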

rodrigo2019 commented 6 years ago

Changing 1 to 3 helped the network produce fewer false positives, but it also made the network take more time to converge. I'm currently using these parameters.

letilessa commented 6 years ago

Hi Rodrigo, I used your mAP callback, but it makes training much slower and it is giving back zero for every epoch, not showing any improvement.

How are you doing the warmup stage? I used WARM_UP_BATCHES=3 for 50 epochs, then WARM_UP_BATCHES=0 for 100 epochs, but I am still getting recall zero and mAP zero.

rodrigo2019 commented 6 years ago

@letilessa Unfortunately it cannot be faster, because the callback processes the whole validation dataset. I'm using warmup = 3, and I set epochs to around 2k. If you check #291 you can see that some trainings only start to report an mAP above 0 after 30-40 epochs; I have already had a training where I only got results after 210 epochs. I also get good results after 12 hours of training on a GTX 1070. Could you tell me what you are trying to train and which configuration you are using? Maybe I can help you.

letilessa commented 6 years ago

I am training on Pascal VOC 2007+2012, similar to the YOLO paper, but with the MobileNet backend. I am now trying to use the repository code instead of the notebook, but when I load the weights from the warmup stage I get this error:

Traceback (most recent call last):
  File "train.py", line 116, in <module>
    _main_(args)
  File "train.py", line 92, in _main_
    yolo.load_weights(config['train']['pretrained_weights'])
  File "/media/eHD/leticia/keras-yolo2/frontend.py", line 247, in load_weights
    self.model.load_weights(weight_path)
  File "/home/letica/.conda/envs/cipa2/lib/python3.5/site-packages/keras/engine/network.py", line 1181, in load_weights
    f, self.layers, reshape=reshape)
  File "/home/letica/.conda/envs/cipa2/lib/python3.5/site-packages/keras/engine/saving.py", line 916, in load_weights_from_hdf5_group
    reshape=reshape)
  File "/home/letica/.conda/envs/cipa2/lib/python3.5/site-packages/keras/engine/saving.py", line 557, in preprocess_weights_for_loading
    weights = convert_nested_model(weights)
  File "/home/letica/.conda/envs/cipa2/lib/python3.5/site-packages/keras/engine/saving.py", line 545, in convert_nested_model
    original_backend=original_backend))
  File "/home/letica/.conda/envs/cipa2/lib/python3.5/site-packages/keras/engine/saving.py", line 557, in preprocess_weights_for_loading
    weights = convert_nested_model(weights)
  File "/home/letica/.conda/envs/cipa2/lib/python3.5/site-packages/keras/engine/saving.py", line 533, in convert_nested_model
    original_backend=original_backend))
  File "/home/letica/.conda/envs/cipa2/lib/python3.5/site-packages/keras/engine/saving.py", line 675, in preprocess_weights_for_loading
    weights[0] = np.transpose(weights[0], (3, 2, 0, 1))
  File "/home/letica/.conda/envs/cipa2/lib/python3.5/site-packages/numpy/core/fromnumeric.py", line 575, in transpose
    return _wrapfunc(a, 'transpose', axes)
  File "/home/letica/.conda/envs/cipa2/lib/python3.5/site-packages/numpy/core/fromnumeric.py", line 52, in _wrapfunc
    return getattr(obj, method)(*args, **kwds)
ValueError: axes don't match array

rodrigo2019 commented 6 years ago

I would guess that you are using an old version of Keras, but I'm not sure. (I'm using version 2.1.5.)

By the way, look at this comment: after 5 hours of training I started to get some results.

I will start a training run with MobileNet today and give you an answer tomorrow.

rodrigo2019 commented 6 years ago

@letilessa, here are my results after 10 epochs:

config.zip: I didn't use pretrained weights for MobileNet.

letilessa commented 6 years ago

Are you doing warmup together with training? I first ran the code with warmup_epochs=3 and nb_epochs=0 to do the warmup, then ran it with warmup_epochs=0 and nb_epochs=100 to train. The loss during warmup was around 11, but during training it became NaN.

I see that you are using different anchors. Where did you get these values? Do workers and max_queue_size make any difference to training?

rodrigo2019 commented 6 years ago

@letilessa I'm running the warmup in the same training run. I generated the anchors using the gen_anchors.py script. workers is the number of threads used to preprocess batches in the batch generator, and max_queue_size is how many preprocessed batches can wait in the queue before entering training. With good values for these parameters you can speed up training; I found my values empirically.
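The roles of workers and max_queue_size described above can be mimicked with a small producer/consumer sketch. This only models the description in this comment with the standard library; it is not how Keras implements its data pipeline, and `run_pipeline` and the stand-in batch contents are hypothetical.

```python
import queue
import threading


def run_pipeline(n_batches, workers, max_queue_size):
    """Prepare n_batches with `workers` threads and consume them in one loop."""
    todo = queue.Queue()
    for i in range(n_batches):
        todo.put(i)
    # Bounded queue: producers block once max_queue_size batches are waiting.
    ready = queue.Queue(maxsize=max_queue_size)

    def worker():
        while True:
            try:
                index = todo.get_nowait()
            except queue.Empty:
                return                 # nothing left to prepare
            batch = [index] * 4        # stand-in for loading and augmenting images
            ready.put(batch)           # waits here while the queue is full

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()

    consumed = []
    for _ in range(n_batches):
        consumed.append(ready.get())   # stand-in for one training step
    for t in threads:
        t.join()
    return consumed


batches = run_pipeline(n_batches=20, workers=3, max_queue_size=8)
print(len(batches))  # 20
```

More workers help only while batch preparation is the bottleneck, and a larger queue mostly smooths out uneven preparation times, which is consistent with tuning both values empirically.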

letilessa commented 6 years ago

Hi @rodrigo2019, did you finish training with mobilenet? Can you tell me your email?

rodrigo2019 commented 6 years ago

@letilessa I stopped at epoch 10 because I already had some results, but I can do a full training if necessary. My email is rodrigormda@hotmail.com

zenoZhao commented 6 years ago

@experiencor I have the same problem as @letilessa. I used the YOLO-step-by-step notebook to train on my own dataset, which has five classes. The problem is that after a few epochs, current recall and total recall drop to 0. It seems that the notebook does not support the warmup. Any advice will be appreciated!

abhijithvnair94 commented 5 years ago

@rodrigo2019 I am also trying to build a network for car detection. I would like to know on what basis you designed the custom backend. Could you please share it with me by email at abhijith.mtmt17@iitp.ac.in?

Aaron4Fun commented 5 years ago

@letilessa Have you fixed your problem? I always get NaN at the beginning of training, even after changing the anchors and using the pretrained weights. I used YOLOv2 as the backend. Is it related to the warmup training? Thanks in advance.

experiencor / keras-yolo2

Warmup stage (Notebook vs. frontend.py) #343