experiencor / keras-yolo2

Easy training on custom datasets. Various backends (MobileNet and SqueezeNet) supported. A YOLO demo that detects raccoons, running entirely in the browser, is accessible at https://git.io/vF7vI (not on Windows).
MIT License

Loss nan on Jupyter notebook #250

Open IMABUNNEH opened 6 years ago

IMABUNNEH commented 6 years ago

Hi,

I've seen the issue was resolved for someone here: https://github.com/experiencor/keras-yolo2/issues/237

However I've set my warmup batches to 3 and I'm still getting nan on training.

This only occurs when trying to use training with a weights file previously created through training (in an attempt to improve on it), rather than doing it fresh. Any ideas?

ZacharyForrest commented 6 years ago

If you're resuming training with your pretrained weights, try loading your weights here just before compilation like so. Worked for me!


from keras.optimizers import Adam, SGD, RMSprop

optimizer = Adam(lr=0.5e-4, beta_1=0.9, beta_2=0.999, epsilon=1e-08, decay=0.0)
#optimizer = SGD(lr=1e-4, decay=0.0005, momentum=0.9)
#optimizer = RMSprop(lr=1e-4, rho=0.9, epsilon=1e-08, decay=0.0)

# load the previously trained weights before compiling
model.load_weights("YOURWEIGHTS.h5")

model.compile(loss=custom_loss, optimizer=optimizer)

model.fit_generator(generator        = train_batch, 
                    steps_per_epoch  = len(train_batch), 
                    epochs           = 100, 
                    verbose          = 1,
                    validation_data  = valid_batch,
                    validation_steps = len(valid_batch),
                    callbacks        = [early_stop, checkpoint, tensorboard], 
                    max_queue_size   = 3)
tamersalama commented 6 years ago

Trying to update the nan-related issues: what worked for me was adding images and annotations to "valid_image_folder" (I previously relied on having the training set split 80/20 as per the README, but got nan losses). I also changed the training nb_epochs from 1 to 10, and will likely need more. Actually, it might have to do with the model anchors: generating new ones (other than the ones given in the README example) led to the NaN values.

letilessa commented 5 years ago

On issue #237 what do they mean by warmup stage?

rodrigo2019 commented 5 years ago

It is a trick. It makes each cell's prediction match the size of the anchors, which gives the weights better starting values than purely random ones. It is not written in the original paper, but it seems the original author does the same thing in his C++ implementation. Cheers :)
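
For reference, this is roughly what the warmup does inside the notebook's custom loss. The snippet below is a paraphrase under the notebook's variable names (true_box_xy, cell_grid, seen, etc.), not a drop-in replacement; treat it as a sketch of the idea only:

import numpy as np
import tensorflow as tf

WARM_UP_BATCHES = 100  # illustrative value

def apply_warmup(true_box_xy, true_box_wh, coord_mask,
                 cell_grid, anchors, no_boxes_mask, seen):
    """During the first WARM_UP_BATCHES, cells with no object get the cell
    centre as their xy target and the anchor size as their wh target, and
    every cell contributes to the coordinate loss. Afterwards, the real
    ground truth is used and only cells with objects are penalised."""
    anchors = np.reshape(anchors, [1, 1, 1, -1, 2]).astype(np.float32)

    def warm():
        return [true_box_xy + (0.5 + cell_grid) * no_boxes_mask,                    # cell centres
                true_box_wh + tf.ones_like(true_box_wh) * anchors * no_boxes_mask,  # anchor sizes
                tf.ones_like(coord_mask)]                                           # train every cell

    def normal():
        return [true_box_xy, true_box_wh, coord_mask]

    # seen is a batch counter maintained by the loss
    return tf.cond(tf.less(seen, WARM_UP_BATCHES + 1), warm, normal)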

letilessa commented 5 years ago

To do the warmup stage, do I just need to assign a value to WARM_UP_BATCHES, or do I need to add something else to the code in the notebook?

I saw that I was missing the following lines in my code, and when I added them I was able to train the model for 10 epochs without the nan loss appearing for the Pascal VOC 2007 dataset. When I added the Pascal VOC 2012 dataset to train with the same data as the YOLO paper, the nan loss appeared again in the middle of the first epoch, starting with the conf loss.

layer = model.layers[-4] # the last convolutional layer
weights = layer.get_weights()

new_kernel = np.random.normal(size=weights[0].shape)/(GRID_H*GRID_W)
new_bias = np.random.normal(size=weights[1].shape)/(GRID_H*GRID_W)

layer.set_weights([new_kernel, new_bias])

Why is the last layer referred to as [-4]? If I add more layers between the feature extractor and this layer, do I have to initialize their weights as well? Which index would I use to refer to the last 4 layers, for example?

ZacharyForrest commented 5 years ago

If I recall correctly, just set WARM_UP_BATCHES and you're good. You can tell it's working because your loss will jump (probably up) once the warmup is complete, so just keep an eye on the output.

I think layer[-4] is the last convolutional layer of the 'feature extractor' prior to the 'object detection layers' in the complete model.

I had a lot of NaN issues when I wasn't using a properly configured GPU, something to check maybe.

rodrigo2019 commented 5 years ago

Check #291. This YOLO structure is based on a model inside another model: layer[-5] is a full model containing the convolutional layers, layer[-4] is the convolutional layer that does the detections, layer[-3] is a reshape to organize the outputs, and layer[-2] and [-1] are a workaround to put the ground-truth boxes inside the model during training; you can take them out after training. If you check the code, the detection layer has a special weight initializer; if you add more layers you will need to initialize them with a Keras initializer or repeat the same initializer used on that layer.
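
To make the indexing concrete, here is a hedged sketch (assuming the assembled model from the notebook is available as model, and GRID_H/GRID_W come from the config; the layer name is illustrative) of inspecting the tail of the model and re-initializing an extra conv layer added before the detection layer:

import numpy as np
from keras.layers import Conv2D

# Inspect the last few layers of the assembled model (indices as described above)
for idx in range(-5, 0):
    layer = model.layers[idx]
    print(idx, layer.name, layer.__class__.__name__)

# If you insert an extra Conv2D between the feature extractor and the
# detection layer, give it an explicit initializer ...
extra_conv = Conv2D(1024, (3, 3), padding='same',
                    kernel_initializer='lecun_normal', name='extra_conv')

# ... or repeat the notebook's "divide random weights by the grid area" trick
# on any layer you want to re-initialize:
def reinit_like_notebook(layer, grid_h, grid_w):
    weights = layer.get_weights()
    new_kernel = np.random.normal(size=weights[0].shape) / (grid_h * grid_w)
    new_bias   = np.random.normal(size=weights[1].shape) / (grid_h * grid_w)
    layer.set_weights([new_kernel, new_bias])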

letilessa commented 5 years ago

> I had a lot of NaN issues when I wasn't using a properly configured GPU, something to check maybe.

I increased the capacity of the GPU, because before I was only using a fraction of it, but I am still getting the nan loss for the bigger dataset. How did you configure your GPU?

> If you check the code, the detection layer has a special weight initializer; if you add more layers you will need to initialize them with a Keras initializer or repeat the same initializer used on that layer.

Do you mean the initializer code that I mentioned above or this keras kernel_initializer='lecun_normal' that you mentioned on #291? Do I have to use both or just one?

Another question: in the YOLO paper he pre-trained Darknet-19 on ImageNet, then added 3 convolutional layers before the detection layer to train for detection. Why didn't you add these 3 layers when using different backends? And why do you load the weights from external .h5 files instead of using the keras argument weights='imagenet' when you import the models?

Eg: MobileNet(input_shape=(224,224,3), include_top=False, weights='imagenet')

rodrigo2019 commented 5 years ago

> Do you mean the initializer code that I mentioned above or this keras kernel_initializer='lecun_normal' that you mentioned on #291? Do I have to use both or just one?

I don't know if this special initializer must be used on all layers

> And why do you load the weights from external .h5 files instead of using the keras argument weights='imagenet' when you import the models?

In my experience I got worse results using those pretrained weights. I'm also trying to create a script to train a backend model for classification before using it for detection, but I'm getting worse results doing that as well. I'm interested in creating a custom backend capable of running at high FPS on CPU for simple tasks, which is why I'm trying to recreate all the steps done by the original author. If anyone is interested, I invite you to help me with the development on this branch

letilessa commented 5 years ago

Hi @rodrigo2019,

Do you know how to test the model speed in FPS?

rodrigo2019 commented 5 years ago

From predict.py, with import time added at the top:
. . .

        video_reader = cv2.VideoCapture(image_path)

        nb_frames = int(video_reader.get(cv2.CAP_PROP_FRAME_COUNT))
        frame_h = int(video_reader.get(cv2.CAP_PROP_FRAME_HEIGHT))
        frame_w = int(video_reader.get(cv2.CAP_PROP_FRAME_WIDTH))

        video_writer = cv2.VideoWriter(video_out,
                               cv2.VideoWriter_fourcc(*'MPEG'), 
                               50.0, 
                               (frame_w, frame_h))

        for i in tqdm(range(nb_frames)):
            _, image = video_reader.read()

            t=time.time()
            boxes = yolo.predict(image)
            print("fps: ", 1/(time.time()-t))
            image = draw_boxes(image, boxes, config['model']['labels'])

            video_writer.write(np.uint8(image))

I think this is the easiest way to do it; you can improve it by taking the mean over the last N samples.

ps: maybe you can improve the speed a lot by predicting batches instead of a single sample, but you will need to change the code architecture and you will create a delay between the real-time video and the predictions. Maybe a very small delay, but the delay will be there.
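
For the "mean over the last N samples" idea, a minimal sketch (the window size and names are just an illustration; yolo and image come from the loop above):

import time
from collections import deque

frame_times = deque(maxlen=30)   # keep the last 30 per-frame durations

t = time.time()
boxes = yolo.predict(image)      # same call as in predict.py above
frame_times.append(time.time() - t)

fps = len(frame_times) / sum(frame_times)
print("fps (avg over last %d frames): %.1f" % (len(frame_times), fps))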

bkanaki commented 5 years ago

Hi @ZacharyForrest, what do you mean by a properly configured GPU? Any special steps you had to take? I get NaN values with randomly initialized weights. Were you using pretrained weights and still getting nan?

Aaron4Fun commented 5 years ago

@IMABUNNEH Have you solved your problem? I'm still getting nan even though I've set my warmup batches to 3 and reset the anchors. I used the pre-trained weights "full_yolo_backend.h5".