Neerajj9 / Text-Detection-using-Yolo-Algorithm-in-keras-tensorflow

Implemented the YOLO algorithm for scene text detection in keras-tensorflow (no object detection API used). The code can be tweaked to train for a different object detection task using YOLO.
MIT License

I got a loss of nan when training the model #3

Closed Tyre20 closed 3 years ago

Neerajj9 commented 4 years ago

I'll need some more info. Are you using the exact same code, data and loss function?

Tyre20 commented 4 years ago

Everything is the same except for a different dataset.

Tyre20 commented 4 years ago

My accuracy shoots up to 100% after just one epoch.

jeewenjie commented 4 years ago

Check your X.npy and Y.npy files. Make sure the data inside are clean (no '\n' or extra characters, etc.).
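
A quick way to sanity-check the arrays is sketched below (it assumes the files are named X.npy and Y.npy, as in the repo's preprocessing step; any NaN or inf here will propagate straight into the loss):

    import numpy as np

    X = np.load('X.npy')
    Y = np.load('Y.npy')

    # NaN or inf in either the inputs or the encoded targets makes the loss nan
    print('X finite:', np.isfinite(X).all(), 'min/max:', X.min(), X.max())
    print('Y finite:', np.isfinite(Y).all(), 'min/max:', Y.min(), Y.max())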

Tyre20 commented 4 years ago

I get the same error, a loss of nan, with the same dataset and code.

Randheer91 commented 4 years ago

(screenshot attached: Screenshot from 2019-11-10 16-31-09)

I have also faced the same error, i.e. "val_loss did not improve from inf", and I find the accuracy and validation loss are "nan". Kindly suggest the proper solution for it. I have used the same code with the same dataset provided in your repository.

benjastudio commented 4 years ago

I think the code was written for a different version of Python than the one you are using (Python 3)? I got the same problem. I took a look at the code, and a bit of work has to be done to make it compatible with Python 2/3 (some division operators need to be fixed). Can you confirm, @Neerajj9?
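
For reference, the kind of difference in question is sketched below (a hypothetical grid-cell index computation in a YOLO-style target encoding; the variable names and values are only illustrative):

    # Python 2: 5 / 2 == 2 (floor division); Python 3: 5 / 2 == 2.5 (true division)
    center_x, cell_w = 317.0, 32.0
    cell_x = int(center_x // cell_w)  # explicit floor division behaves the same in 2 and 3
    print(cell_x)                     # 9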

goldengrisha commented 4 years ago

I've got the same issue, any updates?

goldengrisha commented 4 years ago

The problem was the activation function 'relu'. It can be fixed by using 'activation=tf.nn.leaky_relu' or 'activation=tf.nn.elu'. I've achieved the same result after 100 epochs.
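
For context, the change boils down to something like this in the detection head (a minimal sketch; the layer size is only illustrative):

    import tensorflow as tf
    from tensorflow.keras.layers import Conv2D

    # before: activation='relu' can die and push the YOLO loss to nan
    conv = Conv2D(512, (3, 3), activation=tf.nn.leaky_relu, padding='same')
    # alternatively: Conv2D(512, (3, 3), activation=tf.nn.elu, padding='same')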

mahirahzainipderas commented 4 years ago

Can you explain more, @goldengrisha? I got both loss and val_loss as nan and val_acc at 1. That seems incorrect for training data.

goldengrisha commented 4 years ago

@mahirahzainipderas it looks like you're facing the problem of dying gradients with the ReLU activation function (that's what the NaN means here: very small activations). If you use model.compile(loss=yolo_loss_func, optimizer=opt, metrics=['accuracy']), the 'accuracy' metric isn't meaningful in this context; instead, just focus on the decreasing loss (see the compile sketch after the model code below).

    import tensorflow as tf
    from tensorflow.keras.applications import MobileNetV2
    from tensorflow.keras.layers import (Input, Conv2D, Dropout,
                                         BatchNormalization, LeakyReLU, Reshape)
    from tensorflow.keras.models import Model

    # input_shape, grid_h, grid_w, classes and info are assumed to be defined
    # as in the repo's notebook
    inp = Input(input_shape)

    # MobileNetV2 backbone pretrained on ImageNet, without the classification head
    model = MobileNetV2(
        input_tensor=inp, include_top=False, weights='imagenet')
    last_layer = model.output

    # detection head: leaky ReLU instead of plain ReLU to avoid dying gradients
    conv = Conv2D(512, (3, 3), activation=tf.nn.leaky_relu,
                  padding='same')(last_layer)
    conv = Dropout(0.4)(conv)
    bn = BatchNormalization()(conv)
    lr = LeakyReLU(alpha=0.1)(bn)
    conv = Conv2D(128, (3, 3), activation=tf.nn.leaky_relu, padding='same')(lr)
    conv = Dropout(0.4)(conv)
    bn = BatchNormalization()(conv)
    lr = LeakyReLU(alpha=0.1)(bn)
    conv = Conv2D(5, (3, 3), activation=tf.nn.leaky_relu, padding='same')(lr)

    # reshape the feature map into (grid_h, grid_w, classes, info) for the YOLO loss
    final = Reshape((grid_h, grid_w, classes, info))(conv)

    model = Model(inp, final)
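
For completeness, a compile call without the misleading accuracy metric might look like this (a sketch; yolo_loss_func is the custom loss from the repo's notebook, and the Adam settings are only an assumption):

    from tensorflow.keras.optimizers import Adam

    opt = Adam(learning_rate=1e-4)                     # assumed optimizer and learning rate
    model.compile(loss=yolo_loss_func, optimizer=opt)  # no 'accuracy' metric; watch the loss instead
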
noarotman commented 4 years ago

Hi, I am facing this problem now. I tried @goldengrisha's tip but still got nan after the first step. Any updates?

goldengrisha commented 4 years ago

Hello @noarotman, what version of TF do you use?

noarotman commented 4 years ago

@goldengrisha I use the version that is written in the README file, TensorFlow 1.9.0, and I run this on CPU.

goldengrisha commented 4 years ago

@noarotman try switching to TF 2.1 and check the code above; it should work.
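
If it helps, you can confirm which TensorFlow build is actually being picked up before retraining:

    import tensorflow as tf

    print(tf.__version__)        # e.g. '2.1.0'
    print(tf.keras.__version__)  # the Keras version bundled with it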

noarotman commented 4 years ago

@goldengrisha thanks, I will try. Another issue is that when I try to run the YOLO model I get this error:

(error screenshot attached)

goldengrisha commented 4 years ago

@noarotman, you're welcome. Please check the last URL; it may be wrong.

Neerajj9 commented 4 years ago

Hello everyone, the code was written in Python 3 with the exact versions mentioned in the README.md file. There might be some compatibility issues, as @goldengrisha pointed out, which could be because of a different version of tensorflow or keras. Also, yes, concentrate on the decreasing loss rather than the 'accuracy' metric.
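
One way to act on that is to monitor the loss when checkpointing instead of accuracy; a sketch, assuming the model and the X/Y arrays from the notebook (the filename, batch size, epochs and patience are only illustrative):

    from tensorflow.keras.callbacks import ModelCheckpoint, EarlyStopping

    callbacks = [
        # save weights only when the validation loss improves
        ModelCheckpoint('yolo_text_best.h5', monitor='val_loss', save_best_only=True),
        # stop once the validation loss has stopped decreasing for a while
        EarlyStopping(monitor='val_loss', patience=10),
    ]
    model.fit(X, Y, batch_size=4, epochs=100,
              validation_split=0.1, callbacks=callbacks)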

HOD101s commented 4 years ago


Updating the activation functions as @goldengrisha suggested above worked for me. Thanks!