farizrahman4u / seq2seq

Sequence to Sequence Learning with Keras

using categorical_crossentropy, the loss result is nan #189

Open martin3252 opened 7 years ago

martin3252 commented 7 years ago

Hi guys, I am a beginner and I have a question. When I used categorical_crossentropy, I encountered "loss: nan" at the first epoch. I don't know why; can someone help me? Thanks. Here is my code:

    model = AttentionSeq2Seq(hidden_dim=hidden_dim, output_length=tar_maxlen,
                             output_dim=output_dim, input_shape=(input_maxlen, word_vector_size))
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

    X = np.zeros((len(inside_data) / 2, input_maxlen, word_vector_size), dtype=np.float)
    Y = np.zeros((len(inside_data) / 2, tar_maxlen, vocab_size), dtype=np.bool)

    s_index = 0
    for x in range(len(inside_data) / 2):
        if len(inside_data) % 2 == 0:
            if s_index == (len(inside_data) / 2):
                break
            # print dialog_lines_for_train
            for t_index, token in enumerate(inside_data[s_index]):
                # print dialog_lines_for_train[s_index][:25]
                if t_index == tar_maxlen:
                    break
                X[s_index, t_index] = get_token_vector(token, vector_dict)

            for t_index, token in enumerate(inside_data[s_index + 1]):
                if t_index == tar_maxlen:
                    break
                if token not in word_to_idx:
                    print token
                    continue
                else:
                    # print dialog_lines_for_train[s_index+1][:25]
                    Y[s_index, t_index, word_to_idx[token]] = 1
            s_index = s_index + 2

    model.fit(X, Y, batch_size=batch_size, nb_epoch=5, verbose=1)

The result:

    Epoch 1/5 1416/1416 [==============================] - 3s - loss: nan - acc: 0.7864
    Epoch 2/5 1416/1416 [==============================] - 3s - loss: nan - acc: 0.8039
    Epoch 3/5 1416/1416 [==============================] - 3s - loss: nan - acc: 0.8039
    Epoch 4/5 1416/1416 [==============================] - 3s - loss: nan - acc: 0.8039
    Epoch 5/5 1416/1416 [==============================] - 3s - loss: nan - acc: 0.8039

abhaikollara commented 7 years ago

You're using zeros for X, which causes the multiplication with the weights to yield zero. If you're just testing it out, use np.random.normal instead of np.zeros, for example as sketched below.
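A minimal sketch of that suggestion, assuming the shapes mentioned later in this thread (300-dimensional input vectors, a 483-word one-hot target); the sample count and sequence lengths are placeholders:

    import numpy as np

    # Placeholder sizes for illustration only
    n_samples, input_maxlen, word_vector_size = 1416, 25, 300
    tar_maxlen, vocab_size = 25, 483

    # Random, non-zero dummy inputs instead of np.zeros, for a quick sanity check
    X = np.random.normal(size=(n_samples, input_maxlen, word_vector_size))

    # Random one-hot targets
    Y = np.zeros((n_samples, tar_maxlen, vocab_size), dtype=bool)
    idx = np.random.randint(vocab_size, size=(n_samples, tar_maxlen))
    Y[np.arange(n_samples)[:, None], np.arange(tar_maxlen), idx] = True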

martin3252 commented 7 years ago

@abhaikollara Thanks for the reply. But I do assign X inside the for loop, at X[s_index, t_index] = get_token_vector(token, vector_dict), and get_token_vector returns a 300-dimensional GloVe vector without zeros.

X[0][0] = [-4.20904545e-02 -6.77937273e-02 -2.28589636e-01 -7.88417727e-02 ...], which is a 300-dimensional word vector.
Y[0][0] = [False False False False False False False False False False False False .. True False False], which is a 483-dimensional one-hot vector.
I don't think zero values exist in X. Am I right?

Slyne commented 7 years ago

I found that the activation for the output y in seq2seq is "tanh", which means the output values lie in (-1, 1). For the categorical cross-entropy between predictions and targets, $L_i = -\sum_j t_{i,j} \log(p_{i,j})$, the value p_{i,j} would be in (-1, 1), so the loss can be nan when p_{i,j} falls in (-1, 0], since the log is undefined there. With the softmax activation, however, p_{i,j} is normalized to (0, 1], which is not a problem for categorical_crossentropy.
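A quick numeric illustration of that point (my own sketch, not from the original comment):

    import numpy as np

    t = np.array([0.0, 1.0, 0.0])                    # one-hot target
    p_tanh = np.array([-0.3, 0.8, -0.5])             # a tanh output can be negative
    p_soft = np.exp(p_tanh) / np.exp(p_tanh).sum()   # softmax keeps values in (0, 1]

    print(-np.sum(t * np.log(p_tanh)))   # nan: the log of a negative number is nan
    print(-np.sum(t * np.log(p_soft)))   # a finite, well-defined loss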

Therefore, you may try adding TimeDistributed(Dense(vocab_size)) and a Softmax activation after seq2seq, as sketched below. Or you can modify the source code so that the output activation is customizable. @farizrahman4u
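A rough sketch of that suggestion using a Sequential wrapper, in line with the snippets elsewhere in this thread; the hidden sizes below are hypothetical, while word_vector_size and vocab_size follow the numbers reported above:

    from keras.models import Sequential
    from keras.layers import TimeDistributed, Dense, Activation
    from seq2seq.models import AttentionSeq2Seq

    # Hypothetical sizes for illustration
    input_maxlen, tar_maxlen = 25, 25
    word_vector_size, vocab_size = 300, 483
    hidden_dim, output_dim = 128, 128

    model = Sequential()
    model.add(AttentionSeq2Seq(hidden_dim=hidden_dim, output_length=tar_maxlen,
                               output_dim=output_dim,
                               input_shape=(input_maxlen, word_vector_size)))
    model.add(TimeDistributed(Dense(vocab_size)))  # project each time step onto the vocabulary
    model.add(Activation('softmax'))               # normalize scores to (0, 1] so the log is defined
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])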

inzaghi250 commented 7 years ago

@Slyne @farizrahman4u I have the same question. If I add a Dense layer with softmax activation after seq2seq, how do I make teacher forcing work, i.e., how do I constitute the training input [x, y]?

Slyne commented 7 years ago

@inzaghi250 I guess it may look something like this: [figure attached in the original comment]. x should be the embedding of each input word, and y should be the one-hot vector for each label.

inzaghi250 commented 7 years ago

@Slyne Thanks for the figure. That is actually the one I am using. My question is: in our models, will teacher forcing be problematic? The (x, y) for this model are the input of the Embedding and the output of the Softmax. However, in the current implementation of @farizrahman4u, teacher forcing requires the ground-truth input of the Dense layer as y, to be fed to the next unit, and that quantity is only implicit in our models.

Slyne commented 7 years ago

@inzaghi250 Do you mean we should feed y_t directly into the next layer instead of adding an activation function like 'tanh'? I'm also confused about that. But from the experiments I did, both the result and the loss seemed acceptable...

ChristopherLu commented 7 years ago

@Slyne How do you pass the output to the new network with softmax added? Is it still in the format of [output_length, num_classes]? And did you add softmax in the seq2seq source code, or can we directly append softmax?

Slyne commented 7 years ago

@ChristopherLu Yes, you can add TimeDistributed(Dense(vocab_size)) and then a Softmax activation directly; there is no need to change the source code.

ChristopherLu commented 7 years ago

@Slyne It runs, thanks. Just to confirm: when we add Dense and softmax after seq2seq, how do you set the parameters output_length and output_dim in Seq2Seq(..., output_length=tar_maxlen, output_dim=output_dim, ...)? Is it output_dim == vocab_size?

martin3252 commented 7 years ago

@Slyne It works fine, thanks a lot.

gibrown commented 7 years ago

@martin3252 see PR #199, which lets you set the output activation layer.

@Slyne thanks for the recommendation.

@farizrahman4u should softmax become the default? Or something else to make it less likely folks will run into this problem when using seq2seq?

ChristopherLu commented 7 years ago

@martin3252 @Slyne @gibrown After applying model.add(TimeDistributed(Dense(output_dim))) and model.add(Activation('softmax')) to AttentionSeq2Seq, did you ever run into vanishing gradients (on the TensorFlow backend; I tried Theano as well, and it gives me a training loss of NaN...) as follows:

    Epoch 3/100
    20400/20455 [============================>.] - ETA: 1s - loss: 1.4559 - acc: 0.5113
    Epoch 00002: val_acc improved from 0.52442 to 0.53723, saving model to ./tmp/models/model_entropy_16_0.005_100_20455_0.1_2.hdf5
    20455/20455 [==============================] - 712s - loss: 1.4557 - acc: 0.5114 - val_loss: 1.4287 - val_acc: 0.5372
    Epoch 4/100
    20400/20455 [============================>.] - ETA: 1s - loss: 0.6295 - acc: 0.4340
    Epoch 00003: val_acc did not improve
    20455/20455 [==============================] - 707s - loss: 0.6278 - acc: 0.4339 - val_loss: 0.0000e+00 - val_acc: 0.3672
    Epoch 5/100
    20400/20455 [============================>.] - ETA: 1s - loss: 0.0000e+00 - acc: 0.3610
    Epoch 00004: val_acc did not improve
    20455/20455 [==============================] - 701s - loss: 0.0000e+00 - acc: 0.3610 - val_loss: 0.0000e+00 - val_acc: 0.3672
    Epoch 6/100
    20400/20455 [============================>.] - ETA: 1s - loss: 0.0000e+00 - acc: 0.3610
    Epoch 00005: val_acc did not improve

Is it because the learning rate of my Adam optimizer is not set right (currently set to 0.005)?
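For reference, a hedged sketch of a compile call with Adam left at its default learning rate (0.001), reusing the Sequential wrapper sketched earlier in this thread; the clipnorm guard against exploding gradients is my own assumption, not a fix confirmed anywhere in this issue:

    from keras.optimizers import Adam

    # Default lr instead of 0.005; clipnorm caps the gradient norm as an extra safeguard (assumption)
    model.compile(loss='categorical_crossentropy',
                  optimizer=Adam(clipnorm=1.0),
                  metrics=['accuracy'])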

Slyne commented 7 years ago

@ChristopherLu Which loss function did you try? I used categorical_crossentropy and it worked well..

ChristopherLu commented 7 years ago

@Slyne I used model.compile(loss='categorical_crossentropy', optimizer=Adam(), metrics=['accuracy'])

After setting the learning rate of Adam to its default, the NaN problem is alleviated; however, it still comes up after a hundred epochs.

    Epoch 140/200
    20400/20455 [============================>.] - ETA: 0s - loss: 1.3045 - acc: 0.5588
    Epoch 00139: val_acc did not improve
    20455/20455 [==============================] - 82s - loss: 1.3046 - acc: 0.5587 - val_loss: 1.5386 - val_acc: 0.5147
    Epoch 141/200
    20400/20455 [============================>.] - ETA: 0s - loss: 1.4245 - acc: 0.5226
    Epoch 00140: val_acc did not improve
    20455/20455 [==============================] - 82s - loss: 1.4244 - acc: 0.5226 - val_loss: 1.4046 - val_acc: 0.5257
    Epoch 142/200
    20400/20455 [============================>.] - ETA: 0s - loss: nan - acc: 0.4012
    Epoch 00141: val_acc did not improve
    20455/20455 [==============================] - 81s - loss: nan - acc: 0.4012 - val_loss: nan - val_acc: 0.3672

And the problem is more pronounced when I use AttentionSeq2Seq; the plain Seq2Seq works fine with no NaN interruption.

sculyi commented 7 years ago

@Slyne Hi, I am having trouble adding TimeDistributed(Dense(vocab_size)) and a Softmax activation to the model so I can use the categorical_crossentropy loss. Would you please tell me how to add the above layers? Should I modify models.py and append them to the end of the decoder, or modify cells.py?

amirkhango commented 5 years ago

@Slyne Hi, I am having trouble adding TimeDistributed(Dense(vocab_size)) and a Softmax activation to the model so I can use the categorical_crossentropy loss. Would you please tell me how to add the above layers? Should I modify models.py and append them to the end of the decoder, or modify cells.py?

There is no need to modify the source code; just add them after the Seq2Seq line in your own project file.
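A minimal sketch of that placement, keeping models.py and cells.py untouched; the variables here are the ones defined in the original post at the top of this issue:

    from keras.models import Sequential
    from keras.layers import TimeDistributed, Dense, Activation
    from seq2seq.models import AttentionSeq2Seq

    model = Sequential()
    # your existing Seq2Seq/AttentionSeq2Seq line, unchanged
    model.add(AttentionSeq2Seq(hidden_dim=hidden_dim, output_length=tar_maxlen,
                               output_dim=output_dim,
                               input_shape=(input_maxlen, word_vector_size)))
    # the two extra layers, added right after it in your own project file
    model.add(TimeDistributed(Dense(vocab_size)))
    model.add(Activation('softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])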