martin3252 opened 7 years ago
You're using zeros for X, which will cause the multiplication with the weights to yield zero. If you're just testing things out, use np.random.normal instead of np.zeros.
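A minimal sketch of that suggestion (the shapes below are illustrative placeholders, not values from this thread):

```python
import numpy as np

# Illustrative dimensions -- substitute your own values.
n_samples, input_maxlen, word_vector_size = 100, 20, 300

# Random inputs instead of all-zeros, so the multiplications with the
# weights produce non-zero activations while you test the pipeline.
X = np.random.normal(loc=0.0, scale=1.0,
                     size=(n_samples, input_maxlen, word_vector_size))
```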
@abhaikollara thanks for the reply.
But I assign X inside a for loop with X[s_index, t_index] = get_token_vector(token, vector_dict), and get_token_vector returns a 300-dimensional GloVe vector with no zeros.
X[0][0] = [-4.20904545e-02 -6.77937273e-02 -2.28589636e-01 -7.88417727e-02 ...], which is a 300-dimensional word vector, and Y[0][0] = [False False False ... True False False], which is a 483-dimensional one-hot vector. I don't think any zero values exist in X. Am I right?
I found that the activation for the output y in seq2seq is "tanh", which means the output values lie in (-1, 1).
For the categorical cross-entropy between predictions and targets:
$L_i = -\sum_j t_{i,j} \log(p_{i,j})$
the value p_{i,j} would be in (-1, 1), so the loss may be NaN when p_{i,j} is in (-1, 0].
However, with a softmax activation, the value p_{i,j} is normalized to (0, 1], which is not a problem for categorical_crossentropy.
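As a concrete illustration (the notation follows the loss above; $z_{i,j}$ denotes the raw pre-activation score):

$$\log(p_{i,j}) \text{ is undefined for } p_{i,j} \le 0 \;\Rightarrow\; L_i = \text{NaN}, \qquad \text{whereas softmax gives } p_{i,j} = \frac{e^{z_{i,j}}}{\sum_k e^{z_{i,k}}} \in (0, 1], \text{ so } \log(p_{i,j}) \text{ stays finite.}$$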
Therefore, you may try to add TimeDistributed(Dense(vocab_size)) and a softmax activation after seq2seq. Or you can try to modify the source code so that the output activation can be customized. @farizrahman4u
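A minimal sketch of the first option, assuming your Keras version lets the Seq2Seq model be nested inside a Sequential the way the model.add(...) snippets later in this thread imply; all sizes, and the choice of output_dim, are placeholders:

```python
from keras.models import Sequential
from keras.layers import TimeDistributed, Dense, Activation
from seq2seq.models import Seq2Seq

# Illustrative sizes -- replace with the values from your own preprocessing.
input_maxlen, word_vector_size = 20, 300
tar_maxlen, hidden_dim, vocab_size = 15, 128, 483

model = Sequential()
model.add(Seq2Seq(input_shape=(input_maxlen, word_vector_size),
                  hidden_dim=hidden_dim,
                  output_length=tar_maxlen,
                  output_dim=hidden_dim))   # raw decoder output, projected below
# Project every decoder timestep onto the vocabulary, then normalize with
# softmax so each p_ij lies in (0, 1] and log(p_ij) stays finite.
model.add(TimeDistributed(Dense(vocab_size)))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam',
              metrics=['accuracy'])
```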
@Slyne @farizrahman4u I also have the same question. If I add a Dense layer with softmax activation after seq2seq, how to make teacher-forcing work, e.g., constitute the training input[x, y]?
@inzaghi250 I guess it may look something like this: x should be the embedding of each input word, and y should be the one-hot vector for each label.
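A minimal sketch of arrays shaped that way (all dimensions and the embedding_lookup / target_ids names are hypothetical placeholders, not part of the library):

```python
import numpy as np

# Illustrative dimensions -- replace with your own.
n_samples, input_maxlen, embed_dim = 1000, 20, 300
tar_maxlen, vocab_size = 15, 483

# x: one embedding vector per input token (e.g. a GloVe lookup).
x = np.zeros((n_samples, input_maxlen, embed_dim), dtype=np.float32)
# y: one one-hot row per target token.
y = np.zeros((n_samples, tar_maxlen, vocab_size), dtype=np.float32)

# These containers are then filled from your tokenized data, e.g.:
# x[i, t] = embedding_lookup(source_tokens[i][t])   # hypothetical helper
# y[i, t, target_ids[i][t]] = 1.0
```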
@Slyne Thanks for the figure. I am actually using this one. My question is: in our models, will teacher forcing be problematic? The (x, y) for this model are the input of the Embedding and the output of the Softmax. However, in the current implementation by @farizrahman4u, teacher forcing requires the ground-truth input of the Dense layer as y to feed into the next unit, and that quantity is only implicit in our models.
@inzaghi250 Do you mean we should feed y_t directly into the next layer instead of adding an activation function like 'tanh'? I'm also confused about that. But from the experiments I did, the results and the loss were both acceptable...
@Slyne How do you pass the output to the new layers added with softmax? Is it still in the format of [length_output, num_classes]? And did you add the softmax in the seq2seq source code, or can we directly append it?
@ChristopherLu Yes, you can add TimeDistributed(Dense(vocab_size)) and then a Softmax activation directly; no need to change the source code.
@Slyne It runs, thx.
Just to confirm: in the case where we add the Dense and softmax after seq2seq, how do you set the parameters output_length and output_dim in Seq2Seq(..., output_length=tar_maxlen, output_dim=output_dim, ...)? Is output_dim == vocab_size?
@Slyne it works fine, thanks a lot
@martin3252 see PR #199, which lets you set the output activation layer.
@Slyne thanks for the recommendation.
@farizrahman4u should softmax become the default? Or something else to make it less likely folks will run into this problem when using seq2seq?
@martin3252 @Slyne @gibrown
After applying model.add(TimeDistributed(Dense(output_dim))) and model.add(Activation('softmax')) to AttentionSeq2Seq, did you guys ever run into a vanishing-gradient problem (on the TensorFlow backend; I tried Theano as well and it gives me a training loss of NaN) like the following:
Epoch 3/100
20400/20455 [============================>.] - ETA: 1s - loss: 1.4559 - acc: 0.5113
Epoch 00002: val_acc improved from 0.52442 to 0.53723, saving model to ./tmp/models/model_entropy_16_0.005_100_20455_0.1_2.hdf5
20455/20455 [==============================] - 712s - loss: 1.4557 - acc: 0.5114 - val_loss: 1.4287 - val_acc: 0.5372
Epoch 4/100
20400/20455 [============================>.] - ETA: 1s - loss: 0.6295 - acc: 0.4340
Epoch 00003: val_acc did not improve
20455/20455 [==============================] - 707s - loss: 0.6278 - acc: 0.4339 - val_loss: 0.0000e+00 - val_acc: 0.3672
Epoch 5/100
20400/20455 [============================>.] - ETA: 1s - loss: 0.0000e+00 - acc: 0.3610
Epoch 00004: val_acc did not improve
20455/20455 [==============================] - 701s - loss: 0.0000e+00 - acc: 0.3610 - val_loss: 0.0000e+00 - val_acc: 0.3672
Epoch 6/100
20400/20455 [============================>.] - ETA: 1s - loss: 0.0000e+00 - acc: 0.3610
Epoch 00005: val_acc did not improve
Is it because the learning rate of my Adam optimizer is not set properly (currently 0.005)?
@ChristopherLu Which loss function did you try? I used categorical_crossentropy and it worked well.
@Slyne
I used model.compile(loss='categorical_crossentropy', optimizer=Adam(), metrics=['accuracy'])
After setting the learning rate of Adam back to the default, the NaN problem is alleviated; however, it still comes up after about a hundred epochs.
Epoch 140/200
20400/20455 [============================>.] - ETA: 0s - loss: 1.3045 - acc: 0.5588
Epoch 00139: val_acc did not improve
20455/20455 [==============================] - 82s - loss: 1.3046 - acc: 0.5587 - val_loss: 1.5386 - val_acc: 0.5147
Epoch 141/200
20400/20455 [============================>.] - ETA: 0s - loss: 1.4245 - acc: 0.5226
Epoch 00140: val_acc did not improve
20455/20455 [==============================] - 82s - loss: 1.4244 - acc: 0.5226 - val_loss: 1.4046 - val_acc: 0.5257
Epoch 142/200
20400/20455 [============================>.] - ETA: 0s - loss: nan - acc: 0.4012
Epoch 00141: val_acc did not improve
20455/20455 [==============================] - 81s - loss: nan - acc: 0.4012 - val_loss: nan - val_acc: 0.3672
And this problem is more pronounced when I use AttentionSeq2Seq; the general Seq2Seq works fine with no NaN interruptions.
@Slyne Hi, I'm having trouble adding TimeDistributed(Dense(vocab_size)) and a Softmax activation to the model so I can use the categorical_crossentropy loss. Could you please tell me how to add these layers? Should I modify models.py and append them to the end of the decoder, or modify cells.py?
There is no need to modify the source code; just add them after the Seq2Seq line in your own project file.
Hi guys, I am a beginner and I have a question: when I used categorical_crossentropy, I got "loss: nan" at the first epoch. I don't know why; can someone help me? Thanks.
model = AttentionSeq2Seq(hidden_dim=hidden_dim, output_length=tar_maxlen, output_dim=output_dim, input_shape=(input_maxlen, word_vector_size))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
X = np.zeros((len(inside_data)/2, input_maxlen, word_vector_size), dtype=np.float)
Y = np.zeros((len(inside_data)/2, tar_maxlen, vocab_size), dtype=np.bool)
s_index = 0
model.fit(X, Y, batch_size=batch_size, nb_epoch=5, verbose=1)
the result : Epoch 1/5 1416/1416 [==============================] - 3s - loss: nan - acc: 0.7864
Epoch 2/5 1416/1416 [==============================] - 3s - loss: nan - acc: 0.8039
Epoch 3/5 1416/1416 [==============================] - 3s - loss: nan - acc: 0.8039
Epoch 4/5 1416/1416 [==============================] - 3s - loss: nan - acc: 0.8039
Epoch 5/5 1416/1416 [==============================] - 3s - loss: nan - acc: 0.8039