datalogue / keras-attention

Visualizing RNNs using the attention mechanism
https://medium.com/datalogue/attention-in-keras-1892773a4f22
GNU Affero General Public License v3.0

Vanishing Gradient Problem Occurred During Training #38

Open bright1993ff66 opened 5 years ago

bright1993ff66 commented 5 years ago

Hi, I am new to the attention mechanism and I found your code and tutorials very helpful for beginners like me!

Currently, I am trying to use your attention decoder for sentiment analysis on the Sentiment140 dataset. I have constructed the following BiLSTM-with-attention model to classify positive and negative tweets:

from keras.layers import Input, LSTM, Bidirectional
from keras.models import Model
from models.custom_recurrents import AttentionDecoder  # custom layer from this repository (import path assumed)

def get_bi_lstm_with_attention_model(timesteps, features):
    input_shape = (timesteps, features)
    inputs = Input(shape=input_shape, dtype='float32')
    # BiLSTM encoder: return the full hidden-state sequence for the attention decoder
    enc = Bidirectional(LSTM(100, dropout=0.2, recurrent_dropout=0.2, return_sequences=True, input_shape=input_shape),
                        merge_mode='concat', name='bidirectional_1')(inputs)
    # Attention decoder with a single sigmoid output dimension
    y_hat = AttentionDecoder(units=100, output_dim=1, name='attention_decoder_1', activation='sigmoid')(enc)
    bilstm_attention_model = Model(inputs=inputs, outputs=y_hat)
    bilstm_attention_model.compile(loss='binary_crossentropy', optimizer='rmsprop', metrics=['accuracy'])
    return bilstm_attention_model

However, when I use this model to fit my training data (a 1280000×50 matrix, batch_size=128; I first reshape the data to (int(1280000/5), 5, 50), following the rule input_shape = (batch_size, time_steps, input_dim)), the accuracy is very low (around 50%). My BiLSTM **without** attention model reaches at least 80% accuracy with the same hyperparameter settings. Hence my question: what's wrong with my current BiLSTM-with-attention model? I think it is a vanishing gradient problem. I would really appreciate it if anyone could give me some guidelines on how to deal with this issue. Thank you very much!
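
(For reference, a minimal sketch of the reshape described above; the random matrix is only a stand-in for the real 1280000×50 feature matrix:)

import numpy as np

# Stand-in for the preprocessed Sentiment140 features: 1,280,000 rows x 50 columns
X = np.random.rand(1280000, 50).astype('float32')

# Group every 5 consecutive rows into one sequence of shape (time_steps, input_dim)
timesteps, features = 5, 50
X_seq = X.reshape(len(X) // timesteps, timesteps, features)
print(X_seq.shape)  # (256000, 5, 50)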

bright1993ff66 commented 5 years ago

Quick Update:

I tried adding regularizers, both a kernel regularizer and an activity regularizer:

kernel_regularizer=regularizers.l2(0.01),
activity_regularizer=regularizers.l1(0.01),
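
(For reference, one place these could be attached is on the encoder LSTM layer itself; this placement is an assumption, since the updated model below does not show it:)

from keras import regularizers
from keras.layers import LSTM

# Assumed placement of the regularizers on the encoder LSTM
lstm = LSTM(100, dropout=0.2, recurrent_dropout=0.2, return_sequences=True,
            kernel_regularizer=regularizers.l2(0.01),    # L2 penalty on the input kernel weights
            activity_regularizer=regularizers.l1(0.01))  # L1 penalty on the layer's outputs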

I also added BatchNormalization after the BiLSTM encoder and changed the activation function to ReLU. The updated model is given below:

from keras.layers import Input, LSTM, Bidirectional, BatchNormalization
from keras.models import Model
from models.custom_recurrents import AttentionDecoder  # custom layer from this repository (import path assumed)

def get_bi_lstm_with_attention_model(timesteps, features):
    input_shape = (timesteps, features)
    inputs = Input(shape=input_shape, dtype='float32')
    enc = Bidirectional(LSTM(100, dropout=0.2, recurrent_dropout=0.2, return_sequences=True, input_shape=input_shape),
                        merge_mode='concat', name='bidirectional_1')(inputs)
    # New: normalize the encoder outputs before the attention decoder
    normalized = BatchNormalization()(enc)
    # New: ReLU activation instead of sigmoid
    y_hat = AttentionDecoder(units=100, output_dim=1, name='attention_decoder_1', activation='relu')(normalized)
    bilstm_attention_model = Model(inputs=inputs, outputs=y_hat)
    bilstm_attention_model.compile(loss='binary_crossentropy', optimizer='rmsprop', metrics=['accuracy'])
    return bilstm_attention_model

But the accuracy still fluctuates around 50%. Any help or insights would be appreciated!

zafarali commented 5 years ago

Hi!

You can explicitly check for vanishing gradients by inspecting the gradients directly. Your sequence length does seem long, but not so long that an LSTM should fail, especially since the regular version works.
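
(For example, something like the following rough sketch, assuming the Keras 2 / TensorFlow 1 backend this repository targets; gradient_norms and its arguments are placeholder names, and y_batch must have the same shape as the model's output:)

import numpy as np
from keras import backend as K
from keras.losses import binary_crossentropy

def gradient_norms(model, x_batch, y_batch):
    """Return the L2 norm of the loss gradient for every trainable weight tensor."""
    # Symbolic target placeholder matching the model's output shape
    y_true = K.placeholder(shape=K.int_shape(model.output))
    loss = K.mean(binary_crossentropy(y_true, model.output))
    grads = K.gradients(loss, model.trainable_weights)
    # Feed learning_phase=0 so dropout is switched off while measuring gradients
    get_grads = K.function([model.input, y_true, K.learning_phase()], grads)
    grad_values = get_grads([x_batch, y_batch, 0])
    return {w.name: float(np.linalg.norm(g)) for w, g in zip(model.trainable_weights, grad_values)}

If the norms of the earliest layers are orders of magnitude smaller than those near the output, that would point to vanishing gradients.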

Do you have masking in your sequences?


Tlchan99 commented 5 years ago

Hi Zafarali, I ran into the same issue as bright1993ff66. I use your attention decoder for Portuguese/Chinese and English/Chinese machine translation, where the input and output sequences have different lengths (e.g. English = max. 13 words, Chinese = max. 10 words). There is no improvement over the plain BiLSTM model at all in BLEU-1/2/3/4, even after many long GPU runs. It seems something is wrong with the stepping: starting from the initial state, I am not sure whether the context vector is mapped correctly between the encoder time steps and the decoder time steps (which have a different maximum length).
I've been working on this from scratch for a month but cannot figure it out. Could you explain a bit more about 1) how get_initial_state works and 2) what the step function does?

I would really appreciate any help in solving this puzzle. Thanks in advance.

Tlchan99 commented 5 years ago

Hi Zafarali,

In detail, I translate Portuguese (max. 14 words, zero-padded if shorter) to Chinese (max. 15 words, also padded). After a generic embedding layer, the problem is that the encoder output of the BiLSTM layer must go through a RepeatVector layer with sequence length = 15 (the target_timesteps length, not the input length of 14; otherwise I get an error like "expected a tensor of (15, 256), got (14, 256)", where 256 is the word dimension. Setting return_sequences=True on the LSTM layer gives the same error.)
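
(For reference, a hedged sketch of the layout described above, using the padded lengths and 256-dimensional word vectors from this comment; the vocabulary sizes and the AttentionDecoder import path are assumptions, and this only illustrates the tensor shapes, not the decoder's internal stepping:)

from keras.layers import Input, Embedding, LSTM, Bidirectional, RepeatVector
from keras.models import Model
from models.custom_recurrents import AttentionDecoder  # custom layer from this repository (import path assumed)

src_timesteps, tgt_timesteps = 14, 15             # padded Portuguese / Chinese lengths
src_vocab, tgt_vocab, emb_dim = 5000, 5000, 256   # vocabulary sizes are placeholders

src = Input(shape=(src_timesteps,))
emb = Embedding(src_vocab, emb_dim)(src)
enc = Bidirectional(LSTM(128))(emb)               # final state only: one 256-dim summary vector per sentence
dec_in = RepeatVector(tgt_timesteps)(enc)         # repeat it once per *target* timestep (15, not 14)
y_hat = AttentionDecoder(units=256, output_dim=tgt_vocab, activation='softmax')(dec_in)

model = Model(src, y_hat)
model.compile(loss='categorical_crossentropy', optimizer='adam')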

I think there is a context-vector mismatch: it should be computed over the 14 input timesteps (the source words), not over the 15 timesteps that are actually passed to the attention decoder layer as the decoder input. So I want to know how you compute the initial state of y0 and s0, and how ytm, stm, and ci are updated, to make sure they are looped correctly. I believe this is the reason why there is little or no performance gain from the attention decoder layer in my case. I would appreciate it if you could shed some light on this. Thanks.

Tom Chan.