datalogue / keras-attention

Visualizing RNNs using the attention mechanism
https://medium.com/datalogue/attention-in-keras-1892773a4f22
GNU Affero General Public License v3.0

Attention on different input and output length #14

Open aayushee opened 6 years ago

aayushee commented 6 years ago

Hello,

Thanks a lot for providing an easy-to-understand tutorial and attention layer implementation. I am trying to use attention on a dataset with different input and output lengths. Each training input sequence has size 600×4 (600 four-dimensional points) and each one-hot-encoded output has size 70×66 (a 70-symbol sequence, with each symbol one-hot encoded over 66 classes). I have to map the 600-point sequences to the 70-symbol sequences for ~15000 such pairs. Right after the LSTM layer I tried using a RepeatVector with the output length on a small dataset, since I read that RepeatVector is used in encoder-decoder models where the output and input sequences are not of the same length. Here is what I tried:

```python
from keras.layers import Input, LSTM, Bidirectional, RepeatVector
from models.custom_recurrents import AttentionDecoder  # this repo's attention layer

# x_train.shape = (50, 600, 4)
# y_train.shape = (50, 70, 66)
inputs = Input(shape=(x_train.shape[1:]))
rnn_encoded = Bidirectional(LSTM(32, return_sequences=False),
                            name='bidirectional_1',
                            merge_mode='concat',
                            trainable=True)(inputs)
encoded = RepeatVector(y_train.shape[1])(rnn_encoded)
y_hat = AttentionDecoder(70,
                         name='attention_decoder_1',
                         output_dim=y_train.shape[2],
                         return_probabilities=False,
                         trainable=True)(encoded)
```

But the prediction from this model always gives the same symbols in the output sequence after every run:

```
('decoded model output:',
 ['d', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I',
  'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I',
  'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I',
  'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I',
  'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I',
  'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I',
  'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I'])

('decoded original output:',
 ['A', ' ', 'M', 'O', 'V', 'E', ' ', 't', 'o', ' ',
  's', 't', 'o', 'p', ' ', 'M', 'r', ' ', '.', ' ',
  'G', 'a', 'i', 't', 's', 'k', 'e', 'l', 'l', '0',
  '0', '0', '0', '0', '0', '0', '0', '0', '0', '0',
  '0', '0', '0', '0', '0', '0', '0', '0', '0', '0',
  '0', '0', '0', '0', '0', '0', '0', '0', '0', '0',
  '0', '0', '0', '0', '0', '0', '0', '0', '0', '0'])
```

Could you please give me an idea of where I am going wrong and what I could do to solve the problem? Any help would be much appreciated.

Thanks Aayushee

pgyrya commented 6 years ago

Hello, Aayushee -

I've practiced with this library a bit and ultimately made it work for my practice project (though it does conflict with later versions of Keras, as the other issue about the time-distributed dense layer suggests). I have also seen similarly very imperfect translations at some point, when the model was not well tuned, but I was able to make it work eventually.

Notice that the first and second symbols in your translations are different, so your model is technically able to generate varying outputs. Perhaps the model has simply not learned the right translations yet? With long sequences, the optimization landscape over the model's parameters may be too complex (e.g. have high curvature) to be learned quickly. I chose to stick with words rather than individual symbols for the output encoding, to shorten the sequence length and facilitate learning.

Could you confirm what happens if you run the optimization further? Do you see the loss improving substantially as you train? I suggest using a relatively small learning rate and going through many iterations of gradient descent to see whether you can notice an improvement.
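A minimal sketch of that suggestion, assuming the `model`, `x_train`, and `y_train` from the original post; the Adam optimizer, the 1e-4 learning rate, and the epoch/batch values are illustrative assumptions, not values from this thread:

```python
# Hypothetical training setup for pgyrya's suggestion: a small learning rate
# and many epochs, watching whether the loss keeps decreasing.
from keras.optimizers import Adam

model.compile(loss='categorical_crossentropy',
              optimizer=Adam(lr=1e-4),           # small learning rate (assumed value)
              metrics=['accuracy'])
history = model.fit(x_train, y_train,
                    epochs=200, batch_size=16,   # assumed values
                    validation_split=0.1, verbose=2)
# history.history['loss'] / ['val_loss'] show whether training is still improving.
```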

chungfu27 commented 6 years ago

Hi Aayushee, if you use "return_sequences=False" together with RepeatVector, the encoder LSTM will always feed the same hidden vector into the decoder LSTM. The attention mechanism needs "return_sequences=True" so the encoder returns the hidden vector from every timestep, which lets the decoder calculate a different weighted-sum (context) vector at each of its timesteps.
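A minimal sketch of that setup, assuming the layer shipped in this repo (the `models/custom_recurrents.py` import path) and the shapes from the original post; note that without RepeatVector this decoder emits one output step per encoder step, which is exactly the length limitation discussed in the rest of this thread:

```python
# Encoder that keeps per-timestep hidden states so attention has something
# to attend over at every decoder step.
from keras.layers import Input, LSTM, Bidirectional
from keras.models import Model
from models.custom_recurrents import AttentionDecoder  # this repo's layer

n_steps, n_features, n_labels = 600, 4, 66   # shapes from the original post

inputs = Input(shape=(n_steps, n_features))
# return_sequences=True -> one hidden vector per encoder timestep,
# which the attention weights are computed over.
encoded = Bidirectional(LSTM(32, return_sequences=True),
                        merge_mode='concat')(inputs)
# Caveat: this AttentionDecoder produces one output per input timestep,
# so y_hat here has length n_steps (600), not 70.
y_hat = AttentionDecoder(70, output_dim=n_labels)(encoded)

model = Model(inputs, y_hat)
model.compile(loss='categorical_crossentropy', optimizer='adam')
```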

ghost commented 6 years ago

@chungfu27 If return_sequences is set to True together with RepeatVector, you get this error before the output even reaches the decoder:

ValueError: Input 0 is incompatible with layer repeat_vector_1: expected ndim=2, found ndim=3
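For clarity, a minimal sketch of where that mismatch comes from (the shapes are the ones from the original post):

```python
# RepeatVector expects a 2D (batch, features) input, but an encoder with
# return_sequences=True produces a 3D (batch, timesteps, features) tensor.
from keras.layers import Input, LSTM, RepeatVector

inputs = Input(shape=(600, 4))
seq = LSTM(32, return_sequences=True)(inputs)   # shape (None, 600, 32), ndim=3
RepeatVector(70)(seq)                           # raises: expected ndim=2, found ndim=3
```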

Kaustubh1Verma commented 5 years ago

Yeah @chungfu27, that's right. As ghost said, with return_sequences set to True it isn't possible to use RepeatVector, which leaves this layer incompatible with different input and output lengths.

NehaTamore commented 5 years ago

Yeah, @chungfu27, it doesn't make sense to set return_sequences to False. But have we found any workaround to implement attention with different input and output lengths? I'm working on abstractive summarization, and it looks like we could concatenate zeros to match the encoder_output and decoder_output lengths. Will that work? Also, has anyone found possible reasons for the repeating words/characters in the inference model? Many thanks!
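A rough, untested sketch of the zero-padding idea mentioned above, using the shapes from the original post (the placeholder arrays are assumptions); whether it actually trains well is exactly the open question here:

```python
# Pad the target sequences along the time axis with all-zero vectors so they
# match the encoder length, allowing the same-length AttentionDecoder setup.
# A dedicated padding symbol (like the '0' class in the original post) may be
# preferable to plain zero vectors.
import numpy as np

x_train = np.random.rand(50, 600, 4).astype('float32')   # placeholder inputs
y_train = np.zeros((50, 70, 66), dtype='float32')         # placeholder one-hot targets

pad_steps = x_train.shape[1] - y_train.shape[1]            # 600 - 70 = 530
padding = np.zeros((y_train.shape[0], pad_steps, y_train.shape[2]), dtype='float32')
y_padded = np.concatenate([y_train, padding], axis=1)      # -> (50, 600, 66)
```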