asmekal / keras-monotonic-attention

seq2seq attention in keras
GNU Affero General Public License v3.0

Internal Embeddings #5

Closed ELind77 closed 5 years ago

ELind77 commented 6 years ago

Hi,

I think it's great that you enhanced the original model to allow teacher forcing. Nice work.

I'm not sure I understand the "internal embeddings" though. Can you explain how those are used?

To give some context, I am working with a Seq2Seq model that is very similar to a standard NMT model, and I want to add Bahdanau-style attention. My understanding is that at each time step the decoder should get the concatenation of the weighted sum of the encoder states (as created by the attention mechanism) and the embedding of the previous token (when using teacher forcing). The output of the decoder at each time step is its hidden state, which is then fed into a softmax layer. But your code seems to be doing something a bit different. Can you elaborate on what's going on and/or correct my understanding?
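
For concreteness, here is a minimal NumPy sketch of the single decoding step I have in mind. All weight names, shapes, and the plain tanh update standing in for the GRU/LSTM cell are illustrative assumptions, not taken from this repository:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
T_enc, enc_dim, dec_dim, emb_dim, vocab = 10, 32, 64, 16, 100

enc_states = rng.normal(size=(T_enc, enc_dim))   # encoder outputs h_1..h_T
s_prev     = rng.normal(size=(dec_dim,))         # previous decoder state
y_prev_emb = rng.normal(size=(emb_dim,))         # embedding of the previous (teacher-forced) token

# Bahdanau (additive) attention: score_t = v_a^T tanh(W_a s_prev + U_a h_t)
W_a = rng.normal(size=(dec_dim, dec_dim))
U_a = rng.normal(size=(dec_dim, enc_dim))
v_a = rng.normal(size=(dec_dim,))
scores  = np.tanh(enc_states @ U_a.T + s_prev @ W_a.T) @ v_a   # (T_enc,)
alpha   = softmax(scores)                                      # attention weights
context = alpha @ enc_states                                   # weighted sum of encoder states

# Decoder input at this step = [context ; embedding of previous token]
x_t = np.concatenate([context, y_prev_emb])

# Plain tanh update standing in for the GRU/LSTM cell
W_x = rng.normal(size=(dec_dim, enc_dim + emb_dim))
W_s = rng.normal(size=(dec_dim, dec_dim))
s_t = np.tanh(W_x @ x_t + W_s @ s_prev)          # new hidden state

# Hidden state -> softmax layer over the vocabulary
W_o = rng.normal(size=(vocab, dec_dim))
probs = softmax(W_o @ s_t)
print(probs.shape, probs.sum())                  # (100,), sums to 1
```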

-- Eric

asmekal commented 6 years ago

Your understanding of how it should work is absolutely correct. It actually works the same way, just inside one layer - so AttentionDecoder, despite being a layer, is really more like a small network.

It takes as input a feature sequence (typically the encoder outputs) and, when teacher forcing, the true labels (which are token ids, not vectors/embeddings). At each timestep it produces a hidden state, which is immediately used to compute output logits, and after a softmax we get the actual output. So the whole process happens inside the layer: it does not return the hidden state, it returns the output token_id.
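
Schematically, the flow inside the layer looks like the sketch below. This is an illustrative plain-NumPy sketch, not the actual layer code: the attention context is replaced by a simple mean over the encoder states to keep the loop short, all weight names are made up, and it runs in free-running mode where the layer feeds back its own predictions:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_decoder(enc_states, embeddings, W_x, W_s, W_o, n_steps):
    """Illustrative only: the whole step happens inside this one "layer".
    The hidden state `s` never leaves it; what comes out are the predicted
    token ids (or, equivalently, the per-step softmax distributions)."""
    s = np.zeros(W_s.shape[0])
    y_prev = 0                                    # assumed <start> token id
    outputs = []
    for _ in range(n_steps):
        context = enc_states.mean(axis=0)         # stand-in for the attention context
        x = np.concatenate([context, embeddings[y_prev]])  # "internal embedding" lookup
        s = np.tanh(W_x @ x + W_s @ s)            # hidden state, kept internal
        probs = softmax(W_o @ s)                  # logits -> softmax, still inside the layer
        y_prev = int(np.argmax(probs))            # previous output for the next step
        outputs.append(y_prev)
    return np.array(outputs)                      # token ids, not hidden states

rng = np.random.default_rng(0)
enc_dim, dec_dim, emb_dim, vocab = 8, 16, 4, 20
W_x = rng.normal(size=(dec_dim, enc_dim + emb_dim))
W_s = rng.normal(size=(dec_dim, dec_dim))
W_o = rng.normal(size=(vocab, dec_dim))
embeddings = rng.normal(size=(vocab, emb_dim))    # embedding matrix owned by the layer
print(attention_decoder(rng.normal(size=(6, enc_dim)), embeddings, W_x, W_s, W_o, n_steps=4))
```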

The output at the next timestep depends on the previous one, which is some token_id. The "internal embeddings" are used to get the actual embedding from the previous output token or, in the case of teacher forcing, from the second input of the layer.
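
In other words, something like this illustrative helper (not the repo's actual code; `E` stands for the layer-owned embedding matrix and the function name is made up):

```python
import numpy as np

def previous_token_embedding(E, t, y_pred_prev, y_true=None):
    """Embedding fed into timestep t: the teacher-forced label from the layer's
    second input when it is given, otherwise the layer's own previous prediction."""
    prev_id = int(y_true[t - 1]) if (y_true is not None and t > 0) else int(y_pred_prev)
    return E[prev_id]

E = np.arange(12, dtype=float).reshape(4, 3)          # toy 4-token, 3-dim embedding table
print(previous_token_embedding(E, t=2, y_pred_prev=1, y_true=np.array([0, 3, 2])))  # row E[3]
print(previous_token_embedding(E, t=2, y_pred_prev=1))                              # row E[1]
```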