gabrielloye / Attention_Seq2seq-Translation


Bug: In Luong Decoder #3

Open thakursc1 opened 4 years ago

thakursc1 commented 4 years ago

If I understand the logic correctly, then in the Luong Decoder's forward function:

def forward(self, inputs, hidden, encoder_outputs):
    # Embed input words
    embedded = self.embedding(inputs).view(1, 1, -1)
    embedded = self.dropout(embedded)

    # Passing previous output word (embedded) and hidden state into LSTM cell
    lstm_out, hidden = self.lstm(embedded, hidden)

    # Calculating Alignment Scores - see Attention class for the forward pass function
    alignment_scores = self.attention(lstm_out, encoder_outputs)
    # Softmaxing alignment scores to obtain Attention weights
    attn_weights = F.softmax(alignment_scores.view(1, -1), dim=1)

    # Multiplying Attention weights with encoder outputs to get context vector
    context_vector = torch.bmm(attn_weights.unsqueeze(0), encoder_outputs)

    # Concatenating output from LSTM with context vector
    output = torch.cat((lstm_out, context_vector), -1)
    # Pass concatenated vector through Linear layer acting as a Classifier
    output = F.log_softmax(self.classifier(output[0]), dim=1)
    return output, hidden, attn_weights

alignment_scores = self.attention(lstm_out, encoder_outputs)

Shouldn't we pass hidden instead of lstm_out to self.attention?

sayakpaul commented 4 years ago

Same question.

Also, the encoder outputs are passed, while in the blog post it's mentioned:

> The first one is the dot scoring function. This is the simplest of the functions; to produce the alignment score we only need to take the hidden states of the encoder and multiply them by the hidden state of the decoder.

I see a potential disconnect here.

gabrielloye commented 4 years ago

> Same question.
>
> Also, the encoder outputs are passed, while in the blog post it's mentioned:
>
> The first one is the dot scoring function. This is the simplest of the functions; to produce the alignment score we only need to take the hidden states of the encoder and multiply them by the hidden state of the decoder.
>
> I see a potential disconnect here.

Hi @sayakpaul, sorry for the confusion; perhaps the naming convention is misleading.

According to the docs, the PyTorch LSTM returns (output, (h_n, c_n)):

Therefore, when we call encoder_outputs, (h_n, c_n) = encoder(inp, h), we'll see that encoder_outputs has shape (batch, seq_len, hidden_size) (the encoder LSTM is built with batch_first=True), i.e. the hidden state at every time step, while h_n has shape (num_layers, batch, hidden_size), i.e. only the final hidden state.

If we run torch.equal(encoder_outputs[:, -1, :], h_n[0]), it will return True, which confirms that the last time step of encoder_outputs is essentially the same hidden state as h_n.
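Here's a minimal, self-contained sketch of that check (the input size, hidden size and sequence length below are arbitrary placeholders, not the values used in the notebook):

import torch
import torch.nn as nn

# Standalone check of the claim above (not the notebook's encoder class):
# a single-layer, unidirectional LSTM with batch_first=True.
input_size, hidden_size, seq_len = 128, 256, 10
encoder = nn.LSTM(input_size, hidden_size, batch_first=True)

inp = torch.randn(1, seq_len, input_size)               # one sentence of 10 "word vectors"
encoder_outputs, (h_n, c_n) = encoder(inp)

print(encoder_outputs.shape)                             # torch.Size([1, 10, 256]) - hidden state at every time step
print(h_n.shape)                                         # torch.Size([1, 1, 256])  - final hidden state only
print(torch.equal(encoder_outputs[:, -1, :], h_n[0]))    # True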

In our case, we are interested in the hidden states of the encoder LSTM at every time step when calculating our attention weights, so we have to use the output from the LSTM (which is essentially the hidden state at every time step) rather than hidden, which only carries the final time step.
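To make that concrete, here is a rough sketch of what a dot scoring function does with those per-time-step encoder hidden states (an illustrative stand-in, not the notebook's actual Attention class):

import torch

def dot_score(decoder_hidden, encoder_outputs):
    # decoder_hidden:  (1, 1, hidden_size)       - decoder hidden state at the current step
    # encoder_outputs: (1, seq_len, hidden_size) - encoder hidden state at *every* time step
    # Returns one alignment score per encoder time step: (1, seq_len)
    return encoder_outputs.bmm(decoder_hidden.transpose(1, 2)).squeeze(-1)

enc_out = torch.randn(1, 10, 256)
dec_hidden = torch.randn(1, 1, 256)
print(dot_score(dec_hidden, enc_out).shape)  # torch.Size([1, 10]) - one score per source position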

> If I understand the logic correctly, then in the Luong Decoder's forward function:
>
> alignment_scores = self.attention(lstm_out, encoder_outputs)
>
> Shouldn't we pass hidden instead of lstm_out to self.attention?

Hi @thakursc1, I apologise for the confusion as well.

Yes, the hidden state out of the LSTM should be passed into the attention module, and I've modified my code to be self.attention(hidden[0], encoder_outputs). I believe the reason it still worked is that lstm_out is essentially the hidden state of every time step of the LSTM, as explained above, and since the sequence length passed into the decoder LSTM is 1, lstm_out and hidden carry the same values. (If we run torch.equal(hidden[0], lstm_out) in the forward call of the Luong Decoder, we'll see that the values are the same.)
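A minimal standalone check of that last point (the sizes here are arbitrary placeholders, not the notebook's values):

import torch
import torch.nn as nn

# With a sequence length of 1 (one decoded word per forward call), the LSTM
# output and the final hidden state carry exactly the same values.
lstm = nn.LSTM(input_size=256, hidden_size=256, batch_first=True)
embedded = torch.randn(1, 1, 256)        # a single embedded word, as in the decoder
lstm_out, hidden = lstm(embedded)        # hidden is the tuple (h_n, c_n)

print(torch.equal(hidden[0], lstm_out))  # True: both are (1, 1, 256)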

sayakpaul commented 4 years ago

Thanks for clarifying it in such detail, @gabrielloye. If you could update your already great blog post with these clarifications, I think it would be more complete.