I didn't even know Cho changed that in this code...
"Last hidden state" is a bit confusing since we use a bidirectional RNN. Which one do you mean? If you ask me, we should use the last hidden state of the RNN that runs from right-to-left. Google saw a huge improvement when they switched the order of the input sentence (i.e. they read the source from right to left, and then produce the target left to right). The intuition here is that when generating the beginning of the sentence in the target language, you want to have the beginning of the source sentence fresh in memory.
I realized that the second option I mentioned is implemented in the code slightly differently from what I initially thought.
If I understood correctly, the second option feeds the last states of both the left-to-right and right-to-left chains into the decoder, meaning that the backward chain actually operates on the reversed sentence while the forward RNN summarizes the original sequence from left to right.
What the backward RNN does is exactly the trick the Google authors used in their paper.
Just looked at the code, and yeah, it seems to use the last states of both RNNs. That's probably what you want to do. You definitely want the last state of the right-to-left encoder. The last hidden state of the left-to-right encoder might not be as useful, but your network can always learn to ignore it (assuming we can optimize it well).
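To make the wiring concrete, here's a minimal numpy sketch of that option (shapes and storage order are assumptions on my part; the real code is Theano):

```python
import numpy as np

# Hypothetical shapes: T source tokens, hidden size H. We assume each
# array is stored in the order the RNN computed it, so index -1 is that
# direction's final state.
T, H = 10, 4
h_fwd = np.random.randn(T, H)  # left-to-right states; h_fwd[-1] has seen the whole sentence
h_bwd = np.random.randn(T, H)  # right-to-left states; h_bwd[-1] has seen it in reverse

# Concatenate the two final states into one fixed-size context vector.
# h_bwd[-1] corresponds to the first source word, i.e. the Google-style
# right-to-left summary; h_fwd[-1] is the part the network can learn to ignore.
context = np.concatenate([h_fwd[-1], h_bwd[-1]])  # shape (2 * H,)
```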
I definitely agree we should switch the defaults in master, but maybe wait a day or two so that the multi-GPU experiments that @anirudh9119 and I are running aren't affected.
As you may know, the initial hidden state of the decoder RNN is a non-linear transformation of the summary information that the encoder RNN provides (link).
The simplest ways of generating such a summary from the encoder are either averaging the hidden states over time or taking only the hidden state at the last time step. We currently use the first option (link).
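For reference, here are the two options side by side in a minimal numpy sketch (the tanh init mirrors the non-linear transformation mentioned above; all names and shapes here are hypothetical):

```python
import numpy as np

T, H = 10, 4
enc_states = np.random.randn(T, H)   # hypothetical encoder hidden states over time

c_mean = enc_states.mean(axis=0)     # option 1: average over time (current default)
c_last = enc_states[-1]              # option 2: last time step only

# Decoder init: a learned non-linear transformation of the context,
# e.g. s_0 = tanh(W c + b), with hypothetical parameters W and b.
W = np.random.randn(H, H)
b = np.zeros(H)
s0 = np.tanh(c_last @ W + b)         # initial decoder hidden state
```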
I found that both approaches yield similar validation cost as training progresses. However, the generated samples are quite different. Here are samples.
The interesting point is that the decoder tends to generate very short samples: the EOS token is often sampled within the very first steps if we initialize the decoder RNN with the average of the encoder hidden states. It is also often the case that the generated samples are completely wrong with respect to the meaning of the source sentences.
If we instead take the last hidden state as the source of information for initializing the decoder's hidden state, you may see the following samples.
Such a discrepancy between the two approaches seems quite strange to me. Note that the setup is identical for both except for the way the fixed-size context from the encoder is generated.
Until we understand why this happens, I would recommend using the second option (last hidden state) to generate the context vector fed into the decoder in your experiments.
I'm going to write some scripts to translate German sentences and then compare the two approaches in terms of BLEU.
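A rough sketch of what such a comparison script could look like, using NLTK's corpus-level BLEU; `translate_fn` is a hypothetical callable wrapping each trained model:

```python
from nltk.translate.bleu_score import corpus_bleu

def score(translate_fn, sources, references):
    # translate_fn: hypothetical function mapping a source sentence to a
    # tokenized hypothesis; sources/references are lists of token lists.
    hypotheses = [translate_fn(src) for src in sources]
    refs = [[ref] for ref in references]  # corpus_bleu expects a list of reference lists per sentence
    return corpus_bleu(refs, hypotheses)

# e.g. compare the two initialization schemes on the same dev set:
# score(mean_init_translate, dev_src, dev_ref)
# score(last_state_translate, dev_src, dev_ref)
```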