bartvm / nmt

Neural machine translation
MIT License

Generation of shorter sentences #48

Closed JinseokNam closed 8 years ago

JinseokNam commented 8 years ago

As you may know, the initial hidden state of the decoder RNN is a non-linear transformation of the information that the encoder RNN provides (link).

The simplest ways of generating this information from the encoder are either averaging the hidden states over time or taking only the hidden state at the last time step. We currently use the first option (link).
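For concreteness, here is a minimal numpy sketch of the two options and of the usual tanh initialization of the decoder state. It is illustrative only, not this repo's actual code, and the parameter names `W_init` and `b_init` are made up:

```python
# Illustrative only: the two ways of building a fixed-size context c
# from the encoder hidden states, plus the non-linear decoder init.
import numpy as np

rng = np.random.default_rng(0)
T, dim = 28, 512                       # source length, hidden size
H = rng.standard_normal((T, dim))      # stand-in encoder states h_1..h_T

c_mean = H.mean(axis=0)   # option 1: average the states over time
c_last = H[-1]            # option 2: take only the last state

# Decoder initial state: a non-linear transformation of the context.
# W_init / b_init are hypothetical parameter names.
W_init = 0.01 * rng.standard_normal((dim, dim))
b_init = np.zeros(dim)
h0 = np.tanh(c_mean @ W_init + b_init)  # swap in c_last for option 2
```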

I found that both approaches yield a similar validation cost as training progresses. However, the generated samples are quite different. Here are some samples.

{ "update_time": 1.6357389999902807, "train_time": 86440.602338, "validation_cost": 65.87356567382812, "iteration": 60000, "average_target_length": 29.774999618530273, "epoch": 1, "cost": 81.53584289550781, "average_source_length": 28, "samples": [ { "sample": "That is something about that but throughout key terms that have hitherto left off both fairly ", "source": "Das heißt , wir sollten hier sehr viel mehr zusammenarbeiten , und zwar sowohl innerhalb der Europäischen Union im Verhältnis zu den Mitgliedstaaten als auch im Verhältnis der Europäischen Union zu den Drittstaaten . ", "truth": "This means that there should be far more cooperation , both between the European Union and its Member States and between the European Union and third countries . " }, { "sample": "The Galerie ", "source": "Das Frühstücksbuffet wird zwischen 7 : 00 und 11 : 00 Uhr , am Wochenende zwischen 7 : 30 und 12 : 00 Uhr serviert . ", "truth": "The hotel serves a buffet breakfast each morning between 07 : 00 and 11 : 00 ( 07 : 30 and 12 : 00 at weekends ) . " }, { "sample": "That divides ", "source": "Wir kennzeichnen Ihre mit einer . Die Nummer ist diskret ins und steht außerdem auf einer Karte , die Sie beim Erwerb erhalten . ", "truth": "When you opt for the most exclusive Fritz Hansen furniture we have chosen to provide an extra protection by marking your new furniture with a unique number . " }, { "sample": "Be the President ", "source": "schriftlich . - Ich habe mich für die zwölf , in der enthaltenen Maßnahmen ausgesprochen und hoffe , dass die Kommission diese in wirksame legislative Maßnahmen umwandeln wird . ", "truth": "in writing . - I supported the twelve measures contained in the Single Market Act and hope that the Commission will translate this into effective legislative measures . " }, { "sample": "Deal ", "source": "Dadurch wird jedoch nicht das eigentliche Problem behoben , nämlich dass diese Politik in den Papierkorb gehört . ", "truth": "But this does not sort out the root of the problem , in that this is a policy that deserves to be consigned to the rubbish bin . " } ] }

The interesting point is that the decoder tends to generate very short samples: if we initialize the decoder RNN with the average of the encoder hidden states, the EOS token is often sampled almost at the very first step. It is also often the case that the generated samples are completely wrong with respect to the meaning of the source sentences.
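To make the early-EOS effect concrete, here is a minimal sampling-loop sketch (all names are illustrative): generation stops as soon as EOS is drawn, so an initial state that puts high probability on EOS at the first step yields very short samples.

```python
# Illustrative sampling loop: a single early EOS draw truncates the output.
import numpy as np

EOS = 0
rng = np.random.default_rng(0)

def sample_sentence(step_fn, h0, max_len=50):
    """step_fn(h, prev_token) -> (next_h, token_probs); stops at EOS."""
    h, prev, tokens = h0, EOS, []
    for _ in range(max_len):
        h, probs = step_fn(h, prev)
        prev = int(rng.choice(len(probs), p=probs))
        if prev == EOS:   # if p(EOS | h0) is large, this fires at step 1
            break
        tokens.append(prev)
    return tokens
```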

If we switch to taking the last hidden state as the source of information for initializing the decoder's hidden state, we get the following samples.

{ "update_time": 1.8340170000010403, "train_time": 93079.082654, "validation_cost": 65.62248229980469, "iteration": 60000, "average_target_length": 29.774999618530273, "epoch": 1, "cost": 81.00658416748047, "average_source_length": 28, "samples": [ { "sample": "That means , so should be playing together much on here , for both within the European Union and with the Member States following the ", "source": "Das heißt , wir sollten hier sehr viel mehr zusammenarbeiten , und zwar sowohl innerhalb der Europäischen Union im Verhältnis zu den Mitgliedstaaten als auch im Verhältnis der Europäischen Union zu den Drittstaaten . ", "truth": "This means that there should be far more cooperation , both between the European Union and its Member States and between the European Union and third countries . " }, { "sample": "The buffet breakfast has terraces between 7 . Cancun and 11 : 00 12.00 ; a 7 . ", "source": "Das Frühstücksbuffet wird zwischen 7 : 00 und 11 : 00 Uhr , am Wochenende zwischen 7 : 30 und 12 : 00 Uhr serviert . ", "truth": "The hotel serves a buffet breakfast each morning between 07 : 00 and 11 : 00 ( 07 : 30 and 12 : 00 at weekends ) . " }, { "sample": "We endorse legacy of Saxony ", "source": "Wir kennzeichnen Ihre mit einer . Die Nummer ist diskret ins und steht außerdem auf einer Karte , die Sie beim Erwerb erhalten . ", "truth": "When you opt for the most exclusive Fritz Hansen furniture we have chosen to provide an extra protection by marking your new furniture with a unique number . " }, { "sample": "in writing . I voted out only in the recent Treaties as the other being ", "source": "schriftlich . - Ich habe mich für die zwölf , in der enthaltenen Maßnahmen ausgesprochen und hoffe , dass die Kommission diese in wirksame legislative Maßnahmen umwandeln wird . ", "truth": "in writing . - I supported the twelve measures contained in the Single Market Act and hope that the Commission will translate this into effective legislative measures . " }, { "sample": "Consequently , this has turned a bit to the report ", "source": "Dadurch wird jedoch nicht das eigentliche Problem behoben , nämlich dass diese Politik in den Papierkorb gehört . ", "truth": "But this does not sort out the root of the problem , in that this is a policy that deserves to be consigned to the rubbish bin . " } ] }

Such a discrepancy between the two approaches seems quite strange to me. Note that the setup for both approaches is identical except for the way the fixed-size context is generated from the encoder.

Until we know why this happens, I would recommend using the second option to generate the context vector fed into the decoder in your experiments.

I'm going to write some scripts to translate German sentences and then compare the two approaches in terms of BLEU.
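A sketch of what such a comparison script could look like, using NLTK's `corpus_bleu`. The file names are placeholders; this is not a script that exists in the repo:

```python
# Hypothetical comparison script: corpus-level BLEU for both decoder inits.
from nltk.translate.bleu_score import corpus_bleu

def read_tokenized(path):
    """One whitespace-tokenized sentence per line."""
    with open(path, encoding="utf-8") as f:
        return [line.split() for line in f]

# Placeholder file names for the references and the two sets of translations.
references = [[ref] for ref in read_tokenized("newstest.de-en.ref")]
hyps_mean = read_tokenized("translations_mean_init.txt")
hyps_last = read_tokenized("translations_last_init.txt")

print("mean-of-states init BLEU:", corpus_bleu(references, hyps_mean))
print("last-state init BLEU:", corpus_bleu(references, hyps_last))
```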

bartvm commented 8 years ago

I didn't even know Cho changed that in this code...

"Last hidden state" is a bit confusing since we use a bidirectional RNN. Which one do you mean? If you ask me, we should use the last hidden state of the RNN that runs from right-to-left. Google saw a huge improvement when they switched the order of the input sentence (i.e. they read the source from right to left, and then produce the target left to right). The intuition here is that when generating the beginning of the sentence in the target language, you want to have the beginning of the source sentence fresh in memory.

JinseokNam commented 8 years ago

I realized that the second option I mentioned is implemented in the code slightly differently from what I initially thought.

If I understand correctly, the second option feeds the last states of both the left-to-right and right-to-left chains into the decoder. This means that the backward chain actually operates on the reversed sentence, while the forward RNN summarizes the original sequence from left to right.

What the backward RNN does corresponds to what the Google authors used in their paper.
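In other words, the context amounts to something like the following sketch (a bare tanh RNN, illustrative only): the backward chain reads the reversed source, and its last state is concatenated with the last state of the forward chain.

```python
# Illustrative only: forward and backward chains over the same source,
# with the two last states concatenated into the decoder context.
# (A real bidirectional encoder uses separate parameters per direction;
# they are shared here only for brevity.)
import numpy as np

rng = np.random.default_rng(0)
dim, T = 256, 28
W = 0.01 * rng.standard_normal((dim, dim))
U = 0.01 * rng.standard_normal((dim, dim))

def run_rnn(x_seq):
    """Bare tanh RNN; returns the sequence of hidden states."""
    h, states = np.zeros(dim), []
    for x in x_seq:
        h = np.tanh(x @ W + h @ U)
        states.append(h)
    return states

source = [rng.standard_normal(dim) for _ in range(T)]  # embedded tokens

fwd = run_rnn(source)        # reads the sentence left to right
bwd = run_rnn(source[::-1])  # reads it right to left (reversed input)

# Context fed to the decoder: last state of each chain, concatenated.
context = np.concatenate([fwd[-1], bwd[-1]])  # shape: (2 * dim,)
```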

bartvm commented 8 years ago

Just looked at the code, and yeah, it seems to use the last states of both RNNs. That's probably what you want to do. You definitely want the last state of the right-to-left encoder. The last hidden state of the left-to-right encoder might not be as useful, but your network can always learn to ignore it (assuming we can optimize it well).

I definitely agree we should switch defaults in master, but maybe wait a day or two so that the multi-GPU experiments that @anirudh9119 and I are running aren't affected.