THUNLP-MT / THUMT

An open-source neural machine translation toolkit developed by Tsinghua Natural Language Processing Group
BSD 3-Clause "New" or "Revised" License

improve the performance of Beam Search #62

Closed xiaotaw closed 3 years ago

xiaotaw commented 5 years ago

Instead of the max length penalty, it is more reasonable to use the current length penalty in the `_is_finished` function. It improved performance in my experiments.

minicheshire commented 5 years ago

Thanks! We'll merge your PR after some experiments.

Glaceon31 commented 5 years ago

The decoding process finishes only when no alive candidate can outperform any finished candidate in any case (even when the length is maximized and the probability of every subsequent word is 1). So we should use the max length penalty here.
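
Concretely, the stopping test being described looks roughly like this; this is a sketch with illustrative names (`length_penalty`, `best_alive_log_prob`, `worst_finished_score`), not THUMT's actual code:

```python
def length_penalty(length, alpha=0.6):
    # GNMT-style length penalty: lp(Y) = ((5 + |Y|) / 6) ** alpha
    return ((5.0 + length) / 6.0) ** alpha

def search_is_finished(best_alive_log_prob, worst_finished_score,
                       max_length, alpha=0.6):
    # Optimistic bound on any alive candidate: assume every remaining
    # word has probability 1 (contributes 0 to the log-prob) and
    # normalize by the *max* length penalty, the largest possible divisor.
    # Since log-probs are negative, a larger divisor yields a higher score.
    best_possible_score = best_alive_log_prob / length_penalty(max_length, alpha)
    # Stop only when even this bound cannot beat the worst finished score.
    return best_possible_score < worst_finished_score
```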

Strangely, I decoded with your code expecting some differences in the translations, but it yields the same translations as the original code on a test set containing 1000+ sentences.

xiaotaw commented 5 years ago

I tried several test sets, and 99.8% of the translations are identical. The differences are:

  1. Some translations (~0.2%) produced with the current length are one or two words shorter than those produced with the max length.
  2. A few translations produced with the max length end with repeated words; this does not happen with the current length.

In addition, using the current length avoids decoding pads after eos. Decoding pads makes no difference to BLEU but adds unnecessary computation. Using the current length is, in my view, more reasonable.

Glaceon31 commented 5 years ago

This PR is an algorithmic change rather than a bug fix. To keep THUMT a stable and effective NMT toolkit, we only accept algorithmic changes that are theoretically sound and either widely applied in the MT community or shown to achieve stable, significant improvements on a public dataset.

Beam search is a proven, effective decoding algorithm. Using the max length sticks to the original beam search algorithm, which attempts to find a translation with higher probability. Using the current length prunes the search space, and it can never find a better translation in terms of probability.

Sorry that we cannot merge this for now. If there are important references (e.g. a published paper) supporting the superiority of using the current length, please let me know.

Thanks for your contribution to THUMT!

xiaotaw commented 5 years ago

Could you please point me to the paper describing the original beam search algorithm? I have only read one paper about beam search, published by Google. In the GNMT paper, lp(Y) = (5 + |Y|)^alpha / (5 + 1)^alpha, where |Y| is the current length, not the max length, I think.

(attached image: GNMT_decoder_score_function.png)

As far as I know, the length penalty is introduced because the decoder tends to output shorter sentences due to the accumulation of negative log-probabilities as length increases. With the length penalty, sentences of different lengths are put on a comparable footing. But if we divide the log-probability by a fixed length penalty term, as with the max length penalty, does it still work as length normalization?
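
As a sanity check, here is a small sketch of the difference (the candidate probabilities are made up purely for illustration):

```python
def gnmt_length_penalty(length, alpha=0.6):
    """GNMT length penalty: lp(Y) = (5 + |Y|)^alpha / (5 + 1)^alpha."""
    return (5.0 + length) ** alpha / 6.0 ** alpha

# Two hypothetical finished candidates:
short_log_prob, short_len = -4.0, 5    # shorter, higher raw probability
long_log_prob, long_len = -4.5, 10     # longer, lower raw probability

# Per-candidate normalization: the longer candidate can win.
print(short_log_prob / gnmt_length_penalty(short_len))  # approx. -2.94
print(long_log_prob / gnmt_length_penalty(long_len))    # approx. -2.60, higher

# Dividing by a *fixed* penalty (e.g. lp(max_length)) rescales every
# candidate by the same constant, so the ranking is identical to raw
# log-probabilities and the bias toward short sentences comes back.
fixed = gnmt_length_penalty(50)
print(short_log_prob / fixed > long_log_prob / fixed)   # True: shorter wins again
```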

Glaceon31 commented 5 years ago

THUMT already penalizes with the current length (see line 69). The max length is only used in the early-stopping check (line 151); it is not used in the final score.

xiaotaw commented 5 years ago

Yes, you're right:

- `_loop_fn`: score = log_prob / current_length_penalty (lines 69-70)
- `_is_finished`: score = log_prob / max_length_penalty (lines 151-152)

To make it clear, let's look at a concrete example. Assume we have trained an MT system that translates upper-case characters into lower-case ones with beam_size = 1, e.g. "A B C" -> "a b c".

Given an unseen source sentence X, the system has already decoded "a b c d". In step 5, it proposes ("eos", prob = 0.5) and ("e", prob = 0.4).

With the current length penalty, the decoder stops here. With the max length penalty, the decoder may continue, since a longer sentence "a b c d e f g h i ..." may score higher thanks to the larger length penalty.
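
Plugging in numbers (assuming a per-step probability of 0.9 for the prefix, alpha = 0.6 and max_length = 50, all chosen just for illustration):

```python
import math

def lp(length, alpha=0.6):
    # GNMT length penalty, as in the formula above
    return (5.0 + length) ** alpha / 6.0 ** alpha

prefix = 4 * math.log(0.9)            # "a b c d", assuming 0.9 per step
finished = prefix + math.log(0.5)     # "a b c d eos"
alive = prefix + math.log(0.4)        # "a b c d e"

finished_score = finished / lp(5)     # approx. -0.82

# Current length penalty: the alive candidate already loses, decoding stops.
print(alive / lp(5) < finished_score)   # True (approx. -0.98 < -0.82)

# Max length penalty: the optimistic bound still beats the finished
# score, so decoding continues.
print(alive / lp(50) > finished_score)  # True (approx. -0.35 > -0.82)
```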

Both approaches make sense and produce good translations in practice, as we just verified on thousands of sentences. The max length penalty tends to output the highest-scoring sentence under the beam-size restriction, while the current length penalty tends to produce a good-enough translation at a reasonable computational cost.

So a practical option might be for THUMT to let users choose between the current length penalty and the max length penalty.

Glaceon31 commented 5 years ago

There are also cases where the max length is better: when the length-penalized scores satisfy score(y_{<t} y_t y_{t+1} y_{t+2}) > score(y_{<t} y_t) > score(y_{<t} y_t y_{t+1}), using the current length will never yield the best translation y_{<t} y_t y_{t+1} y_{t+2}.
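
Here is a toy numerical version of this situation (the probabilities are made up solely to exhibit this ordering; alpha = 0.6 and max_length = 50 assumed as before):

```python
import math

def lp(length, alpha=0.6):
    # GNMT length penalty, same formula as above
    return (5.0 + length) ** alpha / 6.0 ** alpha

finished = math.log(0.40)            # candidate ending with eos at length 4
alive = math.log(0.37)               # alive candidate at length 5
extended = alive + math.log(0.99)    # ... after one more high-prob word

f_score = finished / lp(4)           # approx. -0.718

# Current-length criterion: the alive candidate already scores worse,
# so the search stops and outputs the length-4 translation ...
print(alive / lp(5) < f_score)       # True (approx. -0.732)

# ... even though one more step would have produced a better candidate.
print(extended / lp(6) > f_score)    # True (approx. -0.698)

# The max-length criterion keeps searching, because the optimistic
# bound (normalizing by lp(max_length)) still beats f_score.
print(alive / lp(50) > f_score)      # True (approx. -0.263)
```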

According to our step-by-step observation of the decoding process, the NMT loss increases dramatically when useless words are decoded. We should let the model consider more candidate translations and determine the best one.

It is true that using the current length saves decoding time. However, the pruning may harm translation quality. A careful evaluation of the time cost and translation quality should be conducted before this trick is used in practical systems.

xiaotaw commented 5 years ago

Here are my results on 11 test sets, each containing 1000+ sentences. Sets 4 and 5 come from WMT17 (newstest2017). The BLEU scores of the test sets range from 20 to 40. Translation was run on a GPU, and I only ran each configuration once, but it is clear that using the current length penalty saves considerable time while having almost no effect on BLEU.

Looking forward to your further evaluation.

| Test set | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| BLEU(current_lp) - BLEU(max_lp) | -0.04 | 0 | 0.01 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Time(max_lp) / s | 34 | 12 | 113 | 81 | 27 | 12 | 14 | 21 | 17 | 16 | 19 |
| Time(current_lp) / s | 27 | 10 | 76 | 74 | 24 | 9 | 12 | 20 | 15 | 16 | 15 |

Glaceon31 commented 5 years ago

We will test it when we have enough resources.