browsermt / bergamot-translator

Cross platform C++ library focusing on optimized machine translation on the consumer-grade device.
http://browser.mt
Mozilla Public License 2.0

Marian emits no EOS #235

Open jerinphilip opened 3 years ago

jerinphilip commented 3 years ago

Discovered while working on pivot translation, where the two-pointer algorithm broke because the intermediate translation had no EOS.

Isolated an abort in which it is visible that SentencePiece does not decode an EOS (possibly because Marian never predicted it). More documentation will appear here.

9e2477d (#19)

jphilip@var:~/code/bergamot-translator/bergamot-translator-tests$ INTGEMM_CPUID=AVX2 /mnt/Storage/jphilip/bergamot-build/bergamot-test --bergamot-mode test-pivot --model-config-paths /home/jphilip/code/bergamot-translator/bergamot-translator-tests/models/enes/enes.student.tiny11/config.intgemm8bitalpha.yml.bergamot.yml /home/jphilip/code/bergamot-translator/bergamot-translator-tests/models/esen/esen.student.tiny11/config.intgemm8bitalpha.yml.bergamot.yml --cpu-threads 4 < data/simple/bergamot/input.txt
[2021-10-26 08:21:53] Error: No EOS in targetSentenceMappings
[2021-10-26 08:21:53] Error: Aborted from void marian::bergamot::ResponseBuilder::buildTranslatedText(marian::Histories&, marian::bergamot::Response&) in /home/jphilip/code/bergamot-translator/src/translator/response_builder.cpp:44

[CALL STACK]
[0x55aeb06890f9]                                                       + 0x1130f9
[0x55aeb0684e3d]                                                       + 0x10ee3d
[0x55aeb0683f0a]                                                       + 0x10df0a
[0x55aeb0691944]                                                       + 0x11b944
[0x55aeb0668f39]                                                       + 0xf2f39
[0x55aeb06942e4]                                                       + 0x11e2e4
[0x7f2bd4f7dd80]                                                       + 0xd0d80
[0x7f2bd46df6db]                                                       + 0x76db
[0x7f2bd4408a3f]    clone                                              + 0x3f

The above was run on var. I have not been able to reproduce it on CI, which runs on AVX2/AVX512 machines, as seen in the link above.

jerinphilip commented 3 years ago
Input string: Free software integrated with an open-source web browser, such as Mozilla Firefox, will enable bottom-up adoption by non-experts, resulting in cost savings for private and public sector users who would otherwise procure translation or operate monolingually.
Output string: El software libre integrado con un navegador web de código abierto, como Mozilla Firefox, permitirá la adopción de abajo hacia arriba por parte de los no expertos, lo que resulta en ahorros de costos para los usuarios de los sectores público y privado que de otro modo adquirirían traducción o operarían de forma monolinguista.
Input string had words = 57
Words(69): 47 6714 1582 3577 30 32 14825 11178 110 2607 2 4882 5164 3 57 2397 961 4707 22568 3973 1465 3 7078 6 1291 2 14105 1309 9594 31 150 2 15 40 1444 3 75 16 3067 11 22233 2 2971 26 15 4004 2 15 1578 991 10 1977 16 2 747 1240 7973 1663 9235 49 21588 1663 2 348 8101 2154 783 2725 5
marian::Word::NONE 4294967295
marian::Word::ZERO 0
marian::Word::DEFAULT_EOS_ID 0
marian::Word::DEFAULT_UNK_ID 1
vocabs_.target()->getEosId() 0
The last word is: .
[2021-10-26 08:48:15] Error: No EOS in targetSentenceMappings

I had the max length factor set to 1.2, which means 57 * 1.2 = 68.4 < 69. The output hit the length limit before Marian could emit an EOS, so it was truncated without one.
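
The arithmetic above can be sketched as a small check. This is only an illustration, not Marian's code: the helper names are hypothetical, and rounding the cap up to a whole token is an assumption that happens to match the 69 words printed in the log.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: target-length cap derived from the source length.
// Rounding up (ceil) is an assumption made for this sketch.
std::size_t maxTargetLength(std::size_t sourceWords, double factor) {
  return static_cast<std::size_t>(
      std::ceil(factor * static_cast<double>(sourceWords)));
}

// True when the hypothesis ends with the EOS id.
bool endsWithEos(const std::vector<std::uint32_t>& words, std::uint32_t eosId) {
  return !words.empty() && words.back() == eosId;
}

// True when the hypothesis used up the whole length budget without
// ending in EOS, which is the situation reported in this issue.
bool truncatedWithoutEos(const std::vector<std::uint32_t>& words,
                         std::uint32_t eosId, std::size_t cap) {
  return words.size() >= cap && !endsWithEos(words, eosId);
}
```

Under these assumptions, 57 source words with a factor of 1.2 give a cap of 69 tokens, and a 69-token output whose last id is 5 (with EOS id 0) is flagged as truncated. With a factor of 1.5 the cap would be 86.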

jerinphilip commented 3 years ago

Another mitigation is to edit the marian-dev source so that we always get an EOS on the target side. That preserves the strong assertion in our pipeline that the words and wordRanges going into and coming out of translation always end in EOS.

https://github.com/browsermt/marian-dev/blob/master/src/translator/beam_search.cpp#L496-L510

XapaJIaMnu commented 3 years ago

I guess you can hack it around by manually adding EoS as a post-processing step, but a max length factor of 1.2 is too low and will create issues when there's a lot of BPE segmentation; 1.5 or 1.6 should be safer options.

jerinphilip commented 3 years ago

Agreed that 1.2 was too low. However, I hit a segfault at WNGT even with a factor of 2.0 (128 -> 256), after which I believe Estonian is a really verbose language (:D); it was a sentence containing a lot of medicine names. After the WNGT20 incident I consider this more of an engineering problem, where our code should be robust to this, than a hyperparameter choice.

> I guess you can hack it around by manually adding EoS as a post-processing step

Unfortunately, it is not that simple. If I insert the token manually here in bergamot, I don't have an alignment column for it, which breaks things downstream. Hence my thinking that the solution should be in marian: emit at most max-length - 1 non-EOS tokens, pick EOS at max-length, and provide me the attention as alignments, by slightly modifying the existing beam search.
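
To make the proposal concrete, here is a rough sketch of the idea using a toy greedy decoder rather than Marian's beam search. Everything in it (`Step`, `StepFn`, `Hypothesis`, `decodeWithForcedEos`) is hypothetical and made up for illustration. The point is only that when EOS is forced inside the decoding loop, the forced token still gets the attention row computed at that step, which a post-processing insert cannot provide.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical output of one decoder step: the predicted token and the
// attention over source positions computed while predicting it.
struct Step {
  std::uint32_t token;
  std::vector<float> attention;
};

// Hypothetical decoder interface: maps the target prefix to the next step.
using StepFn = std::function<Step(const std::vector<std::uint32_t>&)>;

struct Hypothesis {
  std::vector<std::uint32_t> words;
  std::vector<std::vector<float>> alignment;  // one row per target token
};

// Greedy decoding that reserves the last slot for EOS: at most
// maxLength - 1 non-EOS tokens are emitted, and EOS is forced at maxLength.
Hypothesis decodeWithForcedEos(const StepFn& step, std::size_t maxLength,
                               std::uint32_t eosId) {
  Hypothesis hyp;
  while (hyp.words.size() < maxLength) {
    Step s = step(hyp.words);
    const bool lastSlot = hyp.words.size() + 1 == maxLength;
    const std::uint32_t token = lastSlot ? eosId : s.token;
    hyp.words.push_back(token);
    // The forced EOS keeps the attention row from this step, so words and
    // alignment rows stay the same length.
    hyp.alignment.push_back(std::move(s.attention));
    if (token == eosId) break;
  }
  return hyp;
}
```

With a model that never predicts EOS and `maxLength` of 4, this yields three model tokens followed by EOS, plus four alignment rows. A real fix would have to do the equivalent inside the existing beam search, which is not shown here.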