jelmervdl opened this issue 2 years ago
Thank you for investigating!
The de-en model does seem to generate `Ã3` garbage and probably needs to have its training data investigated. That said, my reaction to this particular issue is that we should add the `0xad` filter to the training pipeline as well, and strip them out here.
I agree with your preference for 3. To your issues:
So in summary, I agree with removing them from the input at the text layer (not the HTML layer).
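For illustration, a minimal sketch (a hypothetical helper, not the actual extension or bergamot-translator code) of what stripping at the text layer could look like:

```python
# Hypothetical pre-processing step: drop soft hyphens (U+00AD) from the text
# handed to the translator. They are only word-wrapping hints, so removing
# them does not change the visible text.
SOFT_HYPHEN = "\u00ad"

def strip_soft_hyphens(text: str) -> str:
    return text.replace(SOFT_HYPHEN, "")

assert strip_soft_hyphens("Fes\u00adti\u00adval") == "Festival"
```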
@jelmervdl I believe it will be helpful to see the symptom immediately so that I have a better understanding of the problem. Screenshots could help here: could you edit the issue to attach a few? I'm assuming you spotted this while experimenting with the extension. In the long run, does it make sense to keep the latest build at https://github.com/jelmervdl/firefox-translations/releases available to download, use in Firefox Nightly, and check?
There is no way to restore the soft hyphens in the translation output. But I would say we're okay with this. Even if we go with 1, there is no hard guarantee that the model will restore the soft hyphens in the right places.
Is it possible to consider the hyphen a placeholder and then transfer it onto the target text utilizing control token replacement if we look into retraining eventually? Perhaps there could be a "right place" for specific language pairs if SentencePiece tokenization tends to behave similarly? On the other hand, could this create more trouble than it's worth?
While not strictly related to this issue, I think in the long run it will help the project to compile a "tricks of the trade: what to do and what not to do" guide for translation models that suit our use case (more real-world than a WMT competition), with the consensus on this issue being one example. In my experience, chasing BLEU scores beyond a certain point deteriorates the real-world experience. Aggressive cleaning also leaves a model unable to copy unknowns into the right places, which would be helpful in code-mixed scenarios.
Extract from Wikipedia trying to visualise it. Left is the text block as it appears on the German Wikipedia front page. The middle is the same text with all characters matching `[^A-Za-z0-9-\s]` replaced with an underscore. You can see `…-Festival` in the last bullet point become `…-Fes_ti_val`. Those `_` are the (normally hidden) soft hyphens. On the right is the output of bergamot-translator with the current en-de model. The `Ã3` garbage coincides with the places where the soft hyphens used to be.
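For reference, the middle-column view can be reproduced with a small script along these lines (a sketch; the regex is the one quoted above, and the sample string is just an illustration):

```python
import re

# Anything that is not an ASCII letter, digit, '-' or whitespace gets replaced
# with an underscore, which makes normally invisible characters such as the
# soft hyphen (U+00AD) show up.
pattern = re.compile(r"[^A-Za-z0-9-\s]")

sample = "Fes\u00adti\u00adval"     # looks like "Festival" when rendered
print(pattern.sub("_", sample))     # prints "Fes_ti_val"
```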
> Is it possible to consider the hyphen a placeholder and then transfer it onto the target text utilizing control token replacement if we look into retraining eventually?
I think that will be very difficult. That would rely on alignment being so precise that it can guide these particular soft hyphens to syllables, and the vocabulary would also have to allow for it. I don't think it is worth the effort to focus on that. Especially not short term.
Thank you for the screenshots.
> That would rely on alignment being so precise that it can guide these particular soft hyphens to syllables, and the vocabulary would also have to allow for it.

To be clear, I'm talking about copying from the source here, treating it similarly to unknowns; the vocabulary requirement is just the soft hyphen mapping to `<unk>` or perhaps another control token.
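As a quick sanity check of that assumption, one could look at what the current vocab does with the character (a sketch, assuming the `sentencepiece` Python package and the shared `vocab.deen.spm` mentioned further down):

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.load("vocab.deen.spm")

# Pieces that are not in the vocabulary map to the unknown id; if the soft
# hyphen does, the copy-from-source idea above could in principle treat it
# like any other unknown.
pieces = sp.encode_as_pieces("Fes\u00adti\u00adval")
print(pieces)
print([sp.piece_to_id(p) for p in pieces], "unk id:", sp.unk_id())
```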
> I don't think it is worth the effort to focus on that. Especially not short term.
I concur.
By the way, probably the cleanest way to eliminate the character is to add it to a custom SentencePiece normalization rule, which then gets baked into the `.spm` and requires no C++ code change. If we can figure out how to retrofit our spm files...
I think I was able to fix this by editing the vocab file:

1. Added a rule to the normalization TSV:

   ```
   AD		# soft hyphen
   ```

   (that's two tab characters between `AD` and `#`). See normalization.md for details.
2. Ran `spm_train --normalization_rule_tsv=nmt_nfkc_with_softhyphen.tsv --model_prefix=model --input=just-some-text.txt --vocab_size=30` (none of the other options really matter, we'll not use the actual vocab).
3. Copied the `precompiledCharsmap` from the resulting `model.model` file into `vocab.deen.spm`. I used this python script I found on Github. The `precompiledCharsmap` is almost at the end of the file.

It should be easy to write a little Python script that does this.

Awesome
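A rough sketch of what that little Python script could look like, assuming the `sentencepiece` package (which ships the model protobuf as `sentencepiece_model_pb2`); `model.model` and `vocab.deen.spm` are the files from the steps above, and the output name is just a placeholder:

```python
import sentencepiece as spm
from sentencepiece import sentencepiece_model_pb2 as sp_model

# Read the charsmap (field `precompiled_charsmap` in the proto, i.e. the
# precompiledCharsmap mentioned above) from the throwaway model trained
# with the soft-hyphen normalization rule...
donor = sp_model.ModelProto()
with open("model.model", "rb") as f:
    donor.ParseFromString(f.read())

# ...and graft it onto the production vocab.
target = sp_model.ModelProto()
with open("vocab.deen.spm", "rb") as f:
    target.ParseFromString(f.read())
target.normalizer_spec.precompiled_charsmap = donor.normalizer_spec.precompiled_charsmap

with open("vocab.deen.softhyphen.spm", "wb") as f:
    f.write(target.SerializeToString())

# Quick check: with the patched vocab, text with and without soft hyphens
# should now tokenize identically.
sp = spm.SentencePieceProcessor()
sp.load("vocab.deen.softhyphen.spm")
assert sp.encode_as_pieces("Fes\u00adti\u00adval") == sp.encode_as_pieces("Festival")
```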
Wikipedia uses `0xad` unicode characters inside words to help word wrapping, but our translation models generally don't like that. For de->en it results in a lot of seemingly random `Ã3` occurring in text.

Examples: https://de.wikipedia.org/wiki/Wikipedia:Hauptseite
I've only seen this happen on Wikipedia so far.
How should we handle these?
A few options:
My preference is 3 (hence me posting this issue here) but I see some "issues" with it.