browsermt / bergamot-translator

Cross-platform C++ library focused on optimized machine translation on consumer-grade devices.
http://browser.mt
Mozilla Public License 2.0

0xad: handling soft hyphens #337

Open jelmervdl opened 2 years ago

jelmervdl commented 2 years ago

Wikipedia uses 0xad (soft hyphen) Unicode characters inside words to help with word wrapping, but our translation models generally don't like that. For de->en it results in a lot of seemingly random Ã3 sequences appearing in the text.

Examples: https://de.wikipedia.org/wiki/Wikipedia:Hauptseite

I've only seen this happen on Wikipedia so far.

How should we handle these?

A few options:

  1. Train models to deal with them
  2. Let any extension filter them out: the extension knows that we're dealing with unicode and the context of the text. And this might just be a site-specific thing that needs a site-specific fix.
  3. Let bergamot-translate always filter out soft hyphen characters.

My preference is 3 (hence me posting this issue here), but I see some "issues" with it (a minimal sketch of such a filter follows the list below).

  1. bergamot-translator does not assume the input is Unicode at the moment. The translation models most likely do, though; I assume most training data is utf-8 at this point. The WASM & Python bindings also assume this, I think. So in practice this is less of a problem.
  2. There is no way to restore the soft hyphens in the translation output. But I would say we're okay with this. Even if we go with 1, there is no hard guarantee that the model will restore the soft hyphens in the right places.
  3. Does it need to be a toggle? (I'd say no, I doubt there will ever be a model that does handle the soft hyphen well)
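For illustration, here is a minimal sketch of what such a filter boils down to, written in Python rather than the library's C++ just to show the idea; the function name is made up and this is not actual bergamot-translator code:

```python
# Minimal sketch of option 3 (not actual bergamot-translator code): drop the
# soft hyphen (U+00AD) from the source text before it reaches the model.
SOFT_HYPHEN = "\u00ad"

def strip_soft_hyphens(text: str) -> str:
    # The character is only a line-break hint, so removing it does not change
    # the meaning of the text.
    return text.replace(SOFT_HYPHEN, "")

assert strip_soft_hyphens("Fes\u00adti\u00adval") == "Festival"
```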
kpu commented 2 years ago

Thank you for investigating!

The de-en model does seem to generate Ã3 garbage and probably needs to have its training data investigated. That said, my reaction to this particular issue is that we should add the 0xad filter to the training pipeline as well, and strip them out here.

I agree with your preference for 3. To your issues:

  1. The translation models do indeed assume utf-8 (and they should). I expect this will continue in perpetuity.
  2. I'm kind of ok with leaving them out. Definitely for the time being.
  3. No. If anything we should go back to the training pipeline and filter them out, IMO.

So, in summary, I agree with removing them from the input at the text layer (not the HTML layer).

jerinphilip commented 2 years ago

> Examples: de.wikipedia.org/wiki/Wikipedia:Hauptseite

@jelmervdl I believe it will be helpful to immediately see the symptom so I have a better understanding of the problem. Screenshots could help here - could you edit to attach a few? I'm assuming you spotted this while experimenting with the extension. In the long run, does it make sense to have the latest build from https://github.com/jelmervdl/firefox-translations/releases available to download for use in Firefox Nightly to check?

> There is no way to restore the soft hyphens in the translation output. But I would say we're okay with this. Even if we go with 1, there is no hard guarantee that the model will restore the soft hyphens in the right places.

Is it possible to consider the hyphen a placeholder and then transfer it onto the target text using control-token replacement if we look into retraining eventually? Perhaps there could be a "right place" for specific language pairs if SentencePiece tokenization tends to behave similarly? On the other hand, could this create more trouble than it's worth?

While not strictly related to this issue, I think in the long run it'll help the project to compile a "tricks of the trade: what to do and what not to do" guide for translation models that suit our use case (more generally real-world than a WMT competition), with the consensus on this issue being an example. In my experience, chasing BLEU scores beyond a certain point deteriorates the real-world experience. Aggressive cleaning also hurts a model's ability to copy unknowns into the right places, which could be helpful in code-mixed scenarios.

jelmervdl commented 2 years ago

[screenshot] Extract from Wikipedia trying to visualise it. On the left is the text block as it appears on the German Wikipedia front page. In the middle, all characters matching [^A-Za-z0-9-\s] have been replaced with underscores. You can see …-Festival in the last bullet point become …-Fes_ti_val. Those _ are the (normally hidden) soft hyphens. On the right is the output of bergamot-translator with the current de-en model. The Ã3 garbage coincides with the places where the soft hyphens used to be.
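For reference, a small sketch of that visualisation step (not the exact code used; the input string is just an example):

```python
# Replace everything outside plain ASCII letters, digits, hyphens and whitespace
# with an underscore, so normally invisible characters like U+00AD become visible.
import re

def reveal_hidden_chars(text: str) -> str:
    return re.sub(r"[^A-Za-z0-9\-\s]", "_", text)

print(reveal_hidden_chars("Fes\u00adti\u00adval"))  # -> Fes_ti_val
```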

> Is it possible to consider the hyphen a placeholder and then transfer it onto the target text using control-token replacement if we look into retraining eventually?

I think that will be very difficult. That would rely on alignment being so precise that it can guide these particular soft hyphens to syllables, and the vocabulary would also have to allow for it. I don't think it is worth the effort to focus on that. Especially not short term.

jerinphilip commented 2 years ago

Thank you for the screenshots.

> That would rely on alignment being so precise that it can guide these particular soft hyphens to syllables, and the vocabulary would also have to allow for it.

To be clear, I'm talking about copying from the source here, treating it similarly to unknowns; the vocabulary requirement is just the soft hyphen mapping to <unk> or perhaps another control token.

> I don't think it is worth the effort to focus on that. Especially not short term.

I concur.

kpu commented 2 years ago

By the way, probably the cleanest way to eliminate the character is to add it to a custom sentencepiece normalization, which then gets baked into the .spm and requires no C++ code change. If we can figure out how to retrofit our spm files...

jelmervdl commented 2 years ago

I think I was able to fix this by editing the vocab file:

  1. Take the default character normalisation table that was used when training the sentencepiece model, i.e. nmt_nfkc.tsv
  2. Add an entry that maps the soft hyphen to nothing: AD, then two tab characters (i.e. an empty replacement), then the comment # soft hyphen. See normalization.md for details.
  3. Train some fake vocab with that file: `spm_train --normalization_rule_tsv=nmt_nfkc_with_softhyphen.tsv --model_prefix=model --input=just-some-text.txt --vocab_size=30` (none of the other options really matter, we'll not use the actual vocab).
  4. Copy the precompiledCharsmap from the resulting model.model file into vocab.deen.spm. I used this Python script I found on GitHub. The precompiledCharsmap is almost at the end of the file. It should be easy to write a little Python script that does this (a rough sketch follows this list). [screenshot]
  5. Tried it out in TranslateLocally: [screenshot] Notice the lack of hyphens in words in the translation :)
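Roughly, step 4 amounts to something like the following sketch (not the exact script I used; the sentencepiece pip package ships the protobuf bindings, and the output file name is just an example):

```python
# Rough sketch of step 4: copy the precompiled normalisation table from the
# throwaway model into the existing vocab. Token pieces and scores stay untouched.
from sentencepiece import sentencepiece_model_pb2 as sp_pb2

def load(path: str) -> sp_pb2.ModelProto:
    proto = sp_pb2.ModelProto()
    with open(path, "rb") as f:
        proto.ParseFromString(f.read())
    return proto

donor = load("model.model")      # trained with nmt_nfkc_with_softhyphen.tsv
target = load("vocab.deen.spm")  # existing bergamot vocab

# Overwrite only the normalizer's precompiled charsmap.
target.normalizer_spec.precompiled_charsmap = donor.normalizer_spec.precompiled_charsmap

with open("vocab.deen.softhyphen.spm", "wb") as f:  # example output name
    f.write(target.SerializeToString())
```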
kpu commented 2 years ago

Awesome

jelmervdl commented 2 years ago

Now in script form: https://gist.github.com/jelmervdl/712ba7a4ed663ce62d43e6f902a7254e#file-update-py