UKPLab / EasyNMT

Easy to use, state-of-the-art Neural Machine Translation for 100+ languages
Apache License 2.0

html tags, proper nouns #35

Open tpilkati opened 3 years ago

tpilkati commented 3 years ago

First, I want to congratulate the team on your work; Nils especially is imho one of the best devs in the NLP community. Sentence-transformers and this NMT translator repo have been very helpful to us at Contents.com.

I use Opus-mt right now. It is great, and I noticed that it even keeps HTML tags. But sometimes it makes the mistake of generating ">/strong>", "×/strong>" or "--/strong>" instead of "</strong>" (strong as an example; the same happens for other tags like h3 or li, as far as I know). I solved it by simply replacing: text = text.replace('×/', '</').replace('>/', '</').replace('--/', '</'). It is a very simple thing, I just wanted to let you know.
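A slightly stricter variant of the replacement above, as a minimal sketch: it only rewrites the three broken prefixes reported here ("×/", ">/", "--/") when a tag name and ">" follow, so ordinary text containing ">/" is left alone. The function name `fix_closing_tags` is made up for illustration and is not part of EasyNMT.

```python
import re

# Broken closing-tag openers reported in this thread: "×/", ">/", "--/".
# The lookahead requires a tag name followed by ">" (e.g. "strong>", "h3>"),
# so sequences like "3 >/ 2" in normal text are not touched.
BROKEN_CLOSE = re.compile(r"(?:×|>|--)/(?=[A-Za-z][A-Za-z0-9]*>)")


def fix_closing_tags(text: str) -> str:
    """Rewrite e.g. '>/strong>' or '--/li>' to '</strong>' / '</li>'."""
    return BROKEN_CLOSE.sub("</", text)


if __name__ == "__main__":
    print(fix_closing_tags("<strong>Hallo>/strong> <li>Welt--/li>"))
```

This only covers the three patterns seen so far; other malformed variants would need to be added to the alternation.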

I wonder if you are aware of a translation model that is very good at keeping proper nouns (people, cities, company names, etc.) unchanged, even when they consist of two or more words? I could use NER, but it could decrease speed, and tbh I don't find the free NER libraries reliable enough.

Thank you!

nreimers commented 3 years ago

Hi @tpilkati Thanks for the compliment. Happy to hear that you find the projects useful :)

Yes, the models were not trained with HTML tags, so it is no big surprise that they fail to generate valid HTML tags here. I think it is best to remove the HTML tags before translation and then add them back afterwards if needed.
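One way to read "remove the tags, translate, add them back" is to split the markup into tags and text, and send only the text pieces to the model. The sketch below assumes that reading; `translate_html` is a hypothetical helper, and `translate` is any callable taking and returning a string (with EasyNMT it could be something like `lambda s: model.translate(s, target_lang="de")`).

```python
import re
from typing import Callable

# Capturing group so that re.split keeps the tags in the result list.
TAG = re.compile(r"(<[^>]+>)")


def translate_html(html: str, translate: Callable[[str], str]) -> str:
    """Translate only the text between tags; tags are copied through unchanged."""
    out = []
    for part in TAG.split(html):
        if TAG.fullmatch(part) or not part.strip():
            # A tag, or empty/whitespace-only text: keep as-is.
            out.append(part)
            continue
        # Keep the surrounding whitespace so spacing around tags survives.
        lead = part[: len(part) - len(part.lstrip())]
        trail = part[len(part.rstrip()):]
        out.append(lead + translate(part.strip()) + trail)
    return "".join(out)


if __name__ == "__main__":
    # str.upper stands in for a real translation call.
    print(translate_html("<p>Hello <strong>world</strong>!</p>", str.upper))
```

Note the trade-off: inline tags such as strong cut a sentence into fragments that are translated without their context, so quality may drop compared to translating the whole sentence. For block-level tags (p, li, h3) this matters less.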

I am not aware of such a model. The models learn from the training data, and if proper nouns are translated there (e.g. France and Frankreich), the model will learn that and translate them too. However, it struggles with proper nouns it has not seen before; in those cases the translation can sometimes be a bit odd.

Best Nils

glowinthedark commented 1 year ago

@nreimers: is there by chance a token or some kind of marker that can be used to mark untranslatable chunks?

I.e. something like:

<IGNORE>Dinsdale Piranha: </IGNORE>And now for something completely different!
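Nothing in this thread confirms that the models support such a marker. As a workaround sketch under that assumption, the marked chunks can be cut out before translation and copied through unchanged, so the model never sees them. `<IGNORE>` is the hypothetical marker from the example above, `translate_outside_markers` is a made-up name, and `translate` is any string-to-string callable.

```python
import re
from typing import Callable

# Hypothetical marker from the example above; not a feature of any model.
IGNORE = re.compile(r"<IGNORE>(.*?)</IGNORE>", re.DOTALL)


def _translate_chunk(chunk: str, translate: Callable[[str], str]) -> str:
    """Translate a chunk, preserving its leading/trailing whitespace."""
    if not chunk.strip():
        return chunk
    lead = chunk[: len(chunk) - len(chunk.lstrip())]
    trail = chunk[len(chunk.rstrip()):]
    return lead + translate(chunk.strip()) + trail


def translate_outside_markers(text: str, translate: Callable[[str], str]) -> str:
    """Translate everything except the content of <IGNORE>...</IGNORE> spans."""
    out = []
    pos = 0
    for match in IGNORE.finditer(text):
        out.append(_translate_chunk(text[pos:match.start()], translate))
        out.append(match.group(1))  # protected chunk, copied verbatim
        pos = match.end()
    out.append(_translate_chunk(text[pos:], translate))
    return "".join(out)


if __name__ == "__main__":
    sample = "<IGNORE>Dinsdale Piranha: </IGNORE>And now for something completely different!"
    # str.upper stands in for a real translation call.
    print(translate_outside_markers(sample, str.upper))
```

As with splitting on HTML tags, a protected chunk in the middle of a sentence splits that sentence into separately translated fragments, which can hurt fluency.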