UKPLab / EasyNMT

Easy to use, state-of-the-art Neural Machine Translation for 100+ languages
Apache License 2.0

Some questions about the implementation of translate and translate_sentences in EasyNMT #7

Open svjack opened 3 years ago

svjack commented 3 years ago

Hi, I reviewed the code and want to make some suggestions. As the code logic describes, if the user does not set source_lang in the translate method, the object auto-infers a likely source_lang in translate_sentences. This behaviour works well when the input sentences are short. For long inputs, because perform_sentence_splitting is used, the long input is split into small fragments, source_lang is inferred for every fragment, and a suitable model is chosen to translate each one (grouped in grouped_sentences by detected source_lang). This suits mixed-language inputs: fragments in different languages are translated by different models and joined back together. But when the sentence splitter unfortunately splits a long input in a bad way, problems arise; consider the following example:

```python
import pandas as pd
import nltk  # requires the 'punkt' tokenizer data: nltk.download('punkt')
from easynmt import EasyNMT

model = EasyNMT('opus-mt')
sentence_splitter = nltk.sent_tokenize  # stand-in for perform_sentence_splitting
input_ = 'How many times does the rebuilt data contain cannot handle non-empty timestamp argument! 1929 and scrapped data contain cannot handle non-empty timestamp argument! 1954?'

# Output: ['en', 'en', 'eo'] because the last fragment is just "1954?",
# which language_detection maps to "eo"
print(pd.Series(sentence_splitter(input_)).map(model.language_detection).tolist())
```

With the opus-mt model, translating this sentence from "eo" into "zh" then raises an error because there is no such model to load. I understand that I can avoid the error by setting source_lang to "en" in the translate method, but I think the library should also handle this case. A few ideas:

- If language_detection and the sentence splitter run fast enough, validate every detected language pair against lang_pairs in easynmt.json (in the opus-mt folder of the models dir) before translate runs.
- Because the last fragment is too short to give language_detection reliable evidence, apply an evidence filter based on fragment length.
- Use a regex (regular expression) to filter out symbols (in this example the "?" in "1954?") and other bad tokens before they reach language_detection.
- Since inputs come in different formats (someone may pass an HTML document to the translate method), provide an interface that lets the user set a token filter (e.g. dropping "?" or "<br>") before language_detection.
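To make the last two points concrete, here is a minimal sketch of the kind of pre-filter I have in mind, wrapped around model.language_detection; the helper names, regexes, and the min_chars threshold are only illustrative, not part of EasyNMT:

```python
import re

def clean_for_detection(fragment: str) -> str:
    # Drop HTML-like tags, digits and symbols before language detection;
    # "1954?" becomes an empty string and can no longer be misdetected.
    fragment = re.sub(r'<[^>]+>', ' ', fragment)
    fragment = re.sub(r'[\d\W_]+', ' ', fragment)
    return fragment.strip()

def detect_fragment_lang(model, fragment: str, document_lang: str, min_chars: int = 10) -> str:
    # Fall back to the document-level language when the cleaned fragment
    # is too short to give reliable evidence.
    cleaned = clean_for_detection(fragment)
    if len(cleaned) < min_chars:
        return document_lang
    return model.language_detection(cleaned)
```

With a filter like this, the "1954?" fragment in the example above would simply inherit the document-level "en" detection instead of being mapped to "eo".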

The example above is one sample from a dataset I was translating, so when this error occurred I lost all of the previously translated results because of a single exception. Since the actual translation runs in batches, people may want to keep the successful batches by setting a small batch_size and collecting the batches that succeed. I hope a future version will support collecting successful batches so that, for a long input of documents (whether measured as a list or as a string), not everything is lost from the final output of the translate method.
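What I mean is something like the following sketch; the helper name and error handling are mine, not part of the EasyNMT API:

```python
def translate_in_batches(model, documents, target_lang, batch_size=8):
    # Translate in small batches and keep whatever succeeded instead of
    # losing everything when a single batch raises an exception.
    translated, failed = [], []
    for start in range(0, len(documents), batch_size):
        batch = documents[start:start + batch_size]
        try:
            translated.extend(model.translate(batch, target_lang=target_lang))
        except Exception as exc:
            failed.append((start, batch, exc))
    return translated, failed
```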

nreimers commented 3 years ago

This is a good point, I will add document-wide language detection for this.
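Roughly, the idea is to detect the language once on the full document and then pass it along explicitly; a minimal sketch, not the final implementation:

```python
# Detect the language once on the whole document, so short fragments
# such as "1954?" no longer pick the wrong translation model.
doc_lang = model.language_detection(input_)   # 'en' for the example above
translation = model.translate(input_, source_lang=doc_lang, target_lang='zh')
```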

Regarding the translation of long inputs: I recommend using the translate_stream method: https://github.com/UKPLab/EasyNMT/blob/main/examples/translation_streaming.py

It yields translated documents as soon as they are translated. These can then be written, e.g., to a text file.
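A minimal sketch along the lines of the linked streaming example; the document list, file name, and parameter values here are just illustrative:

```python
from easynmt import EasyNMT

model = EasyNMT('opus-mt')
documents = ['First document ...', 'Second document ...']

with open('translations.txt', 'w', encoding='utf8') as fOut:
    # translate_stream yields results chunk by chunk, so everything translated
    # so far is already written out even if a later document fails.
    for translation in model.translate_stream(documents, target_lang='de', chunk_size=8):
        fOut.write(translation.strip() + '\n')
```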