facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
MIT License

Parallel (batched) translation from different source languages #4716

Closed molokanov50 closed 2 years ago

molokanov50 commented 2 years ago

Hi team,

Parallel (batched) translation from different source languages is of particular interest. The current obstacle is that the tokenizer depends on a single specified source language before input embeddings can be produced. E.g., within the transformers package, I first have to load a tokenizer with the source language indicated, tokenizer = AutoTokenizer.from_pretrained("nllb-200-3.3B", use_auth_token=True, src_lang='eng_Latn'), and only then can I define a batch of inputs: inputs = tokenizer.prepare_seq2seq_batch(list_of_texts, return_tensors="pt").to(device).
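
For reference, my current single-source workflow looks roughly like this (model name, device, texts, and the target language fra_Latn are just placeholders):

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

device = "cuda"  # placeholder
list_of_texts = ["Hello world.", "How are you?"]  # placeholder English sentences

# The tokenizer is bound to a single source language at load time
tokenizer = AutoTokenizer.from_pretrained(
    "nllb-200-3.3B", use_auth_token=True, src_lang="eng_Latn"
)
model = AutoModelForSeq2SeqLM.from_pretrained("nllb-200-3.3B", use_auth_token=True).to(device)

# Only now can the batch be defined, and it is tied to eng_Latn
inputs = tokenizer.prepare_seq2seq_batch(list_of_texts, return_tensors="pt").to(device)
translated = model.generate(**inputs, forced_bos_token_id=tokenizer.lang_code_to_id["fra_Latn"])
print(tokenizer.batch_decode(translated, skip_special_tokens=True))
```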

Is there a way to organize the workflow differently, for example, to make batches of language-independent input embeddings feasible? Has anybody already done this in practice? Any suggestions for a code snippet to use?

gmryu commented 2 years ago

I am no expert and I merely want to share my point of view.

The actual question is: how do you make a tokenizer judge which language a sentence is written in? There must be an ID token in front of the sentence, e.g. <en> I like apples., to tell the tokenizer the language (or an exhaustive program that identifies it by reading the characters).

If so, you can load all the tokenizers you need first, then switch tokenizers whenever a sentence's language differs from the previous sentence's. Alternatively, you can collect sentences written in the same language to form a batch, then shuffle/sample from those batches to make a multilingual batch, as in the sketch below. I believe this is the only way.
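
Something along these lines, for example (untested; the model name and language codes are just placeholders, and language detection is left out):

```python
from collections import defaultdict
from transformers import AutoTokenizer

MODEL_NAME = "nllb-200-3.3B"  # placeholder checkpoint

# (sentence, source-language code) pairs; how you detect the language is up to you
samples = [
    ("I like apples.", "eng_Latn"),
    ("J'aime les pommes.", "fra_Latn"),
    ("Me gustan las manzanas.", "spa_Latn"),
]

# Load one tokenizer per source language that occurs in the data
tokenizers = {
    lang: AutoTokenizer.from_pretrained(MODEL_NAME, use_auth_token=True, src_lang=lang)
    for lang in {lang for _, lang in samples}
}

# Collect sentences of the same language into groups and tokenize each group
# with its own tokenizer, so every group carries the right language ID token
groups = defaultdict(list)
for text, lang in samples:
    groups[lang].append(text)

monolingual_batches = {
    lang: tokenizers[lang](texts, return_tensors="pt", padding=True)
    for lang, texts in groups.items()
}
# Each value is a BatchEncoding (input_ids / attention_mask); shuffle or sample
# across them afterwards to assemble a multilingual batch
```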

molokanov50 commented 2 years ago

I figured out everything I need to sample BatchEncoding objects as you said: the input_ids and attention_mask attributes need to be sampled correctly. On the whole, your approach helped me a lot, thanks!
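
For anyone else who lands here, the sampling step might look something like this (a simplified, untested sketch that assumes the per-language batches and tokenizers from the snippet above, plus a shared pad token):

```python
import torch

def merge_batches(batches, pad_token_id):
    """Concatenate several BatchEncoding-like dicts into one right-padded batch."""
    max_len = max(b["input_ids"].shape[1] for b in batches)
    ids, masks = [], []
    for b in batches:
        pad = max_len - b["input_ids"].shape[1]
        # Pad input_ids with the pad token and attention_mask with zeros
        ids.append(torch.nn.functional.pad(b["input_ids"], (0, pad), value=pad_token_id))
        masks.append(torch.nn.functional.pad(b["attention_mask"], (0, pad), value=0))
    return {"input_ids": torch.cat(ids), "attention_mask": torch.cat(masks)}

# All NLLB tokenizers share the same vocabulary, so one pad_token_id works for all
pad_id = next(iter(tokenizers.values())).pad_token_id
multilingual_batch = merge_batches(list(monolingual_batches.values()), pad_id)
```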