molokanov50 closed this issue 2 years ago.
I am no expert and merely want to share my point of view.
The actual question is how you make a tokenizer determine which language a sentence is written in. There has to be a language ID token in front of the sentence, e.g. <en> I like apples., to tell the tokenizer the language (or else a dedicated program that identifies the language by reading the characters).
If that is the case, you can load all the tokenizers you need up front and switch tokenizers whenever a sentence's language differs from the previous one's. Alternatively, you can collect sentences written in the same language into monolingual batches and then shuffle/sample from those batches to assemble a multilingual batch. As far as I can tell, this is the only way.
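A minimal sketch of that grouping idea, reusing the checkpoint name and src_lang usage from the original post; the sample sentences, language tags, and variable names are purely illustrative:

```python
from collections import defaultdict
from transformers import AutoTokenizer

# Illustrative (language, sentence) pairs; tags follow NLLB's naming scheme.
samples = [
    ("eng_Latn", "I like apples."),
    ("fra_Latn", "J'aime les pommes."),
    ("eng_Latn", "The weather is nice today."),
]

# Group sentences by source language so each group is tokenized
# with the correct src_lang setting.
by_lang = defaultdict(list)
for lang, text in samples:
    by_lang[lang].append(text)

# One tokenizer per source language; they share the same vocabulary,
# but src_lang controls the language token prepended to each sequence.
tokenizers = {
    lang: AutoTokenizer.from_pretrained(
        "nllb-200-3.3B", use_auth_token=True, src_lang=lang
    )
    for lang in by_lang
}

# Tokenize each monolingual group separately; these per-language batches
# can then be shuffled/sampled to assemble a multilingual batch.
per_lang_batches = {
    lang: tokenizers[lang](texts, padding=True, return_tensors="pt")
    for lang, texts in by_lang.items()
}
```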
I figured out everything I need to sample BatchEncoding objects as you said: the input_ids and attention_mask attributes are what need to be sampled correctly.
On the whole, your approach helped me a lot, thanks!
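For anyone else reading, a rough sketch of what such sampling can look like: take the input_ids and attention_mask tensors from per-language BatchEncoding objects, right-pad them to a common length, and stack them into one batch. The function name and the right-padding scheme are assumptions and may need adjusting for your tokenizer:

```python
import torch

def merge_encodings(encodings, pad_token_id):
    """Stack rows from several BatchEncoding objects into one batch.

    `encodings` come from tokenizers with different src_lang settings;
    shorter sequences are right-padded so all rows share one length.
    """
    max_len = max(enc["input_ids"].shape[1] for enc in encodings)
    input_ids, attention_mask = [], []
    for enc in encodings:
        ids, mask = enc["input_ids"], enc["attention_mask"]
        pad = max_len - ids.shape[1]
        if pad > 0:
            ids = torch.nn.functional.pad(ids, (0, pad), value=pad_token_id)
            mask = torch.nn.functional.pad(mask, (0, pad), value=0)
        input_ids.append(ids)
        attention_mask.append(mask)
    return {
        "input_ids": torch.cat(input_ids, dim=0),
        "attention_mask": torch.cat(attention_mask, dim=0),
    }

# Usage: batch = merge_encodings([enc_en, enc_fr], tokenizer.pad_token_id)
```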
Hi team,
The possibility of parallel translation (in a single batch) from different source languages is of particular interest. The current obstacle is that the tokenizer depends on a single source language specified before the model inputs are built. For example, with the transformers package I first have to load a tokenizer with the source language indicated:
tokenizer = AutoTokenizer.from_pretrained("nllb-200-3.3B", use_auth_token=True, src_lang='eng_Latn')
and only then can I define a batch of inputs:
inputs = tokenizer.prepare_seq2seq_batch(list_of_texts, return_tensors="pt").to(device)
Is it possible to organize the workflow differently, for example, to make batches of language-independent inputs feasible? Has anybody already done this in practice? Any suggestions for a code snippet to use?
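For reference, the single-source-language workflow described above as a complete snippet; the device handling and example texts are assumptions added for completeness:

```python
import torch
from transformers import AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"

# The source language is fixed once, when the tokenizer is loaded,
# so every text in the batch below is assumed to be English.
tokenizer = AutoTokenizer.from_pretrained(
    "nllb-200-3.3B", use_auth_token=True, src_lang="eng_Latn"
)

list_of_texts = ["I like apples.", "The weather is nice today."]
# Note: prepare_seq2seq_batch is deprecated in newer transformers releases;
# calling the tokenizer directly on list_of_texts is the equivalent here.
inputs = tokenizer.prepare_seq2seq_batch(list_of_texts, return_tensors="pt").to(device)
```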