Closed: eyalmazuz closed this issue 1 year ago
As far as I know, the NLLB models are mostly trained on single sentences, so there is no reliable way to translate with context.
You can try translating a paragraph at once, but the model will tend to generate a single sentence as output.
Regarding `target_prefix`, it is used to force the beginning of a translation and is always included in the result.
What you can try is to finetune NLLB on the OpenSubtitles dataset using the "docify" transform.
How to use the docify transform: https://github.com/OpenNMT/OpenNMT-py/blob/master/docs/source/FAQ.md#context--doc-aware-transform
How to finetune NLLB: https://forum.opennmt.net/t/finetuning-and-curating-nllb-200-with-opennmt-py/5238/27
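The idea behind a doc-aware transform is simply to turn consecutive subtitle lines into context-augmented training pairs. A hand-rolled Python sketch of that idea (not the actual OpenNMT-py transform code, and the separator token here is just a placeholder):

```python
# Rough illustration of a doc-aware ("docify"-style) training sample:
# each sentence pair gets its previous sentence(s) prepended as context.
# This is NOT the actual transform; SEP is a placeholder separator token.

SEP = "|||"

def build_doc_aware_pairs(src_sentences, tgt_sentences, max_context=1):
    """Attach up to `max_context` previous sentences to every source/target pair."""
    pairs = []
    for i, (src, tgt) in enumerate(zip(src_sentences, tgt_sentences)):
        src_ctx = src_sentences[max(0, i - max_context):i]
        tgt_ctx = tgt_sentences[max(0, i - max_context):i]
        pairs.append((
            f" {SEP} ".join(src_ctx + [src]),
            f" {SEP} ".join(tgt_ctx + [tgt]),
        ))
    return pairs

# The second pair becomes ("src1 ||| src2", "tgt1 ||| tgt2"), so the model
# learns to translate a sentence given the previous one as context.
print(build_doc_aware_pairs(["src1", "src2"], ["tgt1", "tgt2"]))
```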
> As far as I know, the NLLB models are mostly trained on single sentences, so there is no reliable way to translate with context.
> You can try translating a paragraph at once, but the model will tend to generate a single sentence as output.
> Regarding `target_prefix`, it is used to force the beginning of a translation and is always included in the result.
I thought about concatenating 2 sentences and using the first as `target_prefix`; this way the model doesn't need to waste effort translating the first one, and the second sentence is translated "conditionally" on the first.
But as seen in the code example in my original post, there's an issue with this which makes the model output the same sentence it translated the first time over and over again.
Even when `target_prefix` and the input sentence itself don't contain anything that could be translated as the first sentence (in the list). Seems like a bug (in my code or in the package).
Is there a chance I'm doing something wrong?
(updated the code and output to make it a tiny bit less messy)
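Concretely, what I mean is a single call of this shape (a simplified sketch, not my actual script; the model path and language codes are placeholders, and the tokenization follows the usual HF NLLB recipe):

```python
import ctranslate2
import transformers

# Placeholders: converted NLLB model directory and language codes.
translator = ctranslate2.Translator("nllb-200-distilled-600M-ct2", device="cpu")
tokenizer = transformers.AutoTokenizer.from_pretrained(
    "facebook/nllb-200-distilled-600M", src_lang="jpn_Jpan"
)

sent1 = "前の文です。"  # sentence 1 (already translated earlier)
sent2 = "次の文です。"  # sentence 2 (the one I actually want now)
sent1_translation = "This is the previous sentence."

# Source = sentence 1 + sentence 2 concatenated.
source_tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(f"{sent1} {sent2}"))

# Target prefix = target language token + translation of sentence 1, so the
# model only has to generate the translation of sentence 2.
prefix_tokens = ["eng_Latn"] + tokenizer.convert_ids_to_tokens(
    tokenizer.encode(sent1_translation, add_special_tokens=False)
)

results = translator.translate_batch([source_tokens], target_prefix=[prefix_tokens])
print(results[0].hypotheses[0])  # the prefix tokens followed by the new translation
```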
> What you can try is to finetune NLLB on the OpenSubtitles dataset using the "docify" transform.
> How to use the docify transform: https://github.com/OpenNMT/OpenNMT-py/blob/master/docs/source/FAQ.md#context--doc-aware-transform
> How to finetune NLLB: https://forum.opennmt.net/t/finetuning-and-curating-nllb-200-with-opennmt-py/5238/27
I'm not sure this is the solution I'm looking for; maybe with further explanation it could be, but OpenSubtitles is a very, very large corpus whose subtitles are not aligned and whose quality is quite bad (I tried using it in the past for another model that translates from Arabic to Hebrew, and the results were a BLEU score of 5 at best).
Also, I want to keep the multilingual nature of NLLB, being able to translate between any pair of the 200 languages, so fine-tuning a specific language pair to be context-aware is not really what I'm hoping for...
As Guillaume said, NLLB has been trained on single sentence to single sentence, so it will be far from optimal to try what you are trying to do. Even if you want to try, it has to be done as follows: "source_sent1. source_sent2" is your source, "target_sent1." is your target prefix, and you feed both to the model to spit out target_sent2. Having said that, you need to pay attention to the language prefix and source suffix.
On the other subject: finetuning does not mean losing previous knowledge, it's just giving new information so that the model adapts (same concept as instruction finetuning of LLMs).
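Spelled out at the token level, that layout is (a sketch with the raw SentencePiece model; the model path and language codes are placeholders):

```python
import sentencepiece as spm

# Placeholder path to the NLLB/FLORES-200 SentencePiece model.
sp = spm.SentencePieceProcessor(model_file="flores200_sacrebleu_tokenizer_spm.model")

# Source side: source language token as prefix, </s> as suffix.
source = ["eng_Latn"] + sp.encode("source_sent1. source_sent2", out_type=str) + ["</s>"]

# Target prefix: target language token + the already known translation of sentence 1.
target_prefix = ["heb_Hebr"] + sp.encode("target_sent1.", out_type=str)

# translator.translate_batch([source], target_prefix=[target_prefix])
# should then generate the translation of source_sent2 after the prefix.
```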
Your code snippet looks wrong because you should remove the target prefix from the result and only use the last translation as context for the next one.
However, you will then see that the last translation is often missing because the model usually wants to terminate the generation after the last punctuation mark from the prefix.
You can try setting `min_decoding_length` to `len(target_prefix) + 1` to force the model to produce at least one token after the prefix.

> this way the model doesn't need to waste effort translating the first one

The target prefix still needs to be forwarded through the decoder, so the performance of translating with or without the target prefix is similar.
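Roughly, the loop would then look like this (a sketch with placeholder model path and language codes, using the usual HF tokenizer setup, not a drop-in for your script):

```python
import ctranslate2
import transformers

# Sketch: keep only the newly generated tokens as context, and force at least
# one token after the prefix. Paths and language codes are placeholders.
translator = ctranslate2.Translator("nllb-200-distilled-600M-ct2", device="cpu")
tokenizer = transformers.AutoTokenizer.from_pretrained(
    "facebook/nllb-200-distilled-600M", src_lang="jpn_Jpan"
)

sentences = ["文1。", "文2。", "文3。"]
tgt_lang = "eng_Latn"
context = []  # target tokens of the previous sentence only

for prev_src, src in zip([""] + sentences[:-1], sentences):
    source = tokenizer.convert_ids_to_tokens(tokenizer.encode(f"{prev_src} {src}".strip()))
    prefix = [tgt_lang] + context
    result = translator.translate_batch(
        [source],
        target_prefix=[prefix],
        min_decoding_length=len(prefix) + 1,  # force tokens after the prefix
    )
    tl_tokens = result[0].hypotheses[0]
    new_tokens = tl_tokens[len(prefix):]  # drop the forced prefix from the result
    context = new_tokens                  # only the last translation is kept as context
    print(tokenizer.decode(tokenizer.convert_tokens_to_ids(new_tokens), skip_special_tokens=True))
```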
> Your code snippet looks wrong because you should remove the target prefix from the result and only use the last translation as context for the next one.
> However, you will then see that the last translation is often missing because the model usually wants to terminate the generation after the last punctuation mark from the prefix.
> You can try setting `min_decoding_length` to `len(target_prefix) + 1` to force the model to produce at least one token after the prefix.
>
> > this way the model doesn't need to waste effort translating the first one
>
> The target prefix still needs to be forwarded through the decoder, so the performance of translating with or without the target prefix is similar.
Sorry for taking your time,
I tried setting `min_decoding_length` to `len(target_prefix) + 1` and it did somewhat help in this case,
but when you say I should remove the target prefix from the result and only use the last translation as context for the next one,
do you mean adding something like:
`context = tl_tokens[len(context):]`
instead of:
`context = tl_tokens`
which takes only the new part of the translated sentence and sets it as the context? I just want to make sure.
The problem I'm trying to solve is that some languages are highly contextual (like Japanese, which I use in my examples), so translating sentence by sentence gives off translations rather than translations informed by some context.
edit: link to pastebin with the new output: https://pastebin.com/bAs1afL5 (instead of spamming the message itself)
> As Guillaume said, NLLB has been trained on single sentence to single sentence, so it will be far from optimal to try what you are trying to do. Even if you want to try, it has to be done as follows: "source_sent1. source_sent2" is your source, "target_sent1." is your target prefix, and you feed both to the model to spit out target_sent2. Having said that, you need to pay attention to the language prefix and source suffix.
> On the other subject: finetuning does not mean losing previous knowledge, it's just giving new information so that the model adapts (same concept as instruction finetuning of LLMs).
I agree that finetuning doesn't mean losing previous knowledge, but if I take a model like NLLB and fine-tune it for a specific language, the weight updates have to come at a price, and this price will be slightly degraded performance for the languages I didn't finetune for, is that wrong?
And fine-tuning a model like NLLB to be context/doc aware would require me to have (200 choose 2) language pairs of documents to use for training.
Also, a tangent question: in the link you provided for finetuning, he defines `src_prefix: "</s> eng_Latn"`, but shouldn't it be `src_suffix: "</s> eng_Latn"` instead?
From what I see from the tokenizer in my example, it adds that to the end of the sentence.
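This is what I mean, checking with the HF tokenizer (illustrative output, exact subword pieces aside):

```python
import transformers

tok = transformers.AutoTokenizer.from_pretrained(
    "facebook/nllb-200-distilled-600M", src_lang="eng_Latn"
)
print(tok.convert_ids_to_tokens(tok("Hello world").input_ids))
# What I see: the language code ends up after the sentence, e.g. [..., '</s>', 'eng_Latn']
```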
> `context = tl_tokens[len(context):]`

Yes, something like this.

> Also, a tangent question: in the link you provided for finetuning, he defines `src_prefix: "</s> eng_Latn"`, but shouldn't it be `src_suffix: "</s> eng_Latn"` instead?

It should be:

* `src_prefix: "eng_Latn"`
* `src_suffix: "</s>"`
> > `context = tl_tokens[len(context):]`
>
> Yes, something like this.
>
> > Also, a tangent question: in the link you provided for finetuning, he defines `src_prefix: "</s> eng_Latn"`, but shouldn't it be `src_suffix: "</s> eng_Latn"` instead?
>
> It should be:
>
> * `src_prefix: "eng_Latn"`
> * `src_suffix: "</s>"`
Why? In the output example of my code, the NLLB tokenizer puts the language code token after the sentence.
Ah, I see, thank you very much. I updated the transformers library and I get the new behavior.
I think for now I'll close the issue, and if I have something new I'll open a new issue if needed.
Thank you both for your detailed answers.
I'm using Meta's NLLB for translation. I have a situation where I have a stream of sentences in language A that I want to translate to language B, something akin to subtitles in movies or captions in a YouTube video.
Right now I translate each sentence separately, something like this:
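(The snippet below is a simplified stand-in for my actual code: placeholder model path and language codes, standard HF tokenizer + CTranslate2 setup for NLLB.)

```python
import ctranslate2
import transformers

# Simplified stand-in for the per-sentence loop (placeholder paths/codes).
translator = ctranslate2.Translator("nllb-200-distilled-600M-ct2", device="cpu")
tokenizer = transformers.AutoTokenizer.from_pretrained(
    "facebook/nllb-200-distilled-600M", src_lang="jpn_Jpan"
)

with open("sentences.txt") as f:
    sentences = [line.strip() for line in f if line.strip()]

for sentence in sentences:
    source = tokenizer.convert_ids_to_tokens(tokenizer.encode(sentence))
    results = translator.translate_batch([source], target_prefix=[["eng_Latn"]])
    target = results[0].hypotheses[0][1:]  # drop the target language token
    print(tokenizer.decode(tokenizer.convert_tokens_to_ids(target), skip_special_tokens=True))
```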
The issue is that I lose information translating this way, since obviously the sentences are correlated, but I want to keep it as a stream of sentences; just like subtitles, there's no reason to concat everything and put text the size of 3 paragraphs over the video.
So I was thinking whether there's a way that I could, in theory, condition the translation on previous texts, something like:
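(pseudocode of the idea; `translate_with_context` is hypothetical, not an existing API)

```python
# Pseudocode only: translate each new sentence while "seeing" the previous
# sentence and/or its translation. translate_with_context() does not exist in
# CTranslate2; it just illustrates the interface I'd want.
previous_source, previous_translation = None, None
for sentence in sentences:
    translation = translate_with_context(
        sentence,
        source_context=previous_source,
        target_context=previous_translation,
    )
    print(translation)
    previous_source, previous_translation = sentence, translation
```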
I know that `translate_batch` has a `target_prefix` parameter, but my understanding is that the target prefix must be part of the translation. So for example, if I enter a sentence in language A: A4 A5 A6 A7 A8 and want to get a translation in language B: B7 B8 B9 B10, then if `target_prefix="B4 B5 B6"` I need to supply the translator model the text: A1 A2 A3 A4 A5 A6 A7 A8, which I don't necessarily want to do since it'll create longer sequences and thus potentially slow down the model, since attention is quadratic.
edit1: I tried using `target_prefix` but the model suddenly can't even translate.
edit2: I found that using `return_alternatives=True` fixes this issue, but it somewhat butchers the translation. The issue with `return_alternatives` is that, according to the documentation, it was supposed to return the translation without the prefix, but I get the full sentence back including the prefix.
I used the following file as `sentences.txt`, and this is the script.
The output I get is: