OpenNMT / CTranslate2

Fast inference engine for Transformer models
https://opennmt.net/CTranslate2
MIT License

How to add context to translation models? #1213

Closed eyalmazuz closed 1 year ago

eyalmazuz commented 1 year ago

I'm using Meta's NLLB for translation. I have a situation where I have a stream of sentences in language A that I want to translate to language B, something akin to subtitles in a movie or captions in a YouTube video.

Right now I translate each sentence separately, something like this:

for sentence in sentences:
    sentence_tokenized = [tokenizer.convert_ids_to_tokens(tokenizer.encode(sentence))]
    translations = translator.translate_batch(sentence_tokenized, target_prefix=[[target_language]])
    translated_sentence = tokenizer.decode(tokenizer.convert_tokens_to_ids(translations[0].hypotheses[0][1:]))

The issue is that I lose information translating this way, since the sentences are obviously correlated. But I want to keep it as a stream of sentences: just like subtitles, there's no reason to concatenate everything and overlay three paragraphs of text on the video.

So I was wondering if there is a way to, in theory, condition the translation on previous text, something like:

prev_sentences = []
for sentence in sentences:
    sentence_tokenized = [tokenizer.convert_ids_to_tokens(tokenizer.encode(sentence))]
    translations = translator.translate_batch(sentence_tokenized, target_prefix=[[target_language]], context=prev_sentences)
    # or prev_sentences could also be the translated versions.
    translated_sentence = tokenizer.decode(tokenizer.convert_tokens_to_ids(translations[0].hypotheses[0][1:]))
    prev_sentences.append(sentence)

I know that translate_batch has a target_prefix parameter, but my understanding is that the target_prefix must be part of the translation. So, for example, if I enter a sentence in language A: A4 A5 A6 A7 A8

and want to get the translation in language B: B7 B8 B9 B10,

then if target_prefix="B4 B5 B6" I need to supply the translator with the text: A1 A2 A3 A4 A5 A6 A7 A8,

which I don't necessarily want to do, since it creates longer sequences and thus potentially slows the model down, given that attention is quadratic.

edit 1: I tried using target_prefix, but then the model suddenly can't even translate.

edit 2: I found that using return_alternatives=True fixes this issue, though it somewhat butchers the translation. But the issue with return_alternatives is that the documentation says:

Combining target_prefix with the return_alternatives flag returns alternative sequences just after the prefix:

so it was supposed to return the translation without the prefix, but I get the full sentence back, including the prefix.

I used the following file as sentences.txt

立ちまち、人玉の巻きができる
次の束に取り掛かりながら
彼女はふと蜜葉地のことを思う
この夏、ご箱あった素箱が、アカリンダリと
スムーシにやられて全滅したという
ダニモス済むしも蜜葉地にとっては強敵だが、 蜂の群れが強くなります。
ければ 全滅するようなことはまずない

and this is the script

import ctranslate2
import numpy as np
import transformers

source_language = 'jpn_Jpan'
target_language = 'eng_Latn'

with open('sentences.txt', 'r') as f:
    segments = [s.strip() for s in f.readlines()]

translator = ctranslate2.Translator("MAWT/models/nllb-200-distilled-600M", device="cpu", compute_type="int8")
tokenizer = transformers.AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M", src_lang=source_language)

context = []
full_sentence = ""
for segment in segments:
    full_sentence += ' ' + segment

    print(f"{segment=}")
    print(f"{full_sentence=}")

    sentence_tokenized = [tokenizer.convert_ids_to_tokens(tokenizer.encode(full_sentence))]
    target_prefix = [target_language] + tokenizer.convert_ids_to_tokens(context)
    print(f"{sentence_tokenized=}")
    print(f"{target_prefix=}")

    translations = translator.translate_batch(sentence_tokenized,
                                              target_prefix=[target_prefix],
                                              beam_size=4,
                                              num_hypotheses=4,
                                              min_alternative_expansion_prob=0.001,)

    tl_word_tokens = translations[0].hypotheses[0][1:]

    tl_tokens = tokenizer.convert_tokens_to_ids(tl_word_tokens)

    print(f"Context before updaing {context=}")
    context = tl_tokens
    print(f"Context post updaing  {context=}")

    translated_sentence = tokenizer.decode(tl_tokens)

    print(f'{translations=}')
    print(f'{tl_word_tokens=}')
    print(f'{tl_tokens=}')
    print(f'{translated_sentence=}')
    print('\n\n')

    full_sentence = segment

the output I get is:

segment='\ufeff立ちまち、人玉の巻きができる'
full_sentence=' \ufeff立ちまち、人玉の巻きができる'
sentence_tokenized=[['▁', '立ち', 'まち', '、', '人', '玉', 'の', '巻き', 'ができる', '</s>', 'jpn_Jpan']]
target_prefix=['eng_Latn']
Context before updaing context=[]
Context post updaing  context=[3555, 2125, 7078, 1738, 540, 58407, 349, 32639, 248066, 248075]
translations=[TranslationResult(hypotheses=[['eng_Latn', '▁You', '▁can', '▁stand', '▁up', '▁and', '▁roll', '▁the', '▁ball', 's', '.']], scores=[], attention=[])]
tl_word_tokens=['▁You', '▁can', '▁stand', '▁up', '▁and', '▁roll', '▁the', '▁ball', 's', '.']
tl_tokens=[3555, 2125, 7078, 1738, 540, 58407, 349, 32639, 248066, 248075]
translated_sentence='You can stand up and roll the balls.'

segment='次の束に取り掛かりながら'
full_sentence='\ufeff立ちまち、人玉の巻きができる 次の束に取り掛かりながら'
sentence_tokenized=[['▁', '立ち', 'まち', '、', '人', '玉', 'の', '巻き', 'ができる', '▁次の', '束', 'に取り', '掛', 'かり', 'ながら', '</s>', 'jpn_Jpan']]
target_prefix=['eng_Latn', '▁You', '▁can', '▁stand', '▁up', '▁and', '▁roll', '▁the', '▁ball', 's', '.']
Context before updaing context=[3555, 2125, 7078, 1738, 540, 58407, 349, 32639, 248066, 248075]
Context post updaing  context=[3555, 2125, 7078, 1738, 540, 58407, 349, 32639, 248066, 248075]
translations=[TranslationResult(hypotheses=[['eng_Latn', '▁You', '▁can', '▁stand', '▁up', '▁and', '▁roll', '▁the', '▁ball', 's', '.']], scores=[], attention=[])]
tl_word_tokens=['▁You', '▁can', '▁stand', '▁up', '▁and', '▁roll', '▁the', '▁ball', 's', '.']
tl_tokens=[3555, 2125, 7078, 1738, 540, 58407, 349, 32639, 248066, 248075]
translated_sentence='You can stand up and roll the balls.'

segment='彼女はふと蜜葉地のことを思う'
full_sentence='次の束に取り掛かりながら 彼女はふと蜜葉地のことを思う'
sentence_tokenized=[['▁次の', '束', 'に取り', '掛', 'かり', 'ながら', '▁彼女は', 'ふ', 'と', '蜜', '葉', '地', 'のことを', '思う', '</s>', 'jpn_Jpan']]
target_prefix=['eng_Latn', '▁You', '▁can', '▁stand', '▁up', '▁and', '▁roll', '▁the', '▁ball', 's', '.']
Context before updaing context=[3555, 2125, 7078, 1738, 540, 58407, 349, 32639, 248066, 248075]
Context post updaing  context=[3555, 2125, 7078, 1738, 540, 58407, 349, 32639, 248066, 248075]
translations=[TranslationResult(hypotheses=[['eng_Latn', '▁You', '▁can', '▁stand', '▁up', '▁and', '▁roll', '▁the', '▁ball', 's', '.']], scores=[], attention=[])]
tl_word_tokens=['▁You', '▁can', '▁stand', '▁up', '▁and', '▁roll', '▁the', '▁ball', 's', '.']
tl_tokens=[3555, 2125, 7078, 1738, 540, 58407, 349, 32639, 248066, 248075]
translated_sentence='You can stand up and roll the balls.'

segment='この夏、ご箱あった素箱が、アカリンダリと'
full_sentence='彼女はふと蜜葉地のことを思う この夏、ご箱あった素箱が、アカリンダリと'
sentence_tokenized=[['▁彼女は', 'ふ', 'と', '蜜', '葉', '地', 'のことを', '思う', '▁この', '夏', '、', 'ご', '箱', 'あった', '素', '箱', 'が', '、', 'ア', 'カ', 'リン', 'ダ', 'リ', 'と', '</s>', 'jpn_Jpan']]
target_prefix=['eng_Latn', '▁You', '▁can', '▁stand', '▁up', '▁and', '▁roll', '▁the', '▁ball', 's', '.']
Context before updaing context=[3555, 2125, 7078, 1738, 540, 58407, 349, 32639, 248066, 248075]
Context post updaing  context=[3555, 2125, 7078, 1738, 540, 58407, 349, 32639, 248066, 248075]
translations=[TranslationResult(hypotheses=[['eng_Latn', '▁You', '▁can', '▁stand', '▁up', '▁and', '▁roll', '▁the', '▁ball', 's', '.']], scores=[], attention=[])]
tl_word_tokens=['▁You', '▁can', '▁stand', '▁up', '▁and', '▁roll', '▁the', '▁ball', 's', '.']
tl_tokens=[3555, 2125, 7078, 1738, 540, 58407, 349, 32639, 248066, 248075]
translated_sentence='You can stand up and roll the balls.'

segment='スムーシにやられて全滅したという'
full_sentence='この夏、ご箱あった素箱が、アカリンダリと スムーシにやられて全滅したという'
sentence_tokenized=[['▁この', '夏', '、', 'ご', '箱', 'あった', '素', '箱', 'が', '、', 'ア', 'カ', 'リン', 'ダ', 'リ', 'と', '▁ス', 'ムー', 'シ', 'に', 'や', 'られて', '全', '滅', 'したという', '</s>', 'jpn_Jpan']]
target_prefix=['eng_Latn', '▁You', '▁can', '▁stand', '▁up', '▁and', '▁roll', '▁the', '▁ball', 's', '.']
Context before updaing context=[3555, 2125, 7078, 1738, 540, 58407, 349, 32639, 248066, 248075]
Context post updaing  context=[3555, 2125, 7078, 1738, 540, 58407, 349, 32639, 248066, 248075]
translations=[TranslationResult(hypotheses=[['eng_Latn', '▁You', '▁can', '▁stand', '▁up', '▁and', '▁roll', '▁the', '▁ball', 's', '.']], scores=[], attention=[])]
tl_word_tokens=['▁You', '▁can', '▁stand', '▁up', '▁and', '▁roll', '▁the', '▁ball', 's', '.']
tl_tokens=[3555, 2125, 7078, 1738, 540, 58407, 349, 32639, 248066, 248075]
translated_sentence='You can stand up and roll the balls.'

segment='ダニモス済むしも蜜葉地にとっては強敵だが、 蜂の群れが強くなります。'
full_sentence='スムーシにやられて全滅したという ダニモス済むしも蜜葉地にとっては強敵だが、 蜂の群れが強くなります。'
sentence_tokenized=[['▁ス', 'ムー', 'シ', 'に', 'や', 'られて', '全', '滅', 'したという', '▁ダ', 'ニ', 'モ', 'ス', '済', 'む', 'しも', '蜜', '葉', '地', 'にとっては', '強', '敵', 'だが', '、', '▁', '蜂', 'の', '群', 'れ', 'が', '強く', 'なります', '。', '</s>', 'jpn_Jpan']]
target_prefix=['eng_Latn', '▁You', '▁can', '▁stand', '▁up', '▁and', '▁roll', '▁the', '▁ball', 's', '.']
Context before updaing context=[3555, 2125, 7078, 1738, 540, 58407, 349, 32639, 248066, 248075]
Context post updaing  context=[3555, 2125, 7078, 1738, 540, 58407, 349, 32639, 248066, 248075, 50106, 40321, 5884, 248116, 279, 8857, 223299, 76, 3559, 811, 349, 1579, 983, 248079, 5884, 248116, 119, 9, 214009, 20523, 162883, 202, 349, 183109, 222, 33, 248079, 3605, 5884, 248116, 119, 56794, 15903, 14, 248075]
translations=[TranslationResult(hypotheses=[['eng_Latn', '▁You', '▁can', '▁stand', '▁up', '▁and', '▁roll', '▁the', '▁ball', 's', '.', '▁Even', '▁though', '▁they', "'", 've', '▁been', '▁wip', 'ed', '▁out', '▁by', '▁the', '▁Mo', 'ose', ',', '▁they', "'", 're', '▁a', '▁formid', 'able', '▁enemy', '▁to', '▁the', '▁honey', 'be', 'es', ',', '▁but', '▁they', "'", 're', '▁getting', '▁strong', 'er', '.']], scores=[], attention=[])]
tl_word_tokens=['▁You', '▁can', '▁stand', '▁up', '▁and', '▁roll', '▁the', '▁ball', 's', '.', '▁Even', '▁though', '▁they', "'", 've', '▁been', '▁wip', 'ed', '▁out', '▁by', '▁the', '▁Mo', 'ose', ',', '▁they', "'", 're', '▁a', '▁formid', 'able', '▁enemy', '▁to', '▁the', '▁honey', 'be', 'es', ',', '▁but', '▁they', "'", 're', '▁getting', '▁strong', 'er', '.']
tl_tokens=[3555, 2125, 7078, 1738, 540, 58407, 349, 32639, 248066, 248075, 50106, 40321, 5884, 248116, 279, 8857, 223299, 76, 3559, 811, 349, 1579, 983, 248079, 5884, 248116, 119, 9, 214009, 20523, 162883, 202, 349, 183109, 222, 33, 248079, 3605, 5884, 248116, 119, 56794, 15903, 14, 248075]
translated_sentence="You can stand up and roll the balls. Even though they've been wiped out by the Moose, they're a formidable enemy to the honeybees, but they're getting stronger."

segment='ければ 全滅するようなことはまずない'
full_sentence='ダニモス済むしも蜜葉地にとっては強敵だが、 蜂の群れが強くなります。 ければ 全滅するようなことはまずない'
sentence_tokenized=[['▁ダ', 'ニ', 'モ', 'ス', '済', 'む', 'しも', '蜜', '葉', '地', 'にとっては', '強', '敵', 'だが', '、', '▁', '蜂', 'の', '群', 'れ', 'が', '強く', 'なります', '。', '▁', 'ければ', '▁全', '滅', 'するような', 'ことは', 'まず', 'ない', '</s>', 'jpn_Jpan']]
target_prefix=['eng_Latn', '▁You', '▁can', '▁stand', '▁up', '▁and', '▁roll', '▁the', '▁ball', 's', '.', '▁Even', '▁though', '▁they', "'", 've', '▁been', '▁wip', 'ed', '▁out', '▁by', '▁the', '▁Mo', 'ose', ',', '▁they', "'", 're', '▁a', '▁formid', 'able', '▁enemy', '▁to', '▁the', '▁honey', 'be', 'es', ',', '▁but', '▁they', "'", 're', '▁getting', '▁strong', 'er', '.']
Context before updaing context=[3555, 2125, 7078, 1738, 540, 58407, 349, 32639, 248066, 248075, 50106, 40321, 5884, 248116, 279, 8857, 223299, 76, 3559, 811, 349, 1579, 983, 248079, 5884, 248116, 119, 9, 214009, 20523, 162883, 202, 349, 183109, 222, 33, 248079, 3605, 5884, 248116, 119, 56794, 15903, 14, 248075]
Context post updaing  context=[3555, 2125, 7078, 1738, 540, 58407, 349, 32639, 248066, 248075, 50106, 40321, 5884, 248116, 279, 8857, 223299, 76, 3559, 811, 349, 1579, 983, 248079, 5884, 248116, 119, 9, 214009, 20523, 162883, 202, 349, 183109, 222, 33, 248079, 3605, 5884, 248116, 119, 56794, 15903, 14, 248075]
translations=[TranslationResult(hypotheses=[['eng_Latn', '▁You', '▁can', '▁stand', '▁up', '▁and', '▁roll', '▁the', '▁ball', 's', '.', '▁Even', '▁though', '▁they', "'", 've', '▁been', '▁wip', 'ed', '▁out', '▁by', '▁the', '▁Mo', 'ose', ',', '▁they', "'", 're', '▁a', '▁formid', 'able', '▁enemy', '▁to', '▁the', '▁honey', 'be', 'es', ',', '▁but', '▁they', "'", 're', '▁getting', '▁strong', 'er', '.']], scores=[], attention=[])]
tl_word_tokens=['▁You', '▁can', '▁stand', '▁up', '▁and', '▁roll', '▁the', '▁ball', 's', '.', '▁Even', '▁though', '▁they', "'", 've', '▁been', '▁wip', 'ed', '▁out', '▁by', '▁the', '▁Mo', 'ose', ',', '▁they', "'", 're', '▁a', '▁formid', 'able', '▁enemy', '▁to', '▁the', '▁honey', 'be', 'es', ',', '▁but', '▁they', "'", 're', '▁getting', '▁strong', 'er', '.']
tl_tokens=[3555, 2125, 7078, 1738, 540, 58407, 349, 32639, 248066, 248075, 50106, 40321, 5884, 248116, 279, 8857, 223299, 76, 3559, 811, 349, 1579, 983, 248079, 5884, 248116, 119, 9, 214009, 20523, 162883, 202, 349, 183109, 222, 33, 248079, 3605, 5884, 248116, 119, 56794, 15903, 14, 248075]
translated_sentence="You can stand up and roll the balls. Even though they've been wiped out by the Moose, they're a formidable enemy to the honeybees, but they're getting stronger."
guillaumekln commented 1 year ago

As far as I know, the NLLB models are mostly trained on single sentences so there is no reliable way to translate with a context.

You can try translating a paragraph at once but the model will tend to generate a single sentence in output.

Regarding target_prefix, it is used to force the beginning of a translation and is always included in the result.
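For illustration, a minimal sketch of handling that with the Python API (translator, tokenizer and target_language as set up in the script above; source_tokens stands for one tokenized source sentence, and the slicing is just one way to drop the forced tokens):

# Sketch only: the forced prefix comes back at the start of each hypothesis,
# so slice it off before decoding.
prefix = [target_language]  # plus any forced target-side tokens
results = translator.translate_batch([source_tokens], target_prefix=[prefix])
hypothesis = results[0].hypotheses[0]   # starts with the forced prefix
generated = hypothesis[len(prefix):]    # keep only what the model generated
print(tokenizer.decode(tokenizer.convert_tokens_to_ids(generated)))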

vince62s commented 1 year ago

What you can try is to finetune NLLB on the OpenSubtitles dataset using the "docify" transform.
How to use the docify transform: https://github.com/OpenNMT/OpenNMT-py/blob/master/docs/source/FAQ.md#context--doc-aware-transform
How to finetune NLLB: https://forum.opennmt.net/t/finetuning-and-curating-nllb-200-with-opennmt-py/5238/27

eyalmazuz commented 1 year ago

As far as I know, the NLLB models are mostly trained on single sentences so there is no reliable way to translate with a context.

You can try translating a paragraph at once but the model will tend to generate a single sentence in output.

Regarding target_prefix, it is used to force the beginning of a translation and is always included in the result.

I thought about concatenating 2 sentences and using the translation of the first as target_prefix; this way the model doesn't need to waste effort translating the first one, and the second sentence is translated """conditionally""" on the first.

But as seen in the code example in my original post, there's an issue with this that makes the model output the same sentence it translated the first time, over and over again.

This happens even when the target_prefix and the input sentence itself don't contain anything that could be translated as the first sentence (in the list). It seems like a bug (in my code or in the package); is there a chance I'm doing something wrong?

(I updated the code and output to make it a tiny bit less messy.)

eyalmazuz commented 1 year ago

What you can try is to finetune NLLB on the OpenSubtitles dataset using the "docify" transform.
How to use the docify transform: https://github.com/OpenNMT/OpenNMT-py/blob/master/docs/source/FAQ.md#context--doc-aware-transform
How to finetune NLLB: https://forum.opennmt.net/t/finetuning-and-curating-nllb-200-with-opennmt-py/5238/27

I'm not sure this is the solution I'm looking for; maybe with further explanation it could be. But OpenSubtitles is a very, very large corpus where the subtitles are not aligned and the quality is quite bad (I tried using it in the past for another model that translates from Arabic to Hebrew, and the results were a BLEU score of 5 at best).

And I want to keep the multilingual nature of NLLB, being able to translate between any pair of the 200 languages, so fine-tuning a context-aware model for a specific language pair is not really what I'm hoping for...

vince62s commented 1 year ago

As Guillaume said, NLLB has been trained on single sentence to single sentence, so what you are trying to do will be far from optimal. Even so, if you want to try, it has to be done as follows: "source_sent1. source_sent2" is your source, "target_sent1." is your target prefix, and you feed both to the model so it spits out target_sent2. Having said that, you need to pay attention to the language prefix and source suffix.
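For concreteness, a rough sketch of that scheme with the ctranslate2 Python API (reusing the translator, tokenizer and target_language from the script in the original post; prev_source, curr_source and prev_target_tokens are placeholder names, and the placement of the language code and "</s>" follows the tokenizer output shown earlier in this thread):

# Illustrative only: translate "source_sent1. source_sent2" while forcing the
# already-known translation of sent1 as the target prefix, then keep only the
# continuation, which should correspond to source_sent2.
source = prev_source + " " + curr_source
source_tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(source))

prefix = [target_language] + prev_target_tokens  # target-side tokens of sent1
results = translator.translate_batch([source_tokens], target_prefix=[prefix])

new_tokens = results[0].hypotheses[0][len(prefix):]
print(tokenizer.decode(tokenizer.convert_tokens_to_ids(new_tokens)))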

On the other subject: finetuning does not mean losing previous knowledge; it's just giving new information so that the model adapts (the same concept as instruction finetuning of LLMs).

guillaumekln commented 1 year ago

Your code snippet looks wrong because you should remove the target prefix from the result and only use the last translation as context for the next one.

However, you will then see that the last translation is often missing because the model usually wants to terminate the generation after the last punctuation mark from the prefix.

You can try setting min_decoding_length to len(target_prefix) + 1 to force the model to produce at least one token after the prefix.

this way the model doesn't need to waste effort translating the first one

The target prefix still needs to be forwarded through the decoder so the performance of translating with or without the target prefix is similar.
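Putting those suggestions together, a sketch of how the loop from the original post could be adjusted (still just an experiment on a model trained on single sentences, not an official recipe; translator, tokenizer, segments and target_language as defined in that script):

# Sketch: use only the previous segment as source context, force its translation
# as the target prefix, require at least one token beyond the prefix, and carry
# over only the newly generated tokens as the next context.
context = []        # target tokens of the previous segment only
prev_segment = ""
for segment in segments:
    source = (prev_segment + " " + segment).strip()
    source_tokens = [tokenizer.convert_ids_to_tokens(tokenizer.encode(source))]
    prefix = [target_language] + context

    results = translator.translate_batch(
        source_tokens,
        target_prefix=[prefix],
        beam_size=4,
        min_decoding_length=len(prefix) + 1,  # force at least one new token
    )

    new_tokens = results[0].hypotheses[0][len(prefix):]  # drop the forced prefix
    print(tokenizer.decode(tokenizer.convert_tokens_to_ids(new_tokens)))

    context = new_tokens
    prev_segment = segment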

eyalmazuz commented 1 year ago

Your code snippet looks wrong because you should remove the target prefix from the result and only use the last translation as context for the next one.

However, you will then see that the last translation is often missing because the model usually wants to terminate the generation after the last punctuation mark from the prefix.

You can try setting min_decoding_length to len(target_prefix) + 1 to force the model to produce at least one token after the prefix.

this way the model doesn't need to waste effort translating the first one

The target prefix still needs to be forwarded through the decoder so the performance of translating with or without the target prefix is similar.

Sorry for taking your time.

I tried setting min_decoding_length to len(target_prefix) + 1, and it did somewhat help in this case.

But when you say I should remove the target prefix from the result and only use the last translation as context for the next one, do you mean something like:

context = tl_tokens[len(context):]

instead of:

context = tl_tokens

edit: link to pastebin with the new output: https://pastebin.com/bAs1afL5 (instead of spamming the message itself)

which takes only the new part of the translated sentence and sets it as the context? I just want to make sure.

The problem I'm trying to solve is that some languages are highly contextual (like the Japanese I use in my examples), so translating sentence by sentence would produce off translations, rather than ones informed by some context.

As Guillaume said, NLLB has been trained on single sentence to single sentence, so what you are trying to do will be far from optimal. Even so, if you want to try, it has to be done as follows: "source_sent1. source_sent2" is your source, "target_sent1." is your target prefix, and you feed both to the model so it spits out target_sent2. Having said that, you need to pay attention to the language prefix and source suffix.

On the other subject: finetuning does not mean losing previous knowledge; it's just giving new information so that the model adapts (the same concept as instruction finetuning of LLMs).

I agree that finetuning doesn't mean losing previous knowledge, but if I take a model like NLLB and fine-tune it for a specific language, the weight updates have to come at a price, and that price will be slightly degraded performance for the languages I didn't finetune for. Is that wrong?

And fine-tuning a model like NLLB to be context/document aware would require me to have (200 choose 2) language-pair document corpora to use for training.

Also, a tangent question: in the link you provided for finetuning, he defines src_prefix: "</s> eng_Latn", but shouldn't it be src_suffix: "</s> eng_Latn" instead? From what I see from the tokenizer in my example, it adds that to the end of the sentence.

guillaumekln commented 1 year ago

context = tl_tokens[len(context):]

Yes, something like this.

Also, a tangent question: in the link you provided for finetuning, he defines src_prefix: "</s> eng_Latn", but shouldn't it be src_suffix: "</s> eng_Latn" instead?

It should be:

* `src_prefix: "eng_Latn"`
* `src_suffix: "</s>"`

eyalmazuz commented 1 year ago

context = tl_tokens[len(context):]

Yes, something like this.

Also, a tangent question: in the link you provided for finetuning, he defines src_prefix: "</s> eng_Latn", but shouldn't it be src_suffix: "</s> eng_Latn" instead?

It should be:

* `src_prefix: "eng_Latn"`

* `src_suffix: "</s>"`

Why? In the output example of my code, the NLLB tokenizer puts the language code token after the </s> token.

guillaumekln commented 1 year ago

See https://github.com/huggingface/transformers/pull/22313

eyalmazuz commented 1 year ago

See huggingface/transformers#22313

Ah, I see. Thank you very much. I updated the transformers library and I now get the new behavior.
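For reference, a quick way to double-check the layout after the update (an illustrative check only, assuming a transformers release that includes the fix from that PR; the exact subword split will vary):

import transformers

# With the fixed NLLB tokenizer, the source language code should come first
# and "</s>" last, i.e. the language code is a prefix and "</s>" a suffix.
tokenizer = transformers.AutoTokenizer.from_pretrained(
    "facebook/nllb-200-distilled-600M", src_lang="jpn_Jpan"
)
print(tokenizer.convert_ids_to_tokens(tokenizer.encode("蜂の群れが強くなります。")))
# expected layout: ['jpn_Jpan', <subword tokens>, '</s>']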

I think for now I'll close the issue, and if I have anything else I'll open a new issue if needed.

Thank you both for your detailed answers.