Open csrednicki opened 6 months ago
In the official Seamless implementation (based on fairseq2), there is currently no recommended way to force the model to include some particular word in the output.
In the Hugging Face transformers implementation, though, you can use positively constrained beam search (https://huggingface.co/blog/constrained-beam-search) to enforce that.
import torch
from transformers import SeamlessM4TForTextToText, SeamlessM4TTokenizer
model = SeamlessM4TForTextToText.from_pretrained("facebook/hf-seamless-m4t-medium").to('cuda')
tokenizer = SeamlessM4TTokenizer.from_pretrained(
    "facebook/hf-seamless-m4t-medium", src_lang="eng", tgt_lang="pol"
)
# Default translation does not produce the special word (because the model was never trained this way!)
text = "This cat belongs to the <breed> domestic cat breed."
inputs = tokenizer(text, return_tensors='pt').to(model.device)
output_tokens = model.generate(**inputs, num_beams=5, tgt_lang="pol")
print(tokenizer.decode(output_tokens[0], skip_special_tokens=True))
# Ten kot należy do rasy kotów domowych.
# However, we can still force this word
force_words_ids = tokenizer(['<breed>'], add_special_tokens=False).input_ids
print(force_words_ids)
# [[45, 9653, 76, 248123]]
output_tokens = model.generate(**inputs, num_beams=5, tgt_lang="pol", force_words_ids=[force_words_ids])
print(tokenizer.decode(output_tokens[0], skip_special_tokens=True))
# Ten kot należy do rasy kotów domowych <breed>.
I don't speak Polish, so I am not sure whether putting <breed> at the end is grammatical. But a phrase like "Ten kot należy do rasy kotów domowych Maine Coon." intuitively seems reasonable.
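To check programmatically whether a placeholder survived translation, something like the following sketch could be used. The helper name `missing_placeholders` and the assumption that placeholders always look like `<name>` are mine; they are not part of transformers or Seamless.

```python
import re

# Hypothetical helper (not part of transformers or Seamless).
# Assumes placeholders are written as <name>, e.g. <breed>.
PLACEHOLDER = re.compile(r"<[A-Za-z_][A-Za-z0-9_]*>")

def missing_placeholders(source, translation):
    """Return placeholders that occur in the source but not in the translation."""
    return [tag for tag in PLACEHOLDER.findall(source) if tag not in translation]

source = "This cat belongs to the <breed> domestic cat breed."
print(missing_placeholders(source, "Ten kot należy do rasy kotów domowych."))
# ['<breed>']
print(missing_placeholders(source, "Ten kot należy do rasy kotów domowych <breed>."))
# []
```

A non-empty result means the translation dropped at least one placeholder, so it may be worth retrying with force_words_ids.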
Please note that force_words_ids may not work properly if num_beams is less than or equal to the number of tokens in the forced word.
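The num_beams caveat can be turned into a quick sanity check before calling generate. This is only a sketch of that rule of thumb: `beams_sufficient` is a hypothetical helper, not a transformers function, and the token ids are the ones printed in the example above.

```python
def beams_sufficient(force_words_ids, num_beams):
    """Return True if num_beams exceeds the token count of every forced word."""
    longest = max(len(word_ids) for word_ids in force_words_ids)
    return num_beams > longest

# Token ids for '<breed>' as printed in the example above (4 tokens).
force_words_ids = [[45, 9653, 76, 248123]]
print(beams_sufficient(force_words_ids, num_beams=5))  # True
print(beams_sufficient(force_words_ids, num_beams=4))  # False
```

If the check returns False, increasing num_beams before generating is probably the simplest remedy.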
So I have a sentence like:
This cat belongs to the <breed> domestic cat breed.
Currently, the word <breed> is removed during translation.
Current translation:
Ten kot należy do rasy kotów domowych.
My expectation:
Ten kot należy do rasy <breed> kotów domowych.
Is there some special tag/token that I can use to preserve the original data in the translated output?
This is my code: