facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
MIT License

How can I avoid duplicated tokens during translation? #4639

Open lena-kru opened 2 years ago

lena-kru commented 2 years ago

I'm trying to translate some texts, and sometimes I get really unexpected results.

For example, I try to translate this text:

text = """самописное по\nдобрый день, просьба добавить в исключение файл (прикреплен). возможности изменить самописное по нет."""

And it gives me:

самописное по
добрый день, просьба добавить в исключение файл (прикреплен).  --->  I'm sorry, but I'm sorry, but I'm sorry, but I'm sorry, but I'm sorry, but I'm sorry, but I'm sorry, but I'm sorry, but I'm sorry, but I'm sorry, but I'm sorry, but I'm sorry, but I'm sorry, but I'm sorry, but I'm sorry, but I'm sorry, but I'm sorry, but I'm sorry, but I'm sorry, but I'm sorry, but I'm sorry, but I'm sorry, but I'm sorry, but I'm sorry, but I'm sorry, but I'm sorry, but I'm sorry, but I'm sorry, but I'm sorry, but I'm sorry, but I'm sorry, but I'm sorry, but I'm sorry, but I'm sorry, but I'm sorry, but I'm sorry, but I'm sorry, but I'm sorry, and I'm sorry, and I'm sorry, but I'm sorry, but I'm sorry, and I'm sorry to be so sorry
возможности изменить самописное по нет.  --->  I'm not sure I can change the self-publishing.

Code for reproducing it:

from nltk.tokenize.punkt import PunktSentenceTokenizer
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, pipeline

sent_tokenizer = PunktSentenceTokenizer()
tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M")

translation_pipeline = pipeline(
    "translation",
    model=model,
    tokenizer=tokenizer,
    src_lang="rus_Cyrl",
    tgt_lang="eng_Latn",
    max_length=5000,
)

for sent in sent_tokenizer.tokenize(text):
    print(sent, ' ---> ', translation_pipeline(sent)[0]['translation_text'])

Python version: 3.8.13, transformers: 4.21.1

Fikavec commented 2 years ago

@ldevyataykina read https://huggingface.co/blog/how-to-generate and try changing num_beams, no_repeat_ngram_size, and other parameters from the article.

translation_pipeline = pipeline(
    "translation",
    model=model,
    tokenizer=tokenizer,
    src_lang="rus_Cyrl",
    tgt_lang="eng_Latn",
    max_length=512,
    num_beams=5,
)

for sent in sent_tokenizer.tokenize(text):
    print(sent, ' ---> ', translation_pipeline(sent)[0]['translation_text'])

Output:

самописное по добрый день, просьба добавить в исключение файл (прикреплен).  --->  self-published on good day, please add the file to the exclusion (attached).
возможности изменить самописное по нет.  --->  I don't have the ability to change the self-publishing.

P.S. Don't set max_length to more than 512 tokens.
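If the repetition still shows up with beam search alone, no_repeat_ngram_size (also covered in the linked article) can be tried as well. A minimal sketch, assuming the pipeline forwards extra keyword arguments to model.generate() as recent transformers versions do; the specific values here are illustrative, not taken from this thread:

# Sketch: pass decoding parameters per call instead of at pipeline construction.
# no_repeat_ngram_size=3 prevents any trigram (e.g. "I'm sorry, but") from repeating.
for sent in sent_tokenizer.tokenize(text):
    result = translation_pipeline(
        sent,
        max_length=512,          # stay within the recommended 512-token limit
        num_beams=5,
        no_repeat_ngram_size=3,
    )
    print(sent, ' ---> ', result[0]['translation_text'])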