facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
MIT License
30.59k stars 6.41k forks

NLLB is unable to translate into a complete long sentence in Chinese. #5549

Open logicvv opened 1 month ago

logicvv commented 1 month ago

🐛 Bug

Hi, I tried using NLLB to translate some English sentences into Chinese. All of my sentences are shorter than 60 tokens. However, most sentences longer than 30 tokens are not translated completely: only half of the sentence, or less, is generated.
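The token counts mentioned above can be checked before translating. The following is a minimal sketch, not part of the original script: `token_lengths` and `over_limit` are hypothetical helper names, and `tokenize` is assumed to be any callable that maps a string to a list of tokens.

```python
from typing import Callable, Iterable, List, Sequence, Tuple


def token_lengths(
    sentences: Iterable[str],
    tokenize: Callable[[str], Sequence],
) -> List[Tuple[int, str]]:
    """Return (token_count, sentence) pairs, skipping blank lines."""
    result = []
    for sentence in sentences:
        text = sentence.strip()
        if text:
            result.append((len(tokenize(text)), text))
    return result


def over_limit(
    sentences: Iterable[str],
    tokenize: Callable[[str], Sequence],
    limit: int = 30,
) -> List[Tuple[int, str]]:
    """Keep only the sentences whose token count exceeds `limit`."""
    return [pair for pair in token_lengths(sentences, tokenize) if pair[0] > limit]
```

With a Hugging Face tokenizer, one could pass `lambda s: tokenizer(s)["input_ids"]` as `tokenize`; the count then includes any special tokens the tokenizer adds. For a quick test without a model, `str.split` works as a rough stand-in.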

I also ran the same code for English to French, and it works: every sentence is translated completely.

I also tried setting min_length, but then, for a short sentence, the last part of the sentence is sometimes generated repeatedly. My code is here, please help:

    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(
        r"nllb-200-distilled-600M", token=True, src_lang="eng_Latn"
    )
    model = AutoModelForSeq2SeqLM.from_pretrained(r"nllb-200-distilled-600M", token=True)

    input_path = r"eng_test_short.txt"
    output_path = "./nllb_chn.txt"

    # ID of the target-language token that generation is forced to start with.
    tgt_lang_id = tokenizer.convert_tokens_to_ids("zho_Hans")
    print(tgt_lang_id)

    # Translate the input file line by line and write one translation per line.
    with open(input_path, "r", encoding="utf-8") as input_file, \
            open(output_path, "w", encoding="utf-8") as f:
        for article in input_file:
            print(article)
            inputs = tokenizer(article, return_tensors="pt")
            # print(inputs)
            translated_tokens = model.generate(
                # **inputs, forced_bos_token_id=tokenizer.convert_tokens_to_ids("fra_Latn"), max_length=200
                **inputs, forced_bos_token_id=tgt_lang_id, max_length=512
            )
            output = tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0]
            print(output)
            f.write(output + "\n")

The output looks like this:

input: Politicians are loath to raise the tax even one penny when gas prices are high.

output: 政客们不愿意在高昂的燃油价格时,
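One way to narrow this down is to check whether the generated sequence ends with the end-of-sequence token. If it does, the model chose to stop by itself; if it does not, generation was cut off by a length limit such as max_length. The sketch below is an illustration only, and `stopped_on_eos` is a hypothetical helper name, not a transformers function.

```python
from typing import Optional, Sequence


def stopped_on_eos(
    token_ids: Sequence[int],
    eos_token_id: int,
    pad_token_id: Optional[int] = None,
) -> bool:
    """Return True if the last non-padding token is the EOS token."""
    ids = list(token_ids)
    # Batched generation may right-pad shorter sequences; drop that padding first.
    if pad_token_id is not None and pad_token_id != eos_token_id:
        while ids and ids[-1] == pad_token_id:
            ids.pop()
    return bool(ids) and ids[-1] == eos_token_id
```

In the script above this could be called as `stopped_on_eos(translated_tokens[0].tolist(), tokenizer.eos_token_id, tokenizer.pad_token_id)`. A result of `True` on a truncated translation would suggest that max_length is not the cause.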

LiPengtao0504 commented 1 month ago

I also encountered this problem.

Src:"We now have 4-month-old mice that are non-diabetic that used to be diabetic," he added.

Tgt:他补充道:“我们现在有4个月大没有糖尿病的老鼠,但它们曾经得过该病。”

Predict:他补充说:"我们现在有4个月的小鼠,
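To find how many outputs in a larger test file are affected, truncated translations like the ones above can be flagged automatically. This is a rough heuristic sketch, assuming that a complete sentence ends in terminal punctuation, optionally followed by closing quotes or brackets; `looks_truncated` is a hypothetical helper name, and the punctuation sets are assumptions that may need extending.

```python
# Characters assumed to end a complete sentence (Chinese and Latin).
TERMINAL = "。！？!?.…"
# Closing quotes and brackets that may follow the terminal punctuation.
CLOSERS = "\"'”’」』）)"


def looks_truncated(text: str) -> bool:
    """Return True if `text` does not end in terminal punctuation."""
    stripped = text.strip().rstrip(CLOSERS)
    return not stripped or stripped[-1] not in TERMINAL
```

Applied to the examples in this thread, both truncated predictions end in a comma and are flagged, while the reference translation is not. The check cannot detect a translation that drops content but still ends with a full stop.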