facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.

Hallucination with numbers: NLLB English to Spanish translation #4854

Open evilc3 opened 1 year ago

evilc3 commented 1 year ago

Incorrect translation when translating English to Spanish.

When the source input contains only numbers and no letters, the translated output is completely incorrect. Please see the examples below.

Model used: NLLB-200-distilled-600M.

##### Output from facebook/nllb-200-distilled-600M

print(predict('1'))
output: ['El 1']

print(predict('1 2 3 4 5 6 7 8 9 10'))
output: ['2 3 4 5 6 7 8 9 10']

print(predict('102'))
output: ['El número de personas']
(from Google es-en translation: El número de personas -> "the number of people")

print(predict('6171-1231-1311-1231'))
output: ['El número de personas que se encuentran en el mercado']
(from Google es-en translation: El número de personas que se encuentran en el mercado -> "The number of people in the market")
##### Slight improvement in output when words are provided as context

print(predict('it\'s 1'))
output: ['Es un 1']

print(predict('we count number : 1 2 3 4 5 6 7 8 9 10'))
output: ['Cuentan el número: 1 2 3 4 5 6 7 8 9 10']

print(predict('hey its a 102'))
output: ['Oye, es un 102']

print(predict('your code is 6171-1231-1311-1231'))
output: ['Su código es 6171-1231-1311-1231']
##### Output from facebook/nllb-200-1.3B

print(predict('1'))
output: ['El 1 de']

print(predict('1 2 3 4 5 6 7 8 9 10'))
output: ['1 2 3 4 5 6 7 8 9 10']

print(predict('102'))
output: ['102 y']

print(predict('6171-1231-1311-1231'))
output: ['6171-1231-1311-1231 El número de personas']

print(predict('it\'s 1'))
output: ['Es 1']

print(predict('we count number : 1 2 3 4 5 6 7 8 9 10'))
output: ['contamos el número: 1 2 3 4 5 6 7 8 9 10']

print(predict('hey its a 102'))
output: ['Es un 102.']

print(predict('your code is 6171-1231-1311-1231'))
output: ['Su código es 6171-1231-1311-1231']

This problem seems to persist when using the bigger model.

Steps to reproduce the behavior

Code sample

import time
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from transformers import NllbTokenizerFast

# model_name = "facebook/nllb-200-1.3B"
model_name = "facebook/nllb-200-distilled-600M"

# tokenizer = AutoTokenizer.from_pretrained(model_name)
print('Loading Model')
t1 = time.time()
tokenizer = NllbTokenizerFast.from_pretrained(model_name, src_lang="eng_Latn", tgt_lang="spa_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
print('Time taken : ', time.time() - t1)

def predict(x):
    inputs = tokenizer(x, return_tensors="pt", padding=True)

    translated_tokens = model.generate(
        **inputs, forced_bos_token_id=tokenizer.lang_code_to_id["spa_Latn"], max_length=100
    )

    return tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)
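
For reference, the outputs listed above were produced by calling this helper directly on the raw strings, e.g.:

print(predict('102'))
# with facebook/nllb-200-distilled-600M this returned ['El número de personas']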

Environment

Possible corrections

  1. Convert all numbers to words before translation; this seems to help (see the sketch below).
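
A minimal sketch of that workaround, assuming the third-party num2words package is installed (the helper name numbers_to_words is made up for illustration):

import re
from num2words import num2words  # third-party package, assumed to be available

def numbers_to_words(text, lang="en"):
    # Replace every run of digits with its spelled-out form, e.g. "102" -> "one hundred and two".
    return re.sub(r"\d+", lambda m: num2words(int(m.group()), lang=lang), text)

# numbers_to_words("hey its a 102") -> "hey its a one hundred and two"
# The spelled-out text can then be passed to predict() as usual.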
evilc3 commented 1 year ago

anyone?

jrobble commented 1 month ago

Can confirm a similar issue. Here's our code:

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M",
                                          use_auth_token=False, local_files_only=True, src_lang="spa_Latn")

model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M",
                                              use_auth_token=False, local_files_only=True)

article = "123"
inputs = tokenizer(article, return_tensors="pt")

translated_tokens = model.generate(
    **inputs, forced_bos_token_id=tokenizer.encode("eng_Latn")[1], max_length=30)

output = tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0]

print(output)

Output:

The Commission shall adopt implementing acts.

When changing to src_lang="tgk_Cyrl", the output is: 123 What is the meaning of life?
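
As a stopgap, one option is to skip translation entirely when the input contains no letters, since such strings should normally be copied through verbatim anyway. A minimal sketch, reusing the tokenizer and model from the snippet above (the helper name safe_translate is hypothetical):

import re

def safe_translate(text):
    # If the input contains no alphabetic characters (e.g. "123" or "6171-1231-1311-1231"),
    # return it unchanged instead of letting the model hallucinate a translation.
    if not re.search(r"[^\W\d_]", text):
        return text
    inputs = tokenizer(text, return_tensors="pt")
    translated_tokens = model.generate(
        **inputs, forced_bos_token_id=tokenizer.encode("eng_Latn")[1], max_length=30)
    return tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0]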