facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
MIT License

NLLB sentence trimming #5108

Open alberto-solano opened 1 year ago

alberto-solano commented 1 year ago

Hi! I am using the NLLB models for the first time and I am having some trouble translating complete documents. I am following the same structure as the Hugging Face tutorial (https://huggingface.co/docs/transformers/model_doc/nllb); the code is at the bottom of my question.

What I am seeing is that sometimes the translation of a complete paragraph is cut off at a certain point, normally where one sentence ends and the next sentence is left untranslated. Here is an example:

Original sentence -> "Las Conclusiones más recientes, del 21 de octubre de 2020, instaban además a afrontar la opacidad, la complejidad, el sesgo, cierto grado de imprevisibilidad y un comportamiento parcialmente autónomo de ciertos sistemas de IA, para garantizar su compatibilidad con los derechos fundamentales y facilitar la aplicación de las normas jurídicas. El Parlamento Europeo también ha llevado a cabo una gran labor en el ámbito de la IA."

Translated sentence -> "The most recent conclusions of 21 October 2020 also call for the opacity, complexity, complexity, degree of unpredictability and partially autonomous behaviour of certain AI systems to be addressed, to ensure their compatibility with fundamental rights and facilitate the application of legal standards."

Here the last sentence is skipped.

I have set max_tokens to 100, but I understood that the maximum number of tokens for the encoder and decoder is 512, so I would not expect a problem related to exceeding some length limit, and I don't know what causes this behaviour. I have also noticed bad behaviour when the text is not well preprocessed (e.g. doubled punctuation signs), but in this case, as in others I have seen, there is no problem with the input string I am passing to the model. Here is the code:

import os
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "nllb-200-distilled-1.3B"
text_lang = "esp_Latn"
text = "Las Conclusiones más recientes, del 21 de octubre de 2020, instaban además a afrontar la opacidad, la complejidad, el sesgo, cierto grado de imprevisibilidad y un comportamiento parcialmente autónomo de ciertos sistemas de IA, para garantizar su compatibilidad con los derechos fundamentales y facilitar la aplicación de las normas jurídicas. El Parlamento Europeo también ha llevado a cabo una gran labor en el ámbito de la IA."
max_tokens = 100

# obtain the tokenizer associated with the detected text language
tokenizer = AutoTokenizer.from_pretrained(os.path.join("facebook/", model_name), src_lang=text_lang)
# obtain a list of untranslated text blocks in which each element does not
# exceed a certain token length
# (convert_text, original_text_list and models_path are defined elsewhere in my script)
text = convert_text(original_text_list, tokenizer, max_tokens)
# load the model from local storage
model = AutoModelForSeq2SeqLM.from_pretrained(os.path.join(models_path, f"{model_name}"), use_auth_token=True)
# list in which the translated text will be inserted
translated_text = []

# translation:
# get input tokens
input_text_block_tokens = tokenizer(text, return_tensors="pt")
# translate tokens using the model
translated_text_block_tokens = model.generate(
    **input_text_block_tokens, forced_bos_token_id=tokenizer.lang_code_to_id["eng_Latn"], max_length=300
)
# decode the translated tokens into the target language
translated_text_block = tokenizer.batch_decode(translated_text_block_tokens, skip_special_tokens=True)[0]
# append each translated block
translated_text.append(translated_text_block)
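One thing worth checking here, as a minimal diagnostic sketch that simply reuses the variables from the snippet above (not part of the original report): if the number of generated tokens is at or near the max_length passed to generate(), the translation was cut off by the generation limit rather than silently dropped by the model.

# rough check, assuming the variables from the snippet above are in scope:
# compare input and output lengths against max_length=300
n_input_tokens = input_text_block_tokens["input_ids"].shape[-1]
n_output_tokens = translated_text_block_tokens.shape[-1]
print(f"input tokens: {n_input_tokens}, generated tokens: {n_output_tokens} (max_length=300)")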

Thank you in advance!

winlinvip commented 1 year ago

I utilize this method to divide the text into paragraphs and sentences, translating only one sentence at a time.

# split the text into paragraphs, then into sentences, and translate one sentence at a time
paragraphs = text.split('\n')
for paragraph in paragraphs:
    sentences = paragraph.split('.')
    for sentence in sentences:
        if not sentence.strip():  # skip empty fragments left over from the split
            continue
        output = translator(sentence, max_length=128)
        translated_text = output[0]['translation_text']
        print(translated_text, end=' ')
    print('')
ritwikmishra commented 1 year ago

If the sentences in your text are not neatly separated by new lines (\n), then I would recommend this library to automatically perform sentence segmentation across various languages (without the headache of language identification).
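The specific library isn't linked above, so purely as a rough illustration of the overall pattern (using a naive regex splitter as a stand-in for a proper multilingual segmenter, and reusing the tokenizer and model from the first snippet):

import re

def split_sentences(text):
    # naive stand-in for a real segmentation library: split after ., !, ? or 。
    return [s.strip() for s in re.split(r"(?<=[.!?。])\s+", text) if s.strip()]

translated = []
for sentence in split_sentences(text):
    inputs = tokenizer(sentence, return_tensors="pt")
    outputs = model.generate(
        **inputs, forced_bos_token_id=tokenizer.lang_code_to_id["eng_Latn"], max_length=256
    )
    translated.append(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
print(" ".join(translated))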

rio5050 commented 1 year ago

I ran into this problem too when translating Japanese to Chinese, but in my case there is no stop symbol in the sentence.