Unbabel / COMET

A Neural Framework for MT Evaluation
https://unbabel.github.io/COMET/html/index.html
Apache License 2.0

[INPUT] Text Length of Input (source, reference, and hypothesis) #209

Open · foreveronehundred opened this issue 3 months ago

foreveronehundred commented 3 months ago

For translation quality estimation with COMET, I think there is no limit on the input length. However, I suspect the estimate will not be accurate if the text is too long.

So, what text length (of source, reference, and hypothesis) do you recommend?

ricardorei commented 3 months ago

Hi @foreveronehundred! The code does not break when running very large segments, BUT the models truncate the input if it goes above 512 tokens. For models like wmt22-cometkiwi-da, the input is shared between source and translation, which means that the total number of tokens from source and translation should not exceed 512.

Still, 512 tokens is a long input. It's more than enough to feed several sentences together and evaluate entire paragraphs, though probably not enough for an entire two-page document.
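
For reference, a minimal sketch of how such a multi-sentence segment can be scored with COMET's Python API, assuming the comet package is installed and the Unbabel/wmt22-cometkiwi-da checkpoint is accessible (it may require accepting the model license on Hugging Face and logging in):

from comet import download_model, load_from_checkpoint

# Download (or reuse a cached copy of) the reference-free CometKiwi model.
model_path = download_model("Unbabel/wmt22-cometkiwi-da")
model = load_from_checkpoint(model_path)

# QE models take only "src" and "mt"; a whole paragraph fits in one segment
# as long as the joint input stays under the 512-token limit.
data = [
    {
        "src": "Hello, how are you? This is a test sentence.",
        "mt": "Bonjour, comment ça va ? Ceci est une phrase de test.",
    }
]

model_output = model.predict(data, batch_size=8, gpus=1)  # use gpus=0 to run on CPU
print(model_output.scores)        # one score per segment
print(model_output.system_score)  # average over all segments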

A quick way to test whether an input fits is to tokenize both sides and count the tokens:

from transformers import XLMRobertaTokenizer
source = ["Hello, how are you?", "This is a test sentence."]
translations = ["Bonjour, comment ça va?", "Ceci est une phrase de test."]
# This is the same for most COMET models
tokenizer = XLMRobertaTokenizer.from_pretrained("xlm-roberta-base")
# Tokenize and count tokens for each pair
for src, trans in zip(source, translations):
    # Tokenize sentences
    src_tokens = tokenizer.encode(src, add_special_tokens=False)
    trans_tokens = tokenizer.encode(trans, add_special_tokens=False)

    # Jointly encode and count tokens
    joint_tokens = tokenizer.encode(src, trans, add_special_tokens=True, truncation=True)

    # Output token counts
    print(f"Source: {src}\nTranslation: {trans}")
    print(f"Source tokens: {len(src_tokens)}")
    print(f"Translation tokens: {len(trans_tokens)}")
    print(f"Jointly encoded tokens: {len(joint_tokens)}")
    print("="*30)
foreveronehundred commented 3 months ago

Thanks for the reply. I think that length is enough for general cases. By the way, I would like to know the token lengths of the training data. Could you share some statistics (mean, STD, etc.)?
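
While waiting for the official numbers, a small sketch for computing such statistics (mean, STD, max) on any corpus with the same tokenizer as above; the two example pairs are placeholders for real data:

import numpy as np
from transformers import XLMRobertaTokenizer

tokenizer = XLMRobertaTokenizer.from_pretrained("xlm-roberta-base")

# Placeholder corpus: replace with your own source/translation pairs.
sources = ["Hello, how are you?", "This is a test sentence."]
translations = ["Bonjour, comment ça va?", "Ceci est une phrase de test."]

# Joint token count per pair, i.e. what the model actually sees.
lengths = [
    len(tokenizer.encode(src, trans, add_special_tokens=True))
    for src, trans in zip(sources, translations)
]

print(f"Mean: {np.mean(lengths):.1f}")
print(f"STD:  {np.std(lengths):.1f}")
print(f"Max:  {max(lengths)}")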