Closed minghao-wu closed 1 year ago
Hi @minghao-wu, COMET truncates inputs that are too long. The Hugging Face tokenizer already does that for you if you pass `truncation=True`. You can check the code here
Also, since we are using a triplet encoder architecture, the source, MT and reference each have a capacity of 512 tokens (1536 tokens in total).
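As a rough illustration of what this means for a long document, here is a toy sketch. It is not COMET's actual code: a whitespace split stands in for the real subword tokenizer, and `truncate`, `truncate_triplet` and `MAX_LEN` are hypothetical names. Each of the three inputs is cut independently at 512 tokens, and everything beyond that is dropped without an error:

```python
MAX_LEN = 512  # per-input capacity mentioned above


def truncate(text, max_len=MAX_LEN):
    """Keep only the first max_len whitespace tokens (a stand-in for subword tokens)."""
    return text.split()[:max_len]


def truncate_triplet(src, mt, ref, max_len=MAX_LEN):
    """Truncate source, MT and reference independently of each other."""
    inputs = (("src", src), ("mt", mt), ("ref", ref))
    return {name: truncate(text, max_len) for name, text in inputs}


# A 3000-"word" document used as both source and MT, with a short reference.
doc = " ".join(f"w{i}" for i in range(3000))
out = truncate_triplet(doc, doc, "short reference")
print({name: len(tokens) for name, tokens in out.items()})
# prints {'src': 512, 'mt': 512, 'ref': 2}
```

With a real subword tokenizer the cut-off arrives earlier than 512 words, since a word can map to several tokens and special tokens also count toward the limit.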
In my experience with typical MT testsets this is enough. What is your use case exactly?
Hi @ricardorei ,
Thank you for your answer.
Yes, 512 tokens is typically enough for a single sentence. However, I am working on DocNMT and trying to use COMET as a document-level measure by concatenating all the source sentences, references and hypotheses within the same document. For example, one of the common benchmark datasets in DocNMT is IWSLT2017 En-De, where documents in the test set are typically longer than 1000 words. Do you have any suggestions for this situation?
You can take a look at this work
They extended COMET and other metrics to take into account context.
You still need to score each individual segment within the document, but for each segment the score is computed with the previous sentences as context, which makes it more context-sensitive.
I want to integrate that into the next release of COMET but for now I think you should take a look at their work.
You can look at their paper here: Embarrassingly Easy Document-Level MT Metrics: How to Convert Any Pretrained Metric Into a Document-Level Metric
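A minimal sketch of the general idea, assuming a fixed window of preceding sentences joined with a separator string. The function name `add_context`, the window size and the separator are illustrative assumptions, not the paper's or COMET's API:

```python
def add_context(sentences, window=2, sep=" </s> "):
    """Prepend up to `window` preceding sentences of the same document to each sentence."""
    augmented = []
    for i, sent in enumerate(sentences):
        context = sentences[max(0, i - window):i]  # empty for the first sentence
        augmented.append(sep.join(context + [sent]))
    return augmented


doc_src = ["First sentence.", "Second sentence.", "Third sentence.", "Fourth sentence."]
for line in add_context(doc_src):
    print(line)
```

Each augmented segment would still be scored on its own, so the context plus the current sentence has to fit within the 512-token limit per input.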
Thank you for your suggestion. I will have a look.
@ricardorei are you planning to add document-level context to COMET sometime soon?
same question
❓ Questions and Help
Before asking:
I briefly read through the source code and didn't find the answer.
What is your question?
I attempted to apply COMET to some excessively long document pairs (more than 3000 words for both the source and target sequences). I didn't get any error when doing so. How does COMET process such long sequences?
Code
#### What have you tried?
#### What's your environment?
- OS: Linux
- Packaging: pip
- Version: 1.1.3