Unbabel / COMET

A Neural Framework for MT Evaluation
https://unbabel.github.io/COMET/html/index.html
Apache License 2.0
441 stars, 72 forks

Mismatch between error span and offsets #179

Closed danieldeutsch closed 6 months ago

danieldeutsch commented 8 months ago

Hello,

I am using the XCOMET-XL model to predict error spans, and I noticed that the span offsets don't always match the text of the span. For example:

{
    "src": "According to Press TV, the Ministry announced on Saturday that 58 Iranians died from the disease, noting that out of the new cases detected over the past 24 hours, 286 patients were admitted to hospital.",
    "mt": "Laut Press TV gab das Ministerium am Samstag bekannt, dass 58 Iraner an der Krankheit gestorben sind, und stellte fest, dass von den neuen Fällen, die in den letzten 24 Stunden entdeckt wurden, 286 Patienten ins Krankenhaus eingeliefert wurden.",
    "ref": "Laut Press TV meldete das Ministerium am Samstag, dass 58 Iraner an der Krankheit verstorben waren, und gab an, dass unter den festgestellten neuen Fällen in den letzten 24 Stunden 286 Patienten ins Krankenhaus kamen.",
    "COMET": 0.990031361579895,
    "errors": [
        {
            "text": "eingeliefert",
            "confidence": 0.4963693618774414,
            "severity": "minor",
            "start": 223,
            "end": 236
        }
    ]
}

The start and end offsets include the whitespace before the token, but the text doesn't:

text = "Laut Press TV gab das Ministerium am Samstag bekannt, dass 58 Iraner an der Krankheit gestorben sind, und stellte fest, dass von den neuen Fällen, die in den letzten 24 Stunden entdeckt wurden, 286 Patienten ins Krankenhaus eingeliefert wurden."
start, end = 223, 236
print(repr(text[start:end]))
# ' eingeliefert'

Every time I see this, the span offsets include whitespace at the beginning of the span, but the text does not. Maybe when you decode the token IDs of the span and the first token is whitespace, it gets removed?
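A quick way to check this across many segments is to compare each span's "text" with the slice selected by its offsets. This is a minimal sketch, assuming the output has the shape shown above; `classify_span` is a hypothetical helper, not part of COMET.

```python
def classify_span(mt: str, error: dict) -> str:
    """Compare an error's "text" field with the slice its offsets select."""
    sliced = mt[error["start"]:error["end"]]
    if sliced == error["text"]:
        return "exact"
    # Same content once surrounding whitespace is ignored
    if sliced.strip() == error["text"].strip():
        return "whitespace-only"
    return "different"


mt = "Laut Press TV gab das Ministerium am Samstag bekannt, dass 58 Iraner an der Krankheit gestorben sind, und stellte fest, dass von den neuen Fällen, die in den letzten 24 Stunden entdeckt wurden, 286 Patienten ins Krankenhaus eingeliefert wurden."
error = {"text": "eingeliefert", "start": 223, "end": 236}

print(classify_span(mt, error))  # whitespace-only
```

Spans that come back as "different" are the ones worth inspecting by hand.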

Thanks!

danieldeutsch commented 8 months ago

I did find examples where the offsets and text differ beyond leading whitespace, and I think it's due to weird characters in the input:

{
    "src": "高龄老人坐着面包车从南方到东北看雪??",
    "mt": "The elderly are watching the snow from the south to the northeast in a van? ?",
    "ref": "An elderly goes to watch the snow in the northeast from the south in a minibus? ?",
    "COMET": 0.8957364559173584,
    "errors": [
        {
            "text": "ly are watching",
            "confidence": 0.39867228269577026,
            "severity": "minor",
            "start": 9,
            "end": 24
        },
        {
            "text": "to",
            "confidence": 0.3982390761375427,
            "severity": "minor",
            "start": 48,
            "end": 51
        },
        {
            "text": "van??",
            "confidence": 0.4210003614425659,
            "severity": "minor",
            "start": 70,
            "end": 77
        }
    ]
}

The second question mark in the mt text is a weird character. In the span text, it gets normalized to a normal question mark, and the space before it is dropped.

{
    "src": "There's a circularity to it...",
    "mt": "Darin besteht eine Zirkularität …",
    "ref": "Das ist ein richtiger Kreislauf...",
    "COMET": 0.9912000298500061,
    "errors": [
        {
            "text": "Darin besteht eine Zirkularität...",
            "confidence": 0.5221754312515259,
            "severity": "minor",
            "start": 0,
            "end": 33
        }
    ]
}

In the mt text, there's a space followed by a single ellipsis character (…, U+2026), but the span text removes the whitespace and uses 3 separate period characters.
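Both character changes are consistent with Unicode compatibility normalization (NFKC), which some tokenizers apply before encoding. Whether the XCOMET tokenizer does this is an assumption here, and so is the guess that the "weird" question mark is the fullwidth form U+FF1F. A small sketch using only the standard library:

```python
import unicodedata

ellipsis = "\u2026"            # single-character horizontal ellipsis
fullwidth_question = "\uff1f"  # ASSUMED identity of the "weird" question mark

print(unicodedata.normalize("NFKC", ellipsis))            # ...
print(unicodedata.normalize("NFKC", fullwidth_question))  # ?

# The mt string from the example, with the ellipsis written as an escape
mt = "Darin besteht eine Zirkularit\u00e4t \u2026"
normalized = unicodedata.normalize("NFKC", mt)
print(len(mt), len(normalized))  # 33 35
```

Normalization changes the string length, so offsets computed on the normalized text would no longer line up with the original. NFKC alone does not remove the space before the ellipsis, so the missing whitespace in the span text needs a separate explanation.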

I assume the offsets are ok to use. I only ran into this because I was verifying that COMET doesn't change the source text when it makes span predictions. This is important when you evaluate the spans so you can directly compare the predicted spans to the MQM spans. If the text is edited, the mapping might not be correct.
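If the offsets index into the unmodified mt string, spans can be compared purely by character position, ignoring the "text" field. A minimal sketch of that idea follows; the gold span is invented for illustration, and `char_set` and `char_overlap` are hypothetical helpers rather than an official MQM scoring routine.

```python
def char_set(spans):
    """Expand (start, end) offset pairs into a set of character positions."""
    return {i for start, end in spans for i in range(start, end)}


def char_overlap(predicted, gold):
    """Character-level precision and recall between two lists of spans."""
    pred_chars, gold_chars = char_set(predicted), char_set(gold)
    common = len(pred_chars & gold_chars)
    precision = common / len(pred_chars) if pred_chars else 0.0
    recall = common / len(gold_chars) if gold_chars else 0.0
    return precision, recall


predicted = [(223, 236)]  # offsets from the first example: " eingeliefert"
gold = [(224, 236)]       # HYPOTHETICAL gold span without the leading space

precision, recall = char_overlap(predicted, gold)
print(precision, recall)
```

Here the leading space costs a little precision while recall stays at 1.0, so whether to trim whitespace before scoring is a choice worth making explicitly.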

ricardorei commented 7 months ago

Hi @danieldeutsch, you are right! The offsets are correct, and the "text" field is mostly informative. I get the text field by detokenizing the token IDs belonging to a span. However, if you detokenize only part of the original input, you might get slightly different output. Whitespace is a good example, as it is sometimes encoded as part of a token, with the prefix ▁
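The effect can be reproduced with a toy SentencePiece-style decoder. This is only an illustration and not COMET's code: `decode` is a hypothetical stand-in that turns the "▁" whitespace marker into a space and strips leading whitespace from the result, and the token split below is made up.

```python
MARK = "\u2581"  # "▁", the whitespace marker used by SentencePiece-style vocabularies


def decode(pieces):
    """Toy decoder: join pieces, turn markers into spaces, drop leading spaces."""
    return "".join(pieces).replace(MARK, " ").lstrip(" ")


# Made-up segmentation of the end of the first example's mt string
pieces = [MARK + "ins", MARK + "Krankenhaus", MARK + "ein", "geliefert", MARK + "wurden", "."]

full = decode(pieces)
span = decode(pieces[2:4])  # decode only the tokens of the flagged span

print(repr(full))  # 'ins Krankenhaus eingeliefert wurden.'
print(repr(span))  # 'eingeliefert'
```

The span's tokens cover " eingeliefert" in the original string, including the space carried by the marker, but decoding them in isolation drops that space. That matches the mismatch between offsets and "text" reported above.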