Verbose output explaining the SubER score

patrick-wilken commented 2 years ago

@sarapapi That's what I have so far for #3.

Setting --suber-statistics as a command line option would lead to an output like:

{
    "SubER": 46.435,
    "#info": {
        "SubER": {
            "num_reference_words": 5946,
            "num_shifts": 230,
            "num_deletions": 828,
            "num_insertions": 783,
            "num_substitutions": 920
        }
    }
}

I'm not so sure about the output format; maybe I'm overdoing it with JSON. But I think it's better than writing to a separate file or just printing those statistics to stderr. The idea of the extra nesting level is that at some point we may want additional outputs for other metrics as well.
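As a sanity check on how to read these numbers: in this example the score is consistent with the sum of all edit operations divided by num_reference_words, times 100. A minimal sketch, assuming the JSON above is available as a string; the names `output` and `recompute_suber` are made up for illustration and are not part of the tool:

```python
import json

# The example output from above, pasted as a string (assumption: in practice
# it would be read from the tool's stdout).
output = """
{
    "SubER": 46.435,
    "#info": {
        "SubER": {
            "num_reference_words": 5946,
            "num_shifts": 230,
            "num_deletions": 828,
            "num_insertions": 783,
            "num_substitutions": 920
        }
    }
}
"""


def recompute_suber(stats):
    """Sum all edit operations and normalize by the reference length, in percent."""
    num_edits = (stats["num_shifts"] + stats["num_deletions"]
                 + stats["num_insertions"] + stats["num_substitutions"])
    return 100 * num_edits / stats["num_reference_words"]


result = json.loads(output)
stats = result["#info"]["SubER"]
recomputed = recompute_suber(stats)
print(round(recomputed, 3), result["SubER"])
```

The two printed values agree for the numbers shown here; whether the same relation holds once words and breaks are counted separately in the reference is not checked by this sketch.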

What could further be added here:

- Separating num_deletions into num_word_deletions and num_break_deletions, same for insertions and substitutions (substitution of a break is "end of block" <-> "end of line"). This gives some additional insights, for example whether there is over-/under-segmentation in general. But it requires an alignment of the words before and after the TER shifts so we know the positions of breaks in the edit operation "trace". Doable though... By the way, num_word_shifts / num_break_shifts does not really make sense because it's ambiguous: swapping a word and a subsequent break could either be a word shift to the right or a break shift to the left.
- Output of the full Levenshtein alignment, e.g. in the form of hypothesis + reference word lists and an alignment like 0-0 1-2 2-3 etc. This could be used to create visualizations like Figure 3 in the paper to see exactly which words / breaks are edited. Nice to have, but not so high priority for me at the moment, I would say.
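To illustrate how an alignment string such as 0-0 1-2 2-3 could be consumed if it were added to the output, here is a small sketch. Everything in it is hypothetical: the function name `parse_alignment`, the assumption that the first index refers to the hypothesis and the second to the reference, and the toy word lists.

```python
def parse_alignment(alignment):
    """Turn a string like '0-0 1-2 2-3' into a list of (int, int) index pairs."""
    pairs = []
    for item in alignment.split():
        first, second = item.split("-")
        pairs.append((int(first), int(second)))
    return pairs


# Made-up toy data, not taken from any real SubER output.
hypothesis = ["I", "recognized", "half", "<eol>"]
reference = ["I", "only", "recognized", "half"]

# Assumption: first index = hypothesis position, second = reference position.
for hyp_index, ref_index in parse_alignment("0-0 1-2 2-3"):
    print(hypothesis[hyp_index], "<->", reference[ref_index])
```

Unaligned positions (here reference index 1) would correspond to insertions or deletions, which is the information a visualization would need.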

sarapapi commented 2 years ago

Hi @patrick-wilken, thank you very much for the PR. It would be helpful for my analysis. I was wondering about the ambiguity between num_word_shifts and num_break_shifts. I think that if the shift operation involves a break, then it should be counted as a break shift, otherwise as a word shift; I see no ambiguity in this definition. A natural further improvement would certainly be to have word and break information isolated, as you said, especially for providing information about the segmentation, a critical aspect of subtitling. For example, I observed good (hence low, as you already know) SubER in some cases where Sigma (another metric, theoretically developed for evaluating subtitle segmentation) is bad (hence low), and I was wondering why this bad segmentation does not seem to have an impact on SubER. Having such a distinction might help to identify whether the metrics disagree or whether they somehow agree on the quality of the segmentation.

patrick-wilken commented 2 years ago

Sorry for the slow progress here. I now added separate statistics for word and break edit operations.

{
    "SubER": 46.435,
    "#info": {
        "SubER": {
            "num_reference_words": 5946,
            "num_shifts": 230,
            "num_word_deletions": 620,
            "num_break_deletions": 208,
            "num_word_insertions": 566,
            "num_break_insertions": 217,
            "num_word_substitutions": 834,
            "num_break_substitutions": 86
        }
    }
}
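The split counts add up to the totals from the first example (e.g. 620 + 208 = 828 deletions). A short sketch of how the per-operation share of break edits could be derived from this output; the helper name `summarize` and the "break_share" field are made up for illustration:

```python
# Split statistics copied from the example output above.
stats = {
    "num_word_deletions": 620,
    "num_break_deletions": 208,
    "num_word_insertions": 566,
    "num_break_insertions": 217,
    "num_word_substitutions": 834,
    "num_break_substitutions": 86,
}


def summarize(stats):
    """Per edit operation: total count and the fraction that concerns breaks."""
    summary = {}
    for op in ("deletions", "insertions", "substitutions"):
        words = stats[f"num_word_{op}"]
        breaks = stats[f"num_break_{op}"]
        total = words + breaks
        summary[op] = {"total": total, "break_share": breaks / total}
    return summary


for op, values in summarize(stats).items():
    print(op, values["total"], round(values["break_share"], 3))
```

Comparing the numbers of break insertions and break deletions is one possible way to look for general over- or under-segmentation, keeping in mind which direction the edit operations are defined in.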

> I think that if the shift operation involves a break, then it should be counted as a break shift otherwise as a word shift, I see no ambiguity in this definition.

I will have to look further into the implementation details and what actually happens in practice. But in principle you can get to the same sequence of shifted words by either shifting a word or a break. E.g. A B <eol> C -> A <eol> B C: Is it a shift of B or of <eol>? You could say that any shift going across a break position is also a break shift, but that is a bit complicated.

patrick-wilken commented 2 years ago

I am now also distinguishing between num_reference_words and num_reference_breaks. I also added tests; it seems to work and should already be safe to use.

I think I am still going to flip deletions and insertions in the statistics output. Usually you think of the edit operations as being performed on the hypothesis to transform it into the reference. That's the direction the TER code and paper use, and it is also what I currently use in the output. However, I think people are used to calling a missing word in the hypothesis a deletion (although the required edit operation in the above sense would be an insertion).

sarapapi commented 1 year ago

> I will have to look further into the implementation details and what actually happens in practice. But in principle you can get to the same sequence of shifted words by either shifting a word or a break. E.g. A B <eol> C -> A <eol> B C: Is it a shift of B or of <eol>? You could say that any shift going across a break position is also a break shift, but that is a bit complicated.

Hi, sorry for my late reply. I think that it should be counted as a break shift, not as both a break and a word shift. If it involves a break, it is always a break shift, otherwise a word shift. But this is my interpretation, of course.

patrick-wilken commented 1 year ago

Okay, let me get more technical. 😄 In the code a shift is defined by the tuple (start, length, target). start is the start position of the range of words to be shifted, length is the number of words to shift, and target is the word position to shift to. I guess what you are saying is that we define a shift as a break shift if and only if any word in the range start to start + length is a break token.

But that does not work because a shift can be expressed as multiple different tuples (start, length, target). In my example A B <eol> C -> A <eol> B C it could be start=2, length=1, target=1 (shift of <eol>) or start=1, length=1, target=2 (shift of B). By the definition above the first would be a break shift, the second one not. That should be regarded as a contradiction because both really describe the same shift. You could try to alter the definition by adding "or if there is any break token between positions start and target", i.e. we shift across some break position. That would be well defined, and then both are break shifts. However, in general that doesn't seem to fit what a break shift should be. For example:

1
00:00:00,000 --> 00:00:05,000
I recognized only
half of the people

vs.

1
00:00:00,000 --> 00:00:05,000
I recognized
half of the people only

You could say the first line break moved one position earlier, and by the extended definition it would be a break shift. But to me this looks like just a word shift of "only".
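The contradiction in the A B <eol> C example can be replayed with a small sketch. It assumes one particular reading of target, namely "cut the span out, then re-insert it at index target of the remaining tokens", chosen because it reproduces both tuples from the example; the actual TER code may index differently. The names `apply_shift` and `naive_is_break_shift` are made up for illustration.

```python
BREAK_TOKENS = {"<eol>"}  # other break tokens would be added here


def apply_shift(tokens, start, length, target):
    """Assumed semantics: cut tokens[start:start+length] and re-insert it
    at index `target` of the remaining tokens."""
    span = tokens[start:start + length]
    rest = tokens[:start] + tokens[start + length:]
    return rest[:target] + span + rest[target:]


def naive_is_break_shift(tokens, start, length):
    """First definition: break shift iff the shifted span contains a break."""
    return any(token in BREAK_TOKENS for token in tokens[start:start + length])


tokens = ["A", "B", "<eol>", "C"]

# Two different tuples that describe the same shift.
print(apply_shift(tokens, start=2, length=1, target=1))  # shift of <eol>
print(apply_shift(tokens, start=1, length=1, target=2))  # shift of B

# The naive definition classifies them differently.
print(naive_is_break_shift(tokens, start=2, length=1))
print(naive_is_break_shift(tokens, start=1, length=1))
```

Both calls to `apply_shift` produce the same token sequence, while the naive classification returns different answers for the two tuples.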

Thinking about it, what would work is to regard all shifts that either shift only a single break token or that shift across a single break token as break shifts, because those are the cases where the text stays the same and only the segmentation changes. All other shifts would then be "word or mixed shifts"...
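A minimal sketch of this proposed definition, not the actual implementation. It assumes that target is an index into the token list after the shifted span has been removed, and the name `is_break_shift` is made up for illustration.

```python
BREAK_TOKENS = {"<eol>"}  # other break tokens would be added here


def is_single_break(tokens):
    return len(tokens) == 1 and tokens[0] in BREAK_TOKENS


def is_break_shift(tokens, start, length, target):
    """Break shift iff the shifted span is a single break token, or the span
    is moved across exactly one token and that token is a break."""
    span = tokens[start:start + length]
    rest = tokens[:start] + tokens[start + length:]
    # Tokens the span jumps over (assumed: target indexes into `rest`).
    crossed = rest[min(start, target):max(start, target)]
    return is_single_break(span) or is_single_break(crossed)


# Both descriptions of the A B <eol> C shift are break shifts.
abc = ["A", "B", "<eol>", "C"]
print(is_break_shift(abc, start=2, length=1, target=1))
print(is_break_shift(abc, start=1, length=1, target=2))

# The subtitle example: moving "only" to the end is not a break shift,
# no matter which of the two equivalent tuples describes it.
subtitle = "I recognized only <eol> half of the people".split()
print(is_break_shift(subtitle, start=2, length=1, target=7))
print(is_break_shift(subtitle, start=3, length=5, target=2))
```

With this definition, equivalent tuples receive the same label in both examples from this thread, which is the property the earlier definitions were missing.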

sarapapi commented 1 year ago

Yes, I see the problem now, thanks for the explanation. I agree with your last comment: there are mixed cases, and these cannot be counted as break shifts only. I would go for your definition of break shift. Thank you again.

patrick-wilken commented 1 year ago

I rebased and now also switched deletions and insertions in the statistics, meaning edit operations are considered to be applied to the reference. Thus deletions are words missing in the hypothesis, and insertions are additional words in the hypothesis. This is not the direction the TER paper and code use, but as far as I know it is far more common (e.g. https://en.wikipedia.org/wiki/Word_error_rate). So we leave "num_shifts" as is for now? Then good to merge, I think.
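For statistics that were produced in the other direction, the switch amounts to swapping the deletion and insertion counts. A sketch, applied purely as an illustration to the counts posted earlier in this thread; the helper name `flip_direction` is made up and is not part of the tool:

```python
def flip_direction(stats):
    """Swap deletion and insertion counts; all other entries stay unchanged."""
    flipped = {}
    for key, value in stats.items():
        if "deletions" in key:
            key = key.replace("deletions", "insertions")
        elif "insertions" in key:
            key = key.replace("insertions", "deletions")
        flipped[key] = value
    return flipped


# Counts as posted earlier in this thread, before the switch.
old_stats = {
    "num_shifts": 230,
    "num_word_deletions": 620,
    "num_break_deletions": 208,
    "num_word_insertions": 566,
    "num_break_insertions": 217,
    "num_word_substitutions": 834,
    "num_break_substitutions": 86,
}

new_stats = flip_direction(old_stats)
print(new_stats["num_word_deletions"], new_stats["num_word_insertions"])
```

Substitutions and shifts are symmetric with respect to the direction, so only the two swapped categories change.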

apptek / SubER

Verbose output explaining the SubER score #4