Closed patrick-wilken closed 1 year ago
Hi @patrick-wilken, thank you very much for the PR. It would be helpful for my analysis.
I was wondering about the ambiguity between num_word_shifts and num_break_shifts, and I think that if the shift operation involves a break, then it should be counted as a break shift, otherwise as a word shift. I see no ambiguity in this definition.
For sure, a natural additional improvement would be to have word and break information isolated, as you said, especially for providing information about the segmentation, a critical aspect of subtitling. For example, I observed good (hence low, as you already know) SubER in some cases where Sigma (another metric, theoretically developed for evaluating subtitle segmentation) is bad (hence low), and I was wondering why this bad segmentation does not seem to have an impact on SubER. Maybe having such a distinction would help to identify whether there is a disagreement between the metrics or whether they agree somehow on the quality of segmentation.
Sorry for the slow progress here. I now added separate statistics for word and break edit operations.
{
"SubER": 46.435,
"#info": {
"SubER": {
"num_reference_words": 5946,
"num_shifts": 230,
"num_word_deletions": 620,
"num_break_deletions": 208,
"num_word_insertions": 566,
"num_break_insertions": 217,
"num_word_substitutions": 834,
"num_break_substitutions": 86
}
}
}
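As a quick sanity check on the numbers above, here is a minimal sketch. It assumes that SubER has the TER form (total edit operations divided by reference length, in percent) and that num_reference_words in this particular output is the full reference length used as the denominator. Under those assumptions the individual counts reproduce the reported score.

```python
import json

# Statistics output copied from the comment above.
output = json.loads("""
{
  "SubER": 46.435,
  "#info": {
    "SubER": {
      "num_reference_words": 5946,
      "num_shifts": 230,
      "num_word_deletions": 620,
      "num_break_deletions": 208,
      "num_word_insertions": 566,
      "num_break_insertions": 217,
      "num_word_substitutions": 834,
      "num_break_substitutions": 86
    }
  }
}
""")

stats = output["#info"]["SubER"]

# Assumption: every entry except the reference length counts edit operations.
num_edits = sum(v for k, v in stats.items() if k != "num_reference_words")

# Assumption: score = 100 * edits / reference length, as in TER.
score = round(100 * num_edits / stats["num_reference_words"], 3)

print(num_edits)  # 2761
print(score)      # 46.435
```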
> I think that if the shift operation involves a break, then it should be counted as a break shift otherwise as a word shift, I see no ambiguity in this definition.
I will have to look further into the implementation details and what actually happens in practice. But in principle you can get to the same sequence of shifted words by either shifting a word or a break.
E.g. A B <eol> C -> A <eol> B C: Is it a shift of B or of <eol>? You could say that any shift going across a break position is also a break shift, but that is a bit complicated.
Also distinguishing now between num_reference_words and num_reference_breaks.
And added tests, seems to work and should already be safe to use.
I think I am still going to flip deletions and insertions in the statistics output. Usually you think of the edit operations as being performed on the hypothesis to transform it into the reference. That's the direction the TER code and paper use, and the one I currently use in the output. However, I think people are used to calling a missing word in the hypothesis a deletion (although the required edit operation in the above sense would be an insertion).
Hi, sorry for my late reply. I think that it should be counted as a break shift, not as both a break and a word shift. If it involves a break, it is always a break shift, otherwise a word shift. But this is my interpretation, of course.
Okay, let me get more technical. 😄 In the code a shift is defined by the tuple start, length, target. start is the start position of the range of words to be shifted, length is the number of words to shift, and target is the word position to shift to. I guess what you are saying is that we define a shift as a break shift if and only if any word in the range start to start + length is a break token.
But that does not work because a shift can be expressed as multiple different tuples start, length, target. In my example A B <eol> C -> A <eol> B C it could be start=2, length=1, target=1 (shift of <eol>) or start=1, length=1, target=2 (shift of B). By the definition above the first would be a break shift, the second one not. That should be regarded as a contradiction because both really describe the same shift.
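The contradiction can be reproduced with a small sketch. This is not the SubER or TER code: apply_shift and naive_is_break_shift are hypothetical helpers, and the shift semantics (the block is removed and re-inserted so that it starts at index target of the resulting sequence) as well as the "<eob>" token name are assumptions chosen to match the two tuples in the example.

```python
# "<eol>" appears in the example above; "<eob>" (end of block) is an assumed token name.
BREAK_TOKENS = {"<eol>", "<eob>"}


def apply_shift(words, start, length, target):
    """Move words[start:start + length] so it starts at index `target` of the result.

    Assumed semantics, for illustration only.
    """
    block = words[start:start + length]
    rest = words[:start] + words[start + length:]
    return rest[:target] + block + rest[target:]


def naive_is_break_shift(words, start, length):
    """Naive definition: a break shift iff the shifted range contains a break token."""
    return any(word in BREAK_TOKENS for word in words[start:start + length])


words = ["A", "B", "<eol>", "C"]

result_shifting_eol = apply_shift(words, start=2, length=1, target=1)
result_shifting_b = apply_shift(words, start=1, length=1, target=2)

print(result_shifting_eol)  # ['A', '<eol>', 'B', 'C']
print(result_shifting_b)    # ['A', '<eol>', 'B', 'C']

# Same resulting sequence, but the naive definition labels the two tuples differently.
print(naive_is_break_shift(words, start=2, length=1))  # True
print(naive_is_break_shift(words, start=1, length=1))  # False
```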
You could try to alter the definition by adding "or if there is any break token between positions start and target", i.e. we shift across some break position. That would be well defined, and then both are break shifts. However, in general that doesn't seem to fit what a break shift should be. For example:
1
00:00:00,000 --> 00:00:05,000
I recognized only
half of the people
vs.
1
00:00:00,000 --> 00:00:05,000
I recognized
half of the people only
You could say the first line break moved one position earlier, and by the extended definition it would be a break shift. But to me this looks like just a word shift of "only".
Thinking about it, what would work is to regard as break shifts all shifts that either shift only a single break token or that shift across a single break token. Those are the cases where the text stays the same and only the segmentation changes. All other shifts would then be "word or mixed shifts"...
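A minimal sketch of that last definition, to check it against the two examples from this thread. classify_shift is a hypothetical helper, not part of SubER. It assumes that a shift removes the block and re-inserts it so that it starts at index target of the resulting sequence, and that "<eob>" is the name of the end-of-block token.

```python
# "<eol>" appears in the examples above; "<eob>" (end of block) is an assumed token name.
BREAK_TOKENS = {"<eol>", "<eob>"}


def is_single_break(tokens):
    return len(tokens) == 1 and tokens[0] in BREAK_TOKENS


def classify_shift(words, start, length, target):
    """Label a shift under the proposed definition (assumed shift semantics)."""
    block = words[start:start + length]
    rest = words[:start] + words[start + length:]
    # In `rest` the block originally sat at index `start`; moving it to `target`
    # jumps over exactly the tokens between those two positions.
    low, high = sorted((start, target))
    crossed = rest[low:high]
    if is_single_break(block) or is_single_break(crossed):
        return "break shift"
    return "word or mixed shift"


# A B <eol> C -> A <eol> B C: both tuples now get the same label.
example = ["A", "B", "<eol>", "C"]
label_eol = classify_shift(example, start=2, length=1, target=1)
label_b = classify_shift(example, start=1, length=1, target=2)
print(label_eol, "/", label_b)  # break shift / break shift

# The subtitle example: "only" moves from the first line to the end of the second.
subtitle = ["I", "recognized", "only", "<eol>", "half", "of", "the", "people", "<eob>"]
label_only = classify_shift(subtitle, start=2, length=1, target=7)
print(label_only)  # word or mixed shift
```

Both tuples for the A B <eol> C example come out as a break shift, while the shift of "only" does not, which matches the intuition described above.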
Yes, I see the problem now, thanks for the explanation. I agree with your last comment: there are mixed cases, and these cannot be counted as break shifts only. I would go for your definition of break shift. Thank you again.
I rebased and now also switched deletions and insertions in the statistics, meaning edit operations are considered to be applied to the reference and thus deletions are words missing in the hypothesis, insertions are additional words in the hypothesis. This is not the direction the TER paper and code use, but as far as I know it is far more common (e.g. https://en.wikipedia.org/wiki/Word_error_rate). So we leave "num_shifts" as is for now? Then good to merge, I think.
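To illustrate the convention described here, below is a minimal sketch using a plain word-level Levenshtein alignment (no shifts, no break tokens, so this is not the SubER computation). count_edit_ops is a hypothetical helper. Edit operations are applied to the reference, so a reference word missing in the hypothesis counts as a deletion.

```python
def count_edit_ops(reference, hypothesis):
    """Count substitutions, deletions and insertions that turn reference into hypothesis."""
    rows, cols = len(reference) + 1, len(hypothesis) + 1
    # table[i][j] = (total, subs, dels, ins) for reference[:i] vs. hypothesis[:j]
    table = [[None] * cols for _ in range(rows)]
    for i in range(rows):
        table[i][0] = (i, 0, i, 0)  # drop i reference words
    for j in range(cols):
        table[0][j] = (j, 0, 0, j)  # add j hypothesis words
    for i in range(1, rows):
        for j in range(1, cols):
            total, subs, dels, ins = table[i - 1][j - 1]
            if reference[i - 1] == hypothesis[j - 1]:
                diagonal = (total, subs, dels, ins)
            else:
                diagonal = (total + 1, subs + 1, dels, ins)
            total, subs, dels, ins = table[i - 1][j]
            deletion = (total + 1, subs, dels + 1, ins)
            total, subs, dels, ins = table[i][j - 1]
            insertion = (total + 1, subs, dels, ins + 1)
            table[i][j] = min(diagonal, deletion, insertion, key=lambda c: c[0])
    _, subs, dels, ins = table[-1][-1]
    return {"substitutions": subs, "deletions": dels, "insertions": ins}


# Hypothesis is missing the reference word "b": counted as a deletion.
missing_word = count_edit_ops(["a", "b", "c"], ["a", "c"])
print(missing_word)  # {'substitutions': 0, 'deletions': 1, 'insertions': 0}

# Hypothesis has an additional word "b": counted as an insertion.
extra_word = count_edit_ops(["a", "c"], ["a", "b", "c"])
print(extra_word)  # {'substitutions': 0, 'deletions': 0, 'insertions': 1}
```

Swapping the two arguments gives the opposite direction, in which the same missing word is reported as an insertion.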
@sarapapi That's what I have so far for #3.
Setting --suber-statistics as a command line option would lead to an output like:
Not so sure about the output format, maybe I'm overdoing this JSON format. But I think it's better than writing to a separate file or just printing those statistics to stderr. The idea of the extra nesting level is that maybe at some point we want additional outputs also for other metrics.
What could further be added here:
- Splitting num_deletions into num_word_deletions and num_break_deletions, same for insertions and substitutions (substitution of a break is "end of block" <-> "end of line"). This gives some additional insights, for example whether there is over-/under-segmentation in general. But it requires an alignment of the words before and after the TER shifts so we know the positions of breaks in the edit operation "trace". Doable though... By the way, num_word_shifts / num_break_shifts does not really make sense because it's ambiguous: swapping a word and a subsequent break could either be a word shift right or a break shift left.
- 0-0 1-2 2-3 etc. This could be used to create visualizations like Figure 3 in the paper to see which words / breaks exactly are edited. Nice to have, but not so high priority for me at the moment I would say.