apptek / SubER

SubER - Subtitle Edit Rate
Apache License 2.0

New metric "SubER-cased"; also use tokenizer for "WER-cased" #6

Closed: patrick-wilken closed this 1 year ago

patrick-wilken commented 1 year ago

See #5. This adds the metric "SubER-cased", a case- and punctuation-sensitive variant of SubER. Tokenization is applied so that punctuation marks are treated as separate tokens. Note that the analysis in our paper shows a weaker correlation with human post-editing effort for this variant. It might still be useful when punctuation and casing errors are considered to be of high importance.
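To illustrate what "treating punctuation as separate tokens" means, here is a minimal sketch. It is not the tokenizer used in this repository: the regex and the function names `tokenize_cased` and `normalize_uncased` are assumptions made purely for illustration.

```python
import re

def tokenize_cased(text):
    # Hypothetical tokenizer: keep casing, emit each punctuation
    # character as its own token next to the word tokens.
    return re.findall(r"\w+|[^\w\s]", text)

def normalize_uncased(text):
    # Hypothetical case- and punctuation-insensitive counterpart:
    # lowercase the text and keep word tokens only.
    return re.findall(r"\w+", text.lower())

cased_tokens = tokenize_cased("Hello, world!")
uncased_tokens = normalize_uncased("Hello, world!")
print(cased_tokens)    # ['Hello', ',', 'world', '!']
print(uncased_tokens)  # ['hello', 'world']
```

With such a tokenization, a missing comma counts as one token error and leaves the adjacent word untouched. A real tokenizer would handle apostrophes, numbers and abbreviations more carefully than this regex does.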

I also added tokenization to "WER-cased" for consistency with "SubER-cased". This makes sense intuitively and also yields a slightly higher correlation than what we reported for "WER + case/punct" in the paper. (The numbers in Table 1, row 2 become -0.685, -0.520, -0.504, -0.657.) I think no one relies on the exact behaviour of "WER-cased" yet, so it's ok to make a breaking change.
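The intuition behind tokenizing for a cased WER can be shown with a small, self-contained sketch. This is not the repository's implementation: the regex tokenizer and the helpers `tokenize`, `edit_distance` and `wer` are hypothetical names, and the word-level Levenshtein distance is written out by hand for illustration.

```python
import re

def tokenize(text):
    # Hypothetical tokenizer: words and single punctuation characters
    # become separate tokens, casing is preserved.
    return re.findall(r"\w+|[^\w\s]", text)

def edit_distance(ref, hyp):
    # Token-level Levenshtein distance (substitution, insertion and
    # deletion each cost 1), computed row by row.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match / substitution
        prev = curr
    return prev[-1]

def wer(ref_tokens, hyp_tokens):
    # Edit distance normalized by the reference length.
    return edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)

reference = "Hello, world!"
hypothesis = "Hello world"  # correct words, punctuation missing

wer_whitespace = wer(reference.split(), hypothesis.split())
wer_tokenized = wer(tokenize(reference), tokenize(hypothesis))
print(wer_whitespace)  # 1.0: "Hello," and "world!" both count as wrong
print(wer_tokenized)   # 0.5: only the two punctuation tokens are missing
```

Without tokenization, a punctuation error makes the whole attached word count as wrong. With tokenization, it is penalized as a single token error. This changes the scores, which is why it is a breaking change for anyone comparing against earlier "WER-cased" values.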