Punctuation and case sensitive

sarapapi commented 1 year ago

Hi @patrick-wilken, I am here again to point out some observations on the SubER scores I obtained when analyzing different models. I know that you found no significant differences in the correlation of SubER with and without punctuation and true casing, as reported in the paper, but I think it would be very useful to add an option to the SubER tool to choose whether punctuation and true casing are taken into account. Currently you normalize the text, so tokenization is not needed to compute TER (as far as I understood from your implementation), but it would become necessary if we avoid the normalization step (as they do in the sacrebleu tool). I noticed that computing SubER on normalized text strongly favors systems that are not good at inserting punctuation and capitalizing words correctly, and the resulting SubER scores in fact contradict not only the BLEU scores but also the Sigma scores. Just to give you an idea, I found that a system scoring 5 BLEU points lower than all the other systems I tested can achieve a lower (thus better) SubER. The difference in translation quality also emerges in manual evaluation, where the absence of punctuation strongly hinders understanding. Therefore, I suggest adding the option mentioned above and maybe exploring this aspect further.
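To make the effect concrete, here is a toy sketch, not SubER's actual implementation: a plain word-level edit rate (no shifts, no subtitle breaks) computed once on normalized text and once on cased text with punctuation marks as separate tokens. The function names `normalize`, `tokenize`, `edit_distance` and `word_edit_rate` are hypothetical, and the sentence pair is made up for illustration.

```python
import re


def normalize(text):
    """Lowercase and strip punctuation (a rough stand-in for text normalization)."""
    return re.sub(r"[^\w\s]", "", text.lower())


def tokenize(text):
    """Split the text into words and individual punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def edit_distance(hyp, ref):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            curr.append(min(
                prev[j] + 1,             # extra hypothesis token
                curr[j - 1] + 1,         # missing reference token
                prev[j - 1] + (h != r),  # substitution (cost 0 on a match)
            ))
        prev = curr
    return prev[-1]


def word_edit_rate(hyp, ref, cased):
    """Edits divided by reference length, with or without normalization."""
    if cased:
        hyp_tokens, ref_tokens = tokenize(hyp), tokenize(ref)
    else:
        hyp_tokens, ref_tokens = normalize(hyp).split(), normalize(ref).split()
    return edit_distance(hyp_tokens, ref_tokens) / len(ref_tokens)


reference = "Well, I think John is right."
hypothesis = "well i think john is right"  # no punctuation, no casing

print(word_edit_rate(hypothesis, reference, cased=False))  # 0.0
print(word_edit_rate(hypothesis, reference, cased=True))   # 0.625
```

On normalized text the hypothesis looks perfect, while the cased, tokenized variant counts three casing substitutions and two missing punctuation tokens against eight reference tokens.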

Thanks

patrick-wilken commented 1 year ago

Yes, thanks for those proposals. As you saw, we experimented with casing and punctuation, but also with tokenization, when designing the metric, and it is indeed a bit unfortunate that normalized SubER worked best in our experiments. 😅 I excluded the other versions from the code mainly to avoid confusion about the metric definition. But I guess I can add "SubER-cased" as a metric which would be true-cased and with punctuation, in analogy to the "WER-cased" metric. By the way, the default for TER in other tools is also case-insensitive... Regarding tokenization: the default for TER always seems to be to turn it off. Probably for historic reasons? I agree that it is intuitive to enable it, but I don't know whether anybody has shown rigorously that it improves the TER metric. I can revisit my experiments and see what numbers I get with/without tokenization for SubER.

sarapapi commented 1 year ago

Hi Patrick, it would be great to include SubER-cased. Moreover, I saw that in the original TER implementation (TERCOM) the input is not tokenized by default, but tokenization can be enabled with the "normalized" parameter, as in sacrebleu. However, in the official paper the authors wrote "In addition, punctuation tokens are treated as normal words and mis-capitalization is counted as an edit."; thus punctuation is treated as a token (which is true only if we tokenize, or "normalize" in TERCOM terms, the text) and the computation is actually case-sensitive. I think that the default parameters they set in the library differ from what they actually used for the official calculation (which I think is the correct one).

patrick-wilken commented 1 year ago

Ok, that all makes sense. :) I implemented it, see #6. Maybe you want to check the details. Another question is whether we should also change the "TER" metrics to be tokenized and case-sensitive. But I would rather keep it as just an interface to sacrebleu with default options, because it's not really the focus of this repo to provide all the options for the other metrics. It's easy to set them in suber/metrics/sacrebleu_interface.py if someone needs them, though.

sarapapi commented 1 year ago

Hi Patrick, sorry for my late reply, but I took some time to look at the implementation and compute the metrics myself. The cased version seems sound to me and the results are now consistent with those of the other metrics that I am using. Thanks again for your time.

patrick-wilken commented 1 year ago

That sounds good! I will merge then.

apptek / SubER

Punctuation and case sensitive #5