apptek / SubER

SubER - Subtitle Edit Rate
Apache License 2.0

SubER computation on more than 1 srt file #1

Closed · sarapapi closed 2 years ago

sarapapi commented 2 years ago

Hi, this is more a question than a problem: if I have more than 1 .srt file with which I can make a comparison, how can I compute the SubER metric (and also the AS-BLEU and t-BLEU metrics)? Is it sufficient to concatenate them and then compute the metrics, or do we need something more sophisticated?

Thanks for your work.

patrick-wilken commented 2 years ago

Hi, let me make sure I understand the question:

I have more than 1 .srt file with which I can make a comparison

So you are referring to using multiple references? This is not yet supported, and it's also not fully obvious how to do this with SubER; actually an interesting question we hadn't considered yet! I could put some thought into it. 😊

Or do you mean you have the subtitles for one video in multiple files corresponding to subsequent video segments? Then yes, you would need to concatenate them to score the video as a whole.

sarapapi commented 2 years ago

Hi, thanks for your quick response! The second option you mentioned was my original question, but the first one would also be helpful. I was wondering how to concatenate them correctly, because if I concatenate them directly I obtain a terrible SubER score compared to the ones obtained by computing the metric on the single srt files. Another option is to offset the start time of the srt files that follow the first such that they resemble a unique subtitle file, but I think that this will influence the evaluation (maybe the end of the previous srt could match the beginning of the next srt file in the reference). Maybe I am wrong. Do you have any suggestions on how to do this more correctly?

Another question, not related to this "issue", is about the outcome of some evaluations I made. One system turns out to have a high SubER but also high t-BLEU and AS-BLEU scores, while the other system shows exactly the opposite behavior. Do you have any hint about how to interpret these results? Thanks again!

patrick-wilken commented 2 years ago

Another option is to offset the start time [...] but I think that this will influence the evaluation.

Reference and hypothesis subtitles have to be consistent on the time scale, whatever you do. So if one is shifted this will lead to terrible SubER (and t-BLEU) scores. I will make that clear in the README. (But the absolute position in time should not matter, so you could shift both by the same duration.)

If I understand you correctly, you want to build a test set out of several video clips. And I guess what happens if you just concatenate the files is that the clips all start at 0 seconds and will therefore overlap in time, which breaks the metric computation. I will add an assertion to only allow input files where the subtitle timings are monotonically increasing.

So yes, you currently would have to shift all segments in time when concatenating. This simply corresponds to concatenating the original audio / video files to create a test set in the first place.
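For illustration, here is a rough sketch of such a concatenation (not part of SubER; it assumes the srt package from PyPI, and the helper name concatenate_srt_files is made up). The duration of each clip is assumed to be known, e.g. from the original video files; the end time of a clip's last subtitle is only a lower bound for its true duration:

```python
# Sketch: concatenate several srt files, shifting each file's timestamps
# by the total duration of the preceding clips.
from datetime import timedelta
import srt  # pip install srt

def concatenate_srt_files(paths, clip_durations):
    offset = timedelta(0)
    merged = []
    for path, duration in zip(paths, clip_durations):
        with open(path, encoding="utf-8") as f:
            for subtitle in srt.parse(f.read()):
                merged.append(srt.Subtitle(
                    index=None,  # compose() re-numbers the subtitles
                    start=subtitle.start + offset,
                    end=subtitle.end + offset,
                    content=subtitle.content))
        offset += duration
    # This is the condition the planned assertion would check:
    assert all(a.start <= b.start for a, b in zip(merged, merged[1:])), \
        "subtitle timings must be monotonically increasing"
    return srt.compose(merged)
```

The same clip durations must of course be used when concatenating the hypothesis files and the reference files, so that both stay consistent on the time scale.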

maybe the end of the previous srt could match the beginning of the next srt file in the reference

That's also how I would do it. We could add support for multiple files and do this automatically. Then again, this should be possible with other existing software, I'm not sure... Also note that evaluating the single files and then computing a weighted average should get you close to the score for the concatenated file, although having an exact score is obviously better...
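A sketch of that approximation (assuming each per-file score is an edit rate normalized by that file's reference length, so that reference length is the right weight; adjust the weights if the denominator differs):

```python
# Sketch: approximate the corpus-level score by a weighted average of
# per-file scores, weighting each file by its reference length
# (e.g. number of reference words).
def weighted_average(scores, reference_lengths):
    total = sum(reference_lengths)
    return sum(score * length
               for score, length in zip(scores, reference_lengths)) / total

# e.g. weighted_average([18.2, 25.7], [1043, 2410])  # hypothetical values
```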

patrick-wilken commented 2 years ago

One system results to have a high SubER but also a high t-BLEU and AS-BLEU scores while the other system shows exactly the opposite behavior.

Hard to say without seeing the file. But very bad segmentation, i.e. many line breaks at different positions than in the reference, would be one explanation.
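To illustrate with a constructed example (not taken from your data): the two versions below contain exactly the same words, so t-BLEU and AS-BLEU are barely affected, but the line break in a different position counts against SubER, which includes breaks in the edit distance:

```
Reference:                         Hypothesis:
1                                  1
00:00:01,000 --> 00:00:03,000      00:00:01,000 --> 00:00:03,000
We'll meet again at                We'll meet again
the old lighthouse.                at the old lighthouse.
```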

sarapapi commented 2 years ago

If I understand you correctly, you want to build a test set out of several video clips.

Yes, that is exactly what I meant. I will shift the following segments when concatenating, as you suggested. I think it would be useful to add this option to the library in the future, since it can happen that one has to evaluate subtitling systems on a test set consisting of several files, and consequently several srt files. Thank you for your replies!