apptek / SubER

SubER - Subtitle Edit Rate
Apache License 2.0

Statistical Significance / Confidence Intervals #7

Open mgaido91 opened 1 year ago

mgaido91 commented 1 year ago

Hi @patrick-wilken ,

I think it would be great to offer the option to test statistical significance between two hypotheses. This can be done with bootstrap resampling, although the main challenge would be figuring out how to sample from SRT files, since there is (in general) no alignment between the SRTs generated by two systems, nor with the references. Do you have comments/ideas on how to do this? I can also assist with the implementation.
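To make the proposal concrete, here is a rough sketch of the paired bootstrap test (Koehn, 2004) for comparing two systems. It assumes we already had aligned per-segment edit counts and reference lengths for both systems, which is exactly the part that is unclear for SRT files; all names and inputs below are illustrative, not part of SubER:

```python
import random

def paired_bootstrap(errors_a, errors_b, ref_lengths, n_samples=1000, seed=0):
    """Paired bootstrap resampling (Koehn, 2004) over aligned segments.

    errors_a / errors_b: hypothetical per-segment edit counts for
    systems A and B, aligned to the same segmentation.
    ref_lengths: per-segment reference lengths (same alignment).
    Returns the fraction of resamples in which system A achieves a
    lower error rate than system B.
    """
    assert len(errors_a) == len(errors_b) == len(ref_lengths)
    rng = random.Random(seed)
    n = len(ref_lengths)
    wins_a = 0
    for _ in range(n_samples):
        # Resample segments with replacement; the same indices are
        # used for both systems (that is what makes the test paired).
        idx = [rng.randrange(n) for _ in range(n)]
        rate_a = sum(errors_a[i] for i in idx) / sum(ref_lengths[i] for i in idx)
        rate_b = sum(errors_b[i] for i in idx) / sum(ref_lengths[i] for i in idx)
        if rate_a < rate_b:
            wins_a += 1
    return wins_a / n_samples
```

If the returned fraction is above, say, 0.95, system A is significantly better at that level. The open question is what the "segments" should be for subtitles.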

Thanks, Marco

patrick-wilken commented 1 year ago

Hi Marco, thanks for the proposal, definitely sounds like a useful addition! You are in particular referring to this paper: https://aclanthology.org/W04-3250.pdf, right?

Regarding sampling from subtitles: yes, that seems to be much less obvious than sampling from sentences. For the SubER calculation the files are already split into parts at points in time where both hypothesis and reference agree that there is no subtitle. So far this is an implementation detail for more efficient computation, but it is the closest thing to parallel segments that currently exists, and those could maybe be used as units for sampling. There are several problems with this, though:

1. the segmentation depends on the hypotheses;
2. there are probably too few segments, depending on the specific subtitle content;
3. the length of the segments varies greatly.

Another idea that comes to mind is to calculate the SubER edit operations on the whole file, sample a subset of reference subtitle blocks, and calculate SubER scores using only the edit operations (and reference length) corresponding to those blocks. But this is only brainstorming right now; I have to think it through... I will be travelling for the next two weeks, so I can only really look into this after that. 🙃

mgaido91 commented 1 year ago

Hi @patrick-wilken ! Thanks for your reply. Yes, that is the paper I was referring to. I have looked into the code over the past few days, and the easiest approach that comes to my mind is the following:

In the SubER for loop (https://github.com/apptek/SubER/blob/main/suber/metrics/suber.py#L29), we can keep track of the individual edits and reference lengths, instead of just accumulating them. Once we have these fine-grained statistics, we can bootstrap with them. I already have some sort of implementation doing this. The main issues in this case would be:

  1. How to integrate this cleanly into the tool?
  2. This way we can only compute confidence intervals, not statistical significance between two hypotheses. But the latter is very hard because of all the alignment issues, so as a first step, confidence intervals may be enough. What do you think?

Thanks, Marco