Unexpected behaviour (removal of words) when using SubER time alignment

miagrbr-MC commented 1 week ago

Good day,

Thank you so much for the creation and maintenance of SubER, I have found it to be very useful as part of research I am conducting regarding translation of subtitles :)

I have however noticed some unexpected behaviour when using SubER to align two sets of subtitles, in particular when using the time alignment method. After alignment, some words will be missing from the hypothesis.

To demonstrate this issue, I am using the ITV Studios dataset (available here: https://iwslt.org/2024/subtitling). In particular, I am using the 01.srt English subtitle file as the reference and the 01.srt Spanish subtitle file as the hypothesis.

I use SubER time alignment as follows (the referenced files are attached; I had to change the extension to txt instead of srt to allow upload to GitHub): en-01-original.txt es-01-original.txt SubER - Time.txt

python -m suber.tools.align_hyp_to_ref \
-H "es-01-original.srt" \
-R "en-01-original.srt" \
-o "SubER - Time.txt" \
-f SRT -F SRT -m time

After alignment, subtitle blocks 1 through 3 are removed entirely from the SubER - Time.txt file, and part of subtitle block 4 is removed while some of its words are kept. Words also go missing from subtitle block 5 (fotos) and from subtitle block 14 (padre). This issue is not limited to subtitle blocks near the start of the file; I have also seen it affect blocks in the middle and near the end.

This example is quite extreme, but I have also observed this behaviour with other subtitle files to a lesser extent where only a single word gets removed, which is a lot harder to notice.
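Since a single dropped word is hard to spot by eye, a small check can list which words are missing after alignment. The following is only a sketch: the function names are made up for illustration, it assumes the files have already been read into strings, and it treats a numeric line as an SRT index only when a timing line follows it. Formatting tags or differences in tokenization between the two files would show up as false positives.

```python
import re
from collections import Counter

# Matches an SRT timing line such as "00:00:01,000 --> 00:00:02,500".
TIMING = re.compile(r"^\d{2}:\d{2}:\d{2}[,.]\d{3}\s*-->\s*\d{2}:\d{2}:\d{2}[,.]\d{3}")


def subtitle_words(text):
    """Return the words of a subtitle file, skipping SRT index and timing lines.

    Plain text (one segment per line) passes through unchanged.
    """
    lines = text.splitlines()
    words = []
    for i, line in enumerate(lines):
        stripped = line.strip()
        if TIMING.match(stripped):
            continue
        next_line = lines[i + 1].strip() if i + 1 < len(lines) else ""
        # A purely numeric line directly before a timing line is a block index.
        if stripped.isdigit() and TIMING.match(next_line):
            continue
        words.extend(stripped.split())
    return words


def dropped_words(original, aligned):
    """Count words present in `original` but missing from `aligned`."""
    return Counter(subtitle_words(original)) - Counter(subtitle_words(aligned))


original = (
    "1\n00:00:01,000 --> 00:00:02,000\nMira las fotos\n\n"
    "2\n00:00:03,000 --> 00:00:04,000\nde mi padre\n"
)
aligned = "Mira las\nde mi\n"
print(dropped_words(original, aligned))
```

Here the two toy strings stand in for the original hypothesis file and the aligned output; the words reported are the ones that did not survive alignment.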

I would just like to check whether this is expected behaviour, and if so, whether there are specific "rules" that govern the removal/inclusion of words in the hypothesis after alignment. Notably, when using the Levenshtein method for alignment, there is no unexpected removal of words.

Some information regarding my environment is as follows:

OS is Ubuntu 22.04.3 LTS
Python 3.12.2
subtitle-edit-rate version 0.2.0
jiwer version 2.3.0
numpy version 1.26.4
python-Levenshtein version 0.12.2
sacrebleu version 2.0.0

Thank you again, I really appreciate your time :)

patrick-wilken commented 1 week ago

Hi,

thanks for reaching out. Yes, this is expected behavior. The align_hyp_to_ref tool is mainly there to inspect what segment alignment is done internally before calculating BLEU and other scores. And the "time" method is supposed to be an implementation of what is described for T-BLEU in https://www.isca-archive.org/interspeech_2021/cherry21_interspeech.pdf . See the sentence after equation 1:

Token y_t is aligned with the reference segment y∗ iff start(y∗) ≤ start(y_t) < end(y∗).

I interpret this to mean that hypothesis words which do not fall into any reference segment are dropped during time alignment. One could think of, e.g., assigning these words to the nearest reference segment, but it is not clear whether you would want that for T-BLEU calculation: words which are timed incorrectly should probably count as errors. And because no such rule is mentioned in the paper, I don't think it was done.
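The rule quoted above can be illustrated with a short sketch. This is not the SubER implementation: the `Word` and `Segment` types and the `time_align` function are hypothetical, and it assumes each hypothesis word already has a start time in seconds. A word goes to the reference segment whose half-open interval contains its start time, and is dropped if no such segment exists.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start: float  # start time in seconds (assumed to be known per word)


@dataclass
class Segment:
    start: float
    end: float


def time_align(words, segments):
    """Assign each word to the segment with start(seg) <= start(word) < end(seg).

    Returns (aligned, dropped): one word list per segment, plus the words
    that fall outside every segment.
    """
    aligned = [[] for _ in segments]
    dropped = []
    for word in words:
        for i, seg in enumerate(segments):
            if seg.start <= word.start < seg.end:
                aligned[i].append(word.text)
                break
        else:
            # No reference segment contains this word's start time.
            dropped.append(word.text)
    return aligned, dropped


segments = [Segment(1.0, 2.0), Segment(3.0, 4.0)]
words = [Word("Mira", 0.8), Word("las", 1.2), Word("fotos", 2.0),
         Word("de", 3.1), Word("padre", 4.5)]
aligned, dropped = time_align(words, segments)
print(aligned)
print(dropped)
```

In this toy input, "fotos" starts exactly at the end of the first segment, so the strict `<` on the segment end excludes it. That is the kind of harshness at segment edges the rule produces.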

We also found this dropping of words in T-BLEU problematic as it is too harsh at the edges of segments. But the purpose was to implement T-BLEU as a baseline, not to optimize it.

However: as you are using an English and a Spanish file I'm assuming you want to extract parallel source and target segments? That's quite a different task anyways. And not one that the SubER tool tries to solve. SubER gets the automatically created subtitles (hypothesis) and the ground truth reference (usually human created), and those should be in the same language, of course.

patrick-wilken commented 1 week ago

It would be pretty easy to modify the conditions here to not drop words: https://github.com/apptek/SubER/blob/main/suber/hyp_to_ref_alignment/time_alignment.py#L28

But I don't want to add that unless it is needed for some metric. The tools.align_hyp_to_ref tool is really just to see what happens internally. So if you take the time-aligned output hypothesis and call sacrebleu directly with that, it should give the same score as computing T-BLEU with the SubER tool. But it's not meant as a way to create a plain text test set from parallel subtitle files, for example. (In case that's what you're trying to do: in my experience you need a more involved method to get high quality parallel sentences anyway, for example vecalign. Just relying on time stamps is too risky, although it depends on the files, of course.)
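For illustration only, the "assign to the nearest reference segment" idea mentioned earlier could look roughly like the sketch below. It is not part of SubER and does not reproduce T-BLEU; the function name and the tuple-based inputs are assumptions, and it expects at least one reference segment.

```python
def align_to_nearest(words, segments):
    """Assign every word to a segment instead of dropping it.

    words:    list of (text, start_time) tuples
    segments: non-empty list of (start_time, end_time) tuples
    A word inside a segment stays there; otherwise it goes to the segment
    whose boundary is closest in time (ties go to the earlier segment).
    """
    aligned = [[] for _ in segments]
    for text, start in words:

        def distance(i):
            seg_start, seg_end = segments[i]
            if seg_start <= start < seg_end:
                return 0.0
            return min(abs(start - seg_start), abs(start - seg_end))

        nearest = min(range(len(segments)), key=distance)
        aligned[nearest].append(text)
    return aligned


segments = [(1.0, 2.0), (3.0, 4.0)]
words = [("Mira", 0.8), ("las", 1.2), ("fotos", 2.0), ("de", 3.1), ("padre", 4.5)]
print(align_to_nearest(words, segments))
```

With this variant no word is lost, but badly timed words are no longer penalized, which is the trade-off described above.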

miagrbr-MC commented 1 week ago

Thank you so much for the prompt and detailed response @patrick-wilken :) You are correct that I was trying to use SubER to create parallel subtitle files, but your explanation of how the original algorithm from Cherry works, as well as the rationale behind SubER not being intended for parallel file creation, makes perfect sense.

I will check out vecalign!

apptek / SubER

Unexpected behaviour (removal of words) when using SubER time alignment #11