emk / subtitles-rs

Use SRT subtitle files to study foreign languages (in progress)
Apache License 2.0

RFC: What should we do about overlapping subtitles? #60

Open emk opened 6 months ago

emk commented 6 months ago

The core substudy algorithms are all designed around non-overlapping subtitles. There's a built-in "cleaning" layer that will fix small overlaps as best as it can. But a few SRT files use partially overlapping subs to convey semantic and timing information, and other SRT files contain lots of garbage data.
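To make "fixing small overlaps" concrete, here is a minimal sketch of one way such a cleaning pass could work. It is not substudy's actual implementation: the `Sub` type, the `clean_small_overlaps` function, and the `max_fix_ms` threshold are all hypothetical names chosen for illustration.

```rust
/// Hypothetical subtitle type for illustration only (not substudy's own type).
#[derive(Debug, Clone, PartialEq)]
struct Sub {
    start_ms: u64,
    end_ms: u64,
    text: String,
}

/// Trim overlaps of at most `max_fix_ms` by ending the earlier subtitle
/// where the next one begins. If an overlap is larger than that, give up
/// and return the index (in start-time order) of the offending subtitle.
fn clean_small_overlaps(mut subs: Vec<Sub>, max_fix_ms: u64) -> Result<Vec<Sub>, usize> {
    subs.sort_by_key(|s| s.start_ms);
    for i in 1..subs.len() {
        // How far the previous subtitle runs past the start of this one.
        let overlap = subs[i - 1].end_ms.saturating_sub(subs[i].start_ms);
        if overlap > max_fix_ms {
            return Err(i - 1);
        }
        if overlap > 0 {
            subs[i - 1].end_ms = subs[i].start_ms;
        }
    }
    Ok(subs)
}
```

With a threshold of 100 ms, a subtitle ending 50 ms into its successor is trimmed, while a 30-second segment covering several shorter ones yields `Err` so the caller can report it.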

What should we do here? Major options include:

  1. Try a few simple things to produce non-overlapping subs, and if none of those work, try to issue a good error. This is the approach we took in #37. We could try to improve the "cleaning" algorithm to handle more cases, if we know what people are regularly encountering.
  2. Automatically combine subs with non-trivial overlap into one giant combined subtitle. This is tricky, especially with certain Whisper output, which will often produce a 30-second segment overlapping many shorter segments.
  3. Redesign all our algorithms and UI ideas to handle overlapping subtitles.
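As a rough illustration of option (2), the sketch below folds every run of overlapping subtitles into one combined subtitle. The `Sub` type and `merge_overlapping` function are hypothetical, and joining the texts with a newline is an assumption, not a decided design.

```rust
/// Hypothetical subtitle type for illustration only.
#[derive(Debug, Clone, PartialEq)]
struct Sub {
    start_ms: u64,
    end_ms: u64,
    text: String,
}

/// Merge each run of overlapping subtitles into a single subtitle that
/// spans the whole run and contains every text, one per line.
fn merge_overlapping(mut subs: Vec<Sub>) -> Vec<Sub> {
    subs.sort_by_key(|s| s.start_ms);
    let mut merged: Vec<Sub> = Vec::new();
    for sub in subs {
        if let Some(last) = merged.last_mut() {
            // Starts before the current group ends: absorb it into the group.
            if sub.start_ms < last.end_ms {
                last.end_ms = last.end_ms.max(sub.end_ms);
                last.text.push('\n');
                last.text.push_str(&sub.text);
                continue;
            }
        }
        // No overlap with the current group, so start a new one.
        merged.push(sub);
    }
    merged
}
```

This also shows why the Whisper case is awkward: a single 30-second segment absorbs every shorter segment it overlaps, so the combined subtitle can grow very large.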

I am honestly not too interested in pursuing (3) if I can possibly get good results (for most use cases) without it. But (1) vs (2) is a harder tradeoff and I'd love feedback on what people are encountering in their SRT files.

CC @aaron-meyers

aaron-meyers commented 5 months ago

My main concern with either 1 or 2 is that a lot of videos legitimately have overlapping subtitles, because there are multiple speakers simultaneously (e.g. a TV broadcaster in the background while another character is speaking). In some cases, the 'secondary' subtitle has some unique formatting that could be used to identify it and then treat it essentially as a separate track, but this would need to be detected per file (or we would need to implement a bunch of common patterns). For example, in Japanese, Netflix will generally display one subtitle on the bottom (like normal) and a secondary subtitle on the right (vertically). In English I've seen italics used for the secondary subtitle or even different colors (in .ass subtitles).
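One of the "common patterns" could look like the sketch below, which treats subtitles wrapped entirely in `<i>…</i>` as a secondary track. This is only an assumption about one possible heuristic: the `Sub` type and `split_by_italics` are hypothetical, and many files use italics for other purposes, so a per-file check would still be needed.

```rust
/// Hypothetical subtitle type for illustration only.
#[derive(Debug, Clone, PartialEq)]
struct Sub {
    start_ms: u64,
    end_ms: u64,
    text: String,
}

/// Split subtitles into (primary, secondary) tracks, where "secondary"
/// means the whole text is wrapped in a single pair of italic tags.
fn split_by_italics(subs: Vec<Sub>) -> (Vec<Sub>, Vec<Sub>) {
    subs.into_iter().partition(|s| {
        let t = s.text.trim();
        // `partition` puts items for which this is true in the first Vec,
        // so return true for the non-italic (primary) subtitles.
        !(t.starts_with("<i>") && t.ends_with("</i>"))
    })
}
```

Each resulting track could then be checked for overlaps on its own, since the two speakers no longer collide within a single track.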

I haven't looked at your alignment algorithm and I haven't actually tried to implement one myself yet. I was going to start with something pretty simple: iterating over the native (base) subtitle items and aligning the reference subtitles when they overlap the native subtitle item by more than some percentage (maybe 90%+ by default), with a more relaxed match if there aren't any overlapping subtitles in each track. This is probably naive though 😅
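A minimal sketch of that overlap-threshold idea follows. It assumes "overlap" means the fraction of the reference subtitle's duration that falls inside the native subtitle, which is only one possible reading. The `Sub` type, `overlap_fraction`, and `align` are hypothetical names, and this is not substudy's alignment algorithm.

```rust
/// Hypothetical subtitle type for illustration only.
#[derive(Debug, Clone, PartialEq)]
struct Sub {
    start_ms: u64,
    end_ms: u64,
    text: String,
}

/// Fraction (0.0 to 1.0) of `reference`'s duration that lies inside `native`.
fn overlap_fraction(native: &Sub, reference: &Sub) -> f64 {
    let start = native.start_ms.max(reference.start_ms);
    let end = native.end_ms.min(reference.end_ms);
    let duration = reference.end_ms.saturating_sub(reference.start_ms);
    if duration == 0 {
        return 0.0;
    }
    end.saturating_sub(start) as f64 / duration as f64
}

/// For each native subtitle, collect the indices of the reference
/// subtitles whose overlap fraction is at least `min_fraction`.
fn align(native: &[Sub], reference: &[Sub], min_fraction: f64) -> Vec<Vec<usize>> {
    native
        .iter()
        .map(|n| {
            reference
                .iter()
                .enumerate()
                .filter(|(_, r)| overlap_fraction(n, r) >= min_fraction)
                .map(|(i, _)| i)
                .collect()
        })
        .collect()
}
```

Calling `align(&native, &reference, 0.9)` gives the strict match, and a relaxed pass can reuse the same function with a lower threshold. The sketch compares every pair, so it is quadratic in the number of subtitles; a real implementation would likely walk both sorted tracks in step.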