SubtitleEdit / subtitleedit

the subtitle editor :)
http://www.nikse.dk/SubtitleEdit/Help
GNU General Public License v3.0

Suggestion for a tool "Merge short line with previous or next line" #8714

Open oep42 opened 1 month ago

oep42 commented 1 month ago

A problem with using Whisper to generate subtitles is that it sometimes generates very short subtitles. Until now, there has been no automatic method in Subtitle Edit to correct such short subtitles.

Such short subtitles have been discussed before here and here.

Examples of short subtitle lines:

Short line, example 1

Short line, example 2

Short line, example 3

Short line, example 7

Short line, example 8

Short line, example 9

Short line, example 10

Short line, example 11

Short line, example 17-A

Short line, example 13

Short line, example 21-A

This is about short subtitle lines that are preceded and followed by longer subtitle lines. I suggest adding a "Merge short line with previous or next line" tool to Subtitle Edit, which can automatically do one of these two things: merge the short line with the previous line, or merge the short line with the next line.

Rules are needed to choose between these two alternatives. These rules are described below.

The current "Merge short lines" tool could be used as a starting point for creating the "Merge short line with previous or next line" tool. The new tool could be available both in the Tools menu and in "Batch convert" as a convert option.

Strategy

If the previous/next line is about the same length as the maximum allowed line length, there may be no space to add a short line to it. So to allow this approach of automatically adding short lines to the previous/next line, the previous/next line must be a little shorter than the maximum allowed length. This "a little shorter" should be equal to the maximum length of a short line.

(Well, not exactly, because when a short line is added to another line, a space is also added. However, a space may not be counted as a character by Subtitle Edit, depending on the value of "Cps/line-length" in the general settings rules of Subtitle Edit. For simplicity's sake, this adding of a space when merging lines will be disregarded here.)

Based on this, I devised a strategy to make an automatic correction of short subtitles feasible. This is an example of this strategy:

What matters is that the maximum line length of subtitles created by Whisper is shorter than the maximum line length allowed in Subtitle Edit. So values other than 46 and 37 can be used.

If you use a strategy where the maximum length of subtitle lines generated by Whisper is shorter than the maximum length of subtitle lines in the final result, it is possible to automatically merge short lines with the previous or next line, provided these short lines are shorter than or equal to the difference between these maximum lengths.
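The arithmetic behind this strategy can be sketched as follows. This is only an illustration: the function names are hypothetical, 46 is taken as the maximum line length in Subtitle Edit and 37 as the maximum line length used with Whisper, and the joining space is disregarded by default, as in the text above.

```python
def guaranteed_short_line_length(final_max: int, whisper_max: int) -> int:
    """Largest short-line length that always fits when merged into a neighbour.

    final_max   -- maximum line length allowed in the final result
    whisper_max -- maximum line length used when generating the subtitles
    """
    if whisper_max >= final_max:
        raise ValueError("whisper_max must be smaller than final_max")
    return final_max - whisper_max


def merged_length(line: str, short_line: str, count_space: bool = False) -> int:
    """Length of the merged line; the joining space is ignored unless requested."""
    return len(line) + len(short_line) + (1 if count_space else 0)


headroom = guaranteed_short_line_length(final_max=46, whisper_max=37)
print(headroom)  # 9
```

With these values, any short line of up to 9 characters fits into a neighbouring line of up to 37 characters without exceeding 46.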

Additional parameters

In this context, when using a Purfview Faster-Whisper engine, it makes sense to also use the --sentence parameter. The --sentence parameter ensures that the output of the engine is split into sentences, each starting on a new line. Placing the beginning of a sentence at the beginning of a subtitle line improves the subtitling quality. (The --sentence parameter does not take a numerical value.)

There are two ways to use the --sentence parameter:

About --max_line_count and --max_comma_cent:

Rules for merging short lines

I suggest using these rules for merging short lines:

[A] If the short line ends with an abbreviation from a list such as ['Arq.', 'bl.', 'Capt.', 'Cnel.', 'Col.', 'dipl.', 'Dhr.', 'Dr.', 'Dra.', 'dr.', 'DR.', 'DRA.', 'Eng.', 'eng.', 'Exmo.', 'exmo.', 'Fr.', 'Frk.', 'Gen.', 'Gov.', 'gov.', 'Gral.', 'gosp.', 'g.', 'Hptm.', 'Hra.', 'Hr.', 'Ing.', 'ing.', 'Jr.', 'jr.', 'Kenr.', 'Lic.', 'Lt.', 'M.', 'Mevr.', 'Mme.', 'Mna.', 'mna.', 'Mno.', 'mno.', 'Mr.', 'mr.', 'Mrs.', 'mrs.', 'Mtra.', 'Mtro.', 'Ms.', 'P.', 'Pe.', 'pe.', 'Pr.', 'Prof.', 'prof.', 'Rev.', 'Rva.', 'Sgto.', 'Sgt.', 'Sig.', 'Sr.', 'sr.', 'Sra.', 'Srta.', 'Srs.', 'srs.', 'St.', 'sv.', 'Tte.', 'Tri.', 'vlč.', 'Vol.', 'vol.', 'vols.'], and if the next line starts with a capital letter, then merge the short line with the next line.

[B] Else if the short line starts with a capital letter, and if the previous line ends with an abbreviation from a list of abbreviations such as the one above, then merge the short line with the previous line.

[C] Else if the short line ends with a separate capital_letter-period combination like "A." or "A.A.", and if the next line starts with a capital letter, then merge the short line with the next line.

[D] Else if the short line starts with a capital letter, and if the previous line ends with a separate capital_letter-period combination like "A." or "A.A.", then merge the short line with the previous line.

[E] Else if the short line begins with a lowercase letter, a capital letter, or a digit:

[F] Else if the short line begins with a hyphen or a comma, then merge the short line with the previous line.

[G] Else merge the short line with the shortest line (in terms of characters) of the previous or next line.

Of course, when using these rules, check whether a previous or next line exists, and verify that all other relevant rules/options are met. It is expected that rule [E] will cover the vast majority of cases.
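A minimal sketch of how this rule cascade could be evaluated is given below. It is not Subtitle Edit code. All names are hypothetical, the abbreviation tuple is a tiny subset of the list in rule [A], and the example sentences in the usage are invented. The sub-rules of [E] are not spelled out above, so the sketch only reports that rule [E] applies and leaves the target undecided. Two further assumptions: a rule is skipped when the neighbour it needs does not exist, and rule [G] prefers the previous line on a tie.

```python
import re
from typing import Optional, Tuple

# Tiny illustrative subset of the abbreviation list in rule [A].
ABBREVIATIONS = ("Dr.", "Mr.", "Mrs.", "Prof.", "Sgt.")

# A separate capital-letter-period combination such as "A." or "A.A." at the end.
INITIALS_AT_END = re.compile(r"(?:^|\s)(?:[A-Z]\.)+$")


def starts_with_capital(text: str) -> bool:
    return bool(text) and text[0].isupper()


def ends_with_abbreviation(text: str) -> bool:
    words = text.split()
    return bool(words) and words[-1] in ABBREVIATIONS


def choose_merge_target(prev: Optional[str], short: str,
                        nxt: Optional[str]) -> Tuple[str, Optional[str]]:
    """Return (rule, target); target is "previous", "next", or None if undecided."""
    short = short.strip()
    has_prev, has_next = prev is not None, nxt is not None
    prev = prev.strip() if has_prev else ""
    nxt = nxt.strip() if has_next else ""

    if has_next and ends_with_abbreviation(short) and starts_with_capital(nxt):
        return "A", "next"
    if has_prev and starts_with_capital(short) and ends_with_abbreviation(prev):
        return "B", "previous"
    if has_next and INITIALS_AT_END.search(short) and starts_with_capital(nxt):
        return "C", "next"
    if has_prev and starts_with_capital(short) and INITIALS_AT_END.search(prev):
        return "D", "previous"
    first = short[:1]
    if first.islower() or first.isupper() or first.isdigit():
        return "E", None  # sub-rules of [E] are not given in the text
    if has_prev and short.startswith(("-", ",")):
        return "F", "previous"
    # Rule [G]: merge with the shorter neighbour (assumption: previous wins a tie).
    if has_prev and has_next:
        return "G", ("previous" if len(prev) <= len(nxt) else "next")
    if has_prev:
        return "G", "previous"
    if has_next:
        return "G", "next"
    return "G", None


print(choose_merge_target("He went to see", "Dr.", "Kissinger yesterday."))
print(choose_merge_target("and that was all", "- Yes.", "Then we went home."))
```

The length checks from the tool options are deliberately left out here; they would run before or after this selection.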

Instead of using a single list of abbreviations, it may be better to use one or more do-not-break-after lists from "Options/Settings/Tools/Auto br", with the possibility to select a list of a specific language, or select lists of several languages, or select all do-not-break-after lists. (Copilot, and probably other AI systems, are capable of generating lists of such abbreviations. See also here.)

It is also possible to use a do-not-break-after list with articles such as "the" and "a". In this case, the requirement in rules [A] and [B] that the abbreviation (here: article) be followed by a capital letter should be dropped. Alternatively, another set of rules similar to [A] and [B] (specifically for articles) can be added that does not include this requirement.

Options for the tool

I suggest giving the "Merge short line with previous or next line" tool these six options:

The strictest maximum limit of the first and second options will be applied. This limit refers to the maximum length or duration of a short line.

The strictest maximum limit of the third and fourth options will be applied. This limit refers to the maximum length or duration of the subtitle line that results from the merging.

Analogous to the suboptions of the batch convert option "Apply duration limits", the numerical value of the third and fourth options can be based on a numerical value in the general settings rules of Subtitle Edit:

It may be obvious to set the "Max. characters in short line" option to a value equal to the difference between Single line max. length and --max_line_width, because this difference represents the 'guaranteed space' to add a short line to the previous or next line. However, it is not necessary to choose this difference as a value for "Max. characters in short line". When using "Max line length" and/or "Max line duration" as a constraint, it is possible to set a higher value for "Max. characters in short line" than that difference. This will allow longer short lines to be merged as well, depending on local circumstances (i.e. merge if there is enough space in the previous or next line).

When using "Max line length" and/or "Max line duration" as a constraint, the difference between Single line max. length and --max_line_width is the lowest sensible value for the "Max. characters in short line" option. In practice, then, this difference can be seen as the minimum value for that option.

The "Max. milliseconds in short line" option can be used to exclude short lines with a relatively long duration from being merged.
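As a rough sketch of how the four limits described above could interact: a line counts as short only if it satisfies both the character limit and the duration limit, and a merge is allowed only if the result satisfies both "Max line length" and "Max line duration". The Cue class and the function names are hypothetical, and the joining space is disregarded as before.

```python
from dataclasses import dataclass


@dataclass
class Cue:
    text: str
    start_ms: int
    end_ms: int

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms


def is_short_line(cue: Cue, max_chars: int, max_ms: int) -> bool:
    """Both limits must hold, so the strictest one decides."""
    return len(cue.text) <= max_chars and cue.duration_ms <= max_ms


def merge_fits(a: Cue, b: Cue, max_line_length: int,
               max_line_duration_ms: int) -> bool:
    """Check the merged line against both the length and the duration limit."""
    merged_chars = len(a.text) + len(b.text)  # joining space disregarded
    merged_ms = max(a.end_ms, b.end_ms) - min(a.start_ms, b.start_ms)
    return (merged_chars <= max_line_length
            and merged_ms <= max_line_duration_ms)


short = Cue("Yes.", 2100, 2500)
neighbour = Cue("a" * 37, 0, 2000)
print(is_short_line(short, max_chars=9, max_ms=1000))  # True
print(merge_fits(neighbour, short, 46, 7000))          # True
```

Setting a higher value for max_chars than the guaranteed space simply lets merge_fits decide case by case whether the neighbour has room.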

Multiple short lines in a row

Short line, example 15

Short line, example 6

Short line, example 14

Sometimes there are multiple short lines in a row. Before using the "Merge short line with previous or next line" tool, use the "Merge short lines" tool to merge them. Only after this has been done is it clear whether a single short line that was part of such a run remains. So it is important that the "Merge short line with previous or next line" tool is only used after the "Merge short lines" tool has been used.

When using the "Merge short lines" tool in this context, it is important that its "Maximum characters in one paragraph" (= "Maximum characters in one line") option is set to the same value that was used with --max_line_width when generating the subtitles with Whisper. When following the example above, this value would be 37. With the same value, the maximum line length of the results of the "Merge short lines" tool stays the same as in the originally generated subtitles, which the strategy described above requires.

One aspect of the "Merge short lines" tool is not fully satisfactory. See the example below.

Short line, example 16

In this example, it would be best to merge "feelings." with the previous line, and not with the next line as the "Merge short lines" tool would do. (Of course, this is only best when using a strategy where the maximum length of the originally generated subtitle lines is shorter than the maximum length of the subtitle lines in the final result.)

To prevent a merge like this, this option could be added to the "Merge short lines" tool: "Do not merge short end of sentence with next line" (with an option to set max. characters for "short end of sentence"). {A "short end of sentence" is a short line that ends with one of these characters [. , : ; ! ?], while the previous line does not end with one of these characters [. , : ; ! ?]. So it's actually more than just the end of a sentence, but I couldn't think of a better name.} If this option were added to the "Merge short lines" tool, and if there were a tool that can merge "feelings." with the previous line (like the tool that is suggested here), then that would be great.
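The definition of a "short end of sentence" given above can be written as a small predicate. This is a sketch only: the function name and the max_chars parameter are hypothetical, and the previous line in the usage example is invented around the word "feelings." from the example.

```python
END_CHARS = (".", ",", ":", ";", "!", "?")


def is_short_end_of_sentence(prev_text: str, text: str, max_chars: int) -> bool:
    """True if `text` is short, ends with one of [. , : ; ! ?],
    and the previous line does not end with one of those characters."""
    text = text.rstrip()
    prev_text = prev_text.rstrip()
    return (len(text) <= max_chars
            and text.endswith(END_CHARS)
            and not prev_text.endswith(END_CHARS))


print(is_short_end_of_sentence("I never talk about my", "feelings.", 12))  # True
print(is_short_end_of_sentence("That is all.", "Okay.", 12))               # False
```

A line flagged by this predicate would be left alone by "Merge short lines" and picked up later by a tool that can merge it with the previous line.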

(I would suggest changing the name of the "Maximum characters in one paragraph" option of the "Merge short lines" tool to "Maximum characters in one line", because this is about lines, not about paragraphs.)

Added later: In a situation of multiple consecutive short lines, it may be unnecessary to use the "Merge short lines" tool before using the "Merge short line with previous or next line" tool. Perhaps the "Merge short line with previous or next line" tool can, by itself, properly process multiple short lines in a row (if a high value is set for "Max. characters in short line"). If so, there is no need to add a "Do not merge short end of sentence with next line" option to the "Merge short lines" tool.

Unsolved problem

In a case like this, there is a problem with rule [C] (if the short line ends with a separate capital_letter-period combination like "A." or "A.A.", and if the next line starts with a capital letter, then merge the short line with the next line): Short line, example 18-A

This would also be a problem: Short line, example 19-A

This wouldn't be a problem: Short line, example 20-A

Based only on the syntax of the subtitle text, it is impossible to differentiate between initials and things that have a capital letter as their name. If it is estimated that separate capital_letter-period combinations like "A." or "A.A." are more common than things that have a capital letter as their name, it may be best to keep this rule. Or perhaps make this rule optional (including the optional exclusion of the relevant cases from rule [G]). [An alternative might be to create an optional do-not-break-after list for separate capital_letter-period combinations like "A.", and another optional do-not-break-after list for separate double_capital_letter-period combinations like "A.A.".]
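The ambiguity can be made concrete with a small pattern check. The regular expression below is only one possible way to detect a separate capital_letter-period combination at the end of a line, and the sample lines are invented. It returns the same kind of match for an initial and for a name such as "Plan B", which is exactly why syntax alone cannot tell them apart.

```python
import re
from typing import Optional

# One or more capital-letter-period pairs standing alone at the end of the line.
CAPITAL_PERIOD_AT_END = re.compile(r"(?:^|\s)((?:[A-Z]\.)+)$")


def trailing_capital_period(text: str) -> Optional[str]:
    """Return the trailing "A." or "A.A." style token, or None if there is none."""
    match = CAPITAL_PERIOD_AT_END.search(text.rstrip())
    return match.group(1) if match else None


print(trailing_capital_period("It was written by F."))   # F.
print(trailing_capital_period("Let's go with Plan B."))  # B.
print(trailing_capital_period("We spoke with Dr."))      # None
```

Both of the first two lines match, although only the first one should be merged with the next line.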

Rule [D] has a problem that is analogous to the problem with rule [C]. The same considerations for addressing the problem with rule [C] apply to the problem with rule [D].


(Most of the examples above were created in an empty Subtitle Edit with a Purfview Faster-Whisper engine and the "distil-large-v3" or "large-v2" model, with at least these advanced parameters for the engine: "--max_line_width 37 --max_line_count 2 --sentence --max_comma_cent 50 --threads 24". No checkmarks at "Auto adjust timings" and "Use post-processing" in the "Audio to text" dialog box. After that, "Batch convert" was used with these convert options: "Remove text for HI", "Fix common errors", "Auto balance lines", "Apply minimum gap between subtitles", "Merge short lines", "Add durations [adding 0.100 or 0.400 seconds]". Only four of the examples above are invented: one example with "Dr. Kissinger", two examples with "Plan B", and one example with "F. Brill".)

BenJamesAndo commented 2 weeks ago

I managed to find a workaround that gets it to a point where I'm happy with it. This is my command line. I run whisper-faster.exe with --model medium.en --output_format srt --sentence --word_timestamps true --device cuda

Then in Subtitle Edit I run SubtitleEdit /convert "example.srt" srt /FixCommonErrors /SplitLongLines /overwrite. Then the trick is to run /SplitLongLines two more times. After that, the final version of the .srt file is laid out quite nicely. I have Break/split long lines set to 37 single line maximum length and 74 line maximum length.
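For anyone who wants to script these passes, a sketch could look like this. It only builds the argument lists from the command quoted above. It assumes that SubtitleEdit is on the PATH and that the two extra passes repeat only /SplitLongLines with /overwrite; the post does not say whether /FixCommonErrors is repeated as well. The function name is hypothetical.

```python
from typing import List


def build_passes(srt_path: str, extra_split_passes: int = 2) -> List[List[str]]:
    """Return one argument list per SubtitleEdit /convert pass."""
    first = ["SubtitleEdit", "/convert", srt_path, "srt",
             "/FixCommonErrors", "/SplitLongLines", "/overwrite"]
    # Assumption: later passes only split long lines again.
    repeat = ["SubtitleEdit", "/convert", srt_path, "srt",
              "/SplitLongLines", "/overwrite"]
    return [first] + [list(repeat) for _ in range(extra_split_passes)]


for argv in build_passes("example.srt"):
    print(" ".join(argv))
    # To actually run a pass: subprocess.run(argv, check=True)
```

Because /overwrite replaces the file in place, it may be wise to work on a copy of the .srt file.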

Reading through your post, you've outlined some very good guidelines for how it could work. I'm not sure if my solution will be all you're looking for, but maybe it'll help a little bit.

oep42 commented 2 weeks ago

Thanks for sharing your workaround. Your workaround is creative in that it uses the same action repeatedly, but I'm afraid I'll pass on it.

I did some experiments with --word_timestamps true and then merging the separate words in Subtitle Edit afterwards. For me, the results had a less accurate synchronisation between subtitle and spoken words than an alternative way of generating subtitles without --word_timestamps true.

Also, automatically splitting long lines in Subtitle Edit is problematic, because it often results in an inaccurate synchronisation at the point where the split was made. Within Subtitle Edit, automatically splitting things is unsafe; only automatically merging things is safe. So any approach in which already synchronized long lines are automatically split within Subtitle Edit is unsafe.

However, this idea that automatically splitting long lines is unsafe only applies when word-level synchronisation is the goal, that is, when the subtitle language is the same as the spoken language. If the subtitle language is a translation of the spoken language, the goal is more like sentence-level synchronisation, and then automatic splitting by Subtitle Edit is not a problem.