Softcatala / whisper-ctranslate2

Whisper command line client compatible with original OpenAI client based on CTranslate2.
MIT License

More natural line-wrapping when using --max_line_width #78

Open JonasCz opened 8 months ago

JonasCz commented 8 months ago

By default, Whisper produces subtitles (SRT/VTT) whose lines are often quite long. For some uses these can be too long for viewers to read comfortably (a common recommendation is that subtitle lines should be ~50 characters at most). For example, testing with "The Expert":


1
00:00:00,000 --> 00:00:04,440
Our company has a new strategic initiative to increase market penetration,

2
00:00:05,120 --> 00:00:07,720
maximise brand loyalty and enhance intangible assets.

3
00:00:08,080 --> 00:00:13,660
In pursuit of these objectives, we've started a new project for which we require seven red lines.

If I want them shorter, I can use something like --max_line_count 2 --max_line_width 50. This does produce very consistent, short lines, but with the current line-wrapping implementation the subtitles are quite unnatural to read, because line and subtitle breaks do not fall on sentence or clause boundaries:


1
00:00:00,000 --> 00:00:05,800
Our company has a new strategic initiative to
increase market penetration, maximise brand

2
00:00:05,800 --> 00:00:11,700
loyalty and enhance intangible assets. In pursuit
of these objectives, we've started a new project

3
00:00:11,700 --> 00:00:16,480
for which we require seven red lines. I understand
your company can help us in this matter. Of
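
The breaks above are what plain greedy word wrapping would give. As a rough illustration only (this uses the standard library's textwrap and is not the project's actual implementation, which works from word timestamps; the name greedy_cues is made up for this sketch), filling each line up to the width limit and grouping lines into cues gives the same text layout:

```python
import textwrap

# Transcript text taken from the example above.
TEXT = (
    "Our company has a new strategic initiative to increase market "
    "penetration, maximise brand loyalty and enhance intangible assets. "
    "In pursuit of these objectives, we've started a new project for which "
    "we require seven red lines. I understand your company can help us in "
    "this matter."
)


def greedy_cues(text, max_line_width, max_line_count):
    """Fill each line greedily, then group lines into cues (hypothetical helper)."""
    lines = textwrap.wrap(text, width=max_line_width)
    return [
        lines[i:i + max_line_count]
        for i in range(0, len(lines), max_line_count)
    ]


cues = greedy_cues(TEXT, max_line_width=50, max_line_count=2)
for number, cue in enumerate(cues, start=1):
    print(number)
    print("\n".join(cue))
    print()
```

Because the wrap only looks at character counts, a cue can end in the middle of a clause ("maximise brand") even when a comma or period is a few words away.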

This PR changes this by wrapping lines in a more natural way: it splits them on periods or commas if possible, and otherwise on the longest gap around the middle of the too-long line. The result is text that reads more naturally while staying within the set --max_line_width constraint:

1
00:00:00,000 --> 00:00:04,440
Our company has a new strategic
initiative to increase market penetration,

2
00:00:05,120 --> 00:00:07,720
maximise brand loyalty and
enhance intangible assets.

3
00:00:08,080 --> 00:00:12,060
In pursuit of these objectives,
we've started a new project for which

4
00:00:12,060 --> 00:00:13,660
we require seven red lines.
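
The break rule described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the names Word, choose_break, split_line and demo_words are hypothetical, the timings are invented, and "longest gap" is assumed to mean the longest pause between two consecutive words.

```python
from typing import List, NamedTuple, Optional, Tuple

# Marks that count as a natural break point in this sketch.
BREAK_PUNCTUATION = (".", ",")


class Word(NamedTuple):
    text: str
    start: float  # seconds
    end: float    # seconds


def width(words: List[Word]) -> int:
    """Character width of the words joined by single spaces."""
    return len(" ".join(w.text for w in words))


def choose_break(words: List[Word], max_width: int) -> int:
    """Return i so that words[:i] and words[i:] are the two lines."""
    n = len(words)
    if n < 2:
        raise ValueError("need at least two words to split a line")
    # Prefer break positions where both halves respect the width limit.
    candidates = [
        i for i in range(1, n)
        if width(words[:i]) <= max_width and width(words[i:]) <= max_width
    ]
    if not candidates:
        # Nothing fits in two lines; the caller would have to split again.
        candidates = list(range(1, n))
    mid = n / 2
    # 1) Break after a period or comma, choosing the one nearest the middle.
    punct = [i for i in candidates
             if words[i - 1].text.endswith(BREAK_PUNCTUATION)]
    if punct:
        return min(punct, key=lambda i: abs(i - mid))
    # 2) Otherwise break at the longest pause near the middle of the line.
    near = [i for i in candidates if abs(i - mid) <= n / 4] or candidates
    return max(near, key=lambda i: words[i].start - words[i - 1].end)


def split_line(words: List[Word], max_width: int) -> Tuple[str, str]:
    i = choose_break(words, max_width)
    return (" ".join(w.text for w in words[:i]),
            " ".join(w.text for w in words[i:]))


def demo_words(text: str, pause_before: Optional[int] = None) -> List[Word]:
    """Made-up timings: 0.25 s per word, 0.05 s gaps, one optional 0.4 s pause."""
    words, t = [], 0.0
    for index, token in enumerate(text.split()):
        t += 0.4 if index == pause_before else 0.05
        words.append(Word(token, t, t + 0.25))
        t += 0.25
    return words


# A comma is available, so the break lands right after it.
print(split_line(
    demo_words("In pursuit of these objectives, we've started a new project"),
    50))
# No usable punctuation, so the (invented) pause before "initiative" decides.
print(split_line(
    demo_words("Our company has a new strategic initiative to increase "
               "market penetration,", pause_before=6),
    50))
```

The BREAK_PUNCTUATION tuple is the single place where other marks could be added if they should also count as break points.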

I've tested that:

I'm not super familiar with Python, so this code is probably not the nicest. Any feedback is appreciated!

Purfview commented 7 months ago

Does this work with --highlight_words?

JonasCz commented 7 months ago

Yes, testing with --highlight_words True results in "karaoke style" underlined words as expected.

Purfview commented 7 months ago

Did you mean underlined and with "more natural line-wrapping"?

JonasCz commented 7 months ago

Yes, both work together, i.e. --word_timestamps True --highlight_words True --max_line_count 2 --max_line_width 50 gives underlines and natural line wraps as shown above.

Purfview commented 7 months ago

Thx, then maybe I'll borrow your PR for my repo to work with "highlight_words" as my implementation of "max_line_width/max_line_count" is not compatible with "highlight_words".

Lycoan commented 7 months ago

@JonasCz, nice extension! Does it detect sentence endings besides period, like '?', '!' and even '-' ?

Anyway, it seems that your fork fails to run when --max_line_width is not given but --word_timestamps is set to True. This can be reproduced by running the following in the base folder of the repo: whisper-ctranslate2 --model medium --language Catalan --output_format srt --word_timestamps True ./e2e-tests/gossos.mp3

It is also worth running the tests and modifying them if needed (unfortunately they fail right now): make run-tests (the following packages need to be installed first: pip install torch pyannote.audio)