ggerganov / whisper.cpp

Port of OpenAI's Whisper model in C/C++
MIT License
35.95k stars 3.67k forks source link

SRT output erroneously has speech where quiet should be... #1108

Open janngobble opened 1 year ago

janngobble commented 1 year ago

When using SRT as output, the following happens (every single time) when there is speaking, then pause (or undetected music) and speaking again:

sentence sentence sentence (here's where the speaking pause happens - and nothing should be here) sentence (at the no-voice section...but the end timestamp of this speech line is at the end of this sentence.

ie: all speaking before music/quiet is at correct time stamp...but where there is music (undetected) and/or a quiet pause, it puts the next speech line...where the quiet pause starts.

Should be: 00:00 speech 1 00:10 speech 2 00:20 speech 3 00:30 QUIET for 30 seconds 01:00 speech 4 01:10 speech 5

instead I get 00:00 speech 1 00:10 speech 2 00:20 speech 3 00:30 speech 4 01:10 speech 5

ie: it pushes the speech at the quiet part, then holds it onscreen til that portion of speech is spoken and ends.

I can find a true example if you wish, but all sources I try do the same thing. (it's been this way for months, btw).

mrfragger commented 1 year ago

that's cuz it always sets the beginning timestamp of the next segment as the same as the ending timestamp of the previous segment..and NOT the actual time of start of speaking.

janngobble commented 1 year ago

that's cuz it always sets the beginning timestamp of the next segment as the same as the ending timestamp of the previous segment..and NOT the actual time of start of speaking.

That would be a bug in the output of SRT, then, right? Or a bug in the segmentation not providing actual timestamp of speech.

OnlyWick commented 1 year ago

Same issue, and there will be problems with its sentence segmentation.

image

I would like it to divide a sentence based on pauses, or divide a sentence based on commas or periods.

xianxingg commented 1 year ago

the reason lies in whisper_full_with_state. you may alter t0 = t1; to t0 = seek + 2*(tokens_cur[i].tid - whisper_token_beg(ctx));

OnlyWick commented 1 year ago

@xianxingg I did what you said. This doesn't work. Can you fork this repository and modify it, and put your repository link here.

xianxingg commented 1 year ago

@OnlyWick sorry but the solution is only for inaccurate start timestamp ( subtitles show ahead of time ). Your case seems to be more complicated.

For example: Master Branch (there be no subtitle in the red part, but previous ending timestamp is simply set as next beginning one): before

Modified (subtitles show as the speaker speaks): after

The MP4 source: https://github.com/ggerganov/whisper.cpp/assets/11285464/a22374af-ec6a-4b23-a1b5-f72b4a5bcd2f

janngobble commented 1 year ago

Can we please get this committed to main somehow, @ggerganov?

I would love to use it but don’t want it overwritten with my next pull...

On Aug 5, 2023, at 12:46 PM, xx @.***> wrote:

the reason lies in whisper_full_with_state. you may alter t0 = t1; to t0 = seek + 2(tokens_cur[i].tid - whisper_token_beg(ctx)); if (params.speed_up) t0 = 2;

— Reply to this email directly, view it on GitHub https://github.com/ggerganov/whisper.cpp/issues/1108#issuecomment-1666551668, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACB6NBGLNRKVWL3HAIDLJJLXTZ2GTANCNFSM6AAAAAA2LPS6FA. You are receiving this because you authored the thread.

janngobble commented 1 year ago

At about line 4705 in whisper.cpp?

Sorry, but just want to confirm…

TIA Jann

On Aug 5, 2023, at 12:46 PM, xx @.***> wrote:

the reason lies in whisper_full_with_state. you may alter t0 = t1; to t0 = seek + 2(tokens_cur[i].tid - whisper_token_beg(ctx)); if (params.speed_up) t0 = 2;

— Reply to this email directly, view it on GitHub https://github.com/ggerganov/whisper.cpp/issues/1108#issuecomment-1666551668, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACB6NBGLNRKVWL3HAIDLJJLXTZ2GTANCNFSM6AAAAAA2LPS6FA. You are receiving this because you authored the thread.

OnlyWick commented 1 year ago

@xianxingg Wow, looks good. Maybe I modified it wrong!?

Begin from "whisper.cpp" file 4713 line.

text = "";
while (i < (int) tokens_cur.size() && tokens_cur[i].id > whisper_token_beg(ctx)) {
    i++;
}
i--;
t0 = t1;  // Is it this line? I modified this line!
i0 = i + 1;
speaker_turn_next = false;

In addition, I care more about its sentence-breaking function. I'm curious about it can identify words accurately, but it cann't sentence-breaking accurately. :-(

xianxingg commented 1 year ago

Can we please get this committed to main somehow, @ggerganov? I would love to use it but don’t want it overwritten with my next pull... On Aug 5, 2023, at 12:46 PM, xx @.**> wrote: the reason lies in whisper_full_with_state. you may alter t0 = t1; to t0 = seek + 2(tokens_cur[i].tid - whisper_token_beg(ctx)); if (params.speed_up) t0 *= 2; — Reply to this email directly, view it on GitHub <#1108 (comment)>, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACB6NBGLNRKVWL3HAIDLJJLXTZ2GTANCNFSM6AAAAAA2LPS6FA. You are receiving this because you authored the thread.

It's a temporary change and may have some side effects, eg Karaoke-style movie generation. For srt, subtitles should show as speaker speaks (no spoiler), but for karaoke, subtitles should show earlier if possible (as background).

janngobble commented 1 year ago

@xianxingg said:

It's a temporary change and may have some side effects, eg Karaoke-style movie generation. For srt, subtitles should show as speaker speaks (no spoiler), but for karaoke, subtitles should show earlier if possible (as background).

Well, as of late, my husband has been spoiled by "and the murderer is..." 60 seconds before the picture is showing them murder someone! So, I'm happy to use this for now. As I don't use Karaoke-style movie generation, and mostly use for SRTs and for podcast audio transcripts, I'm happy with how it is generating with this fix!

Maybe the equivalent of "if type eq 'srt' then this change else the other: t0=t1;

Just wondering...

janngobble commented 1 year ago

In addition, I care more about its sentence-breaking function. I'm curious about it can identify words accurately, but it cann't sentence-breaking accurately. :-(

@OnlyWick:

Re: Sentence-breaking… When using SRT format, I’ve found medium does not sentence break anywhere near as well as large does. (in the CORE_ML version anyway). Large also does speaker-ID’ing better. the >> in front of new speakers works in large, but is not present in medium output. And the sentences are almost all broken correctly - except for awkward pauses, etc.

Anyone else notice this? I thought it was a feature of spending more time analyzing the speech. I don’t know.

gkovacsp commented 1 month ago

What is the correct way to get this - don't show early - fix?