ggerganov / whisper.cpp

Port of OpenAI's Whisper model in C/C++
MIT License

Incorrect timestamps #2271

Open thewh1teagle opened 4 months ago

thewh1teagle commented 4 months ago

When transcribing the following file, the timestamps are incorrect. As you can see, the start timestamp of the second segment is the same as the end timestamp of the previous one, although there is a gap of a few seconds between them in the audio.

https://github.com/ggerganov/whisper.cpp/assets/61390950/bbf9d9c4-3d60-4693-832d-e48135edf379

transcript.srt

```srt
1
00:00:00,000 --> 00:00:08,700
*music* I just wanna tell you how I'm feeling. Gotta make you understand.

2
00:00:08,700 --> 00:00:18,080
Never gonna give you up, never gonna let you down.

3
00:00:18,080 --> 00:00:25,300
Never gonna run around and...
```

transcript.json

```json
[
  { "start": 0, "stop": 870, "text": " *music* I just wanna tell you how I'm feeling. Gotta make you understand." },
  { "start": 870, "stop": 1808, "text": " Never gonna give you up, never gonna let you down." },
  { "start": 1808, "stop": 2530, "text": " Never gonna run around and..." }
]
```

word_timestamps.json

```json
[
  { "start": 0, "stop": 3, "text": "" },
  { "start": 3, "stop": 200, "text": " *music*" },
  { "start": 200, "stop": 211, "text": " I" },
  { "start": 211, "stop": 257, "text": " just" },
  { "start": 257, "stop": 314, "text": " wanna" },
  { "start": 314, "stop": 360, "text": " tell" },
  { "start": 360, "stop": 394, "text": " you" },
  { "start": 394, "stop": 428, "text": " how" },
  { "start": 428, "stop": 462, "text": " I'm" },
  { "start": 462, "stop": 576, "text": " feeling." },
  { "start": 576, "stop": 633, "text": " Gotta" },
  { "start": 633, "stop": 679, "text": " make" },
  { "start": 679, "stop": 713, "text": " you" },
  { "start": 713, "stop": 870, "text": " understand." },
  { "start": 870, "stop": 976, "text": " Never" },
  { "start": 976, "stop": 1082, "text": " gonna" },
  { "start": 1082, "stop": 1167, "text": " give" },
  { "start": 1167, "stop": 1231, "text": " you" },
  { "start": 1231, "stop": 1417, "text": " up," },
  { "start": 1417, "stop": 1421, "text": " never" },
  { "start": 1421, "stop": 1527, "text": " gonna" },
  { "start": 1527, "stop": 1591, "text": " let" },
  { "start": 1591, "stop": 1655, "text": " you" },
  { "start": 1655, "stop": 1808, "text": " down." },
  { "start": 1808, "stop": 1924, "text": " Never" },
  { "start": 1924, "stop": 2040, "text": " gonna" },
  { "start": 2040, "stop": 2109, "text": " run" },
  { "start": 2109, "stop": 2266, "text": " around" },
  { "start": 2266, "stop": 2530, "text": " and..." }
]
```

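Comparing the two files above, the JSON `start`/`stop` values appear to be in units of 10 ms (for example, `870` corresponds to `00:00:08,700` in the SRT). A minimal sketch of that conversion follows; the function name `to_srt_timestamp` is hypothetical and not part of whisper.cpp or whisper-rs.

```python
def to_srt_timestamp(ticks: int) -> str:
    """Convert a timestamp in 10 ms units to the SRT form HH:MM:SS,mmm.

    Assumption: one tick equals 10 ms, as inferred from comparing
    transcript.json with transcript.srt in this thread.
    """
    total_ms = ticks * 10
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, millis = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"


# Segment boundaries taken from transcript.json above.
for ticks in (0, 870, 1808, 2530):
    print(ticks, "->", to_srt_timestamp(ticks))
```

Applied to the segment boundaries in transcript.json, this reproduces the timestamps shown in transcript.srt.
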
SimpleVictor commented 4 months ago

@thewh1teagle How did you generate word_timestamps.json? Is there a specific parameter I need to pass?

thewh1teagle commented 4 months ago

@SimpleVictor See https://github.com/tazz4843/whisper-rs/issues/156#issuecomment-2195482588. Basically, you need to set max_len to the number of characters you want per segment and enable split_on_word, so that words are kept whole instead of being cut in the middle. Then just read the text segments as usual.
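To illustrate what those two settings do, here is a simplified model in Python. It is a sketch of the behaviour described above, not whisper.cpp's actual code: the function `wrap_tokens` and the token list are made up for the example. The assumption is that a segment is cut once it would exceed `max_len` characters, and that with `split_on_word` enabled the cut may only happen before a token that starts with a space.

```python
def wrap_tokens(tokens, max_len, split_on_word=True):
    """Group token strings into segments of roughly max_len characters.

    Simplified model (assumption, not the library's implementation):
    a cut is allowed before any token, or only before tokens that begin
    with a space when split_on_word is True.
    """
    segments = []
    current = ""
    for tok in tokens:
        at_boundary = (not split_on_word) or tok.startswith(" ")
        if current and len(current) + len(tok) > max_len and at_boundary:
            segments.append(current)
            current = ""
        current += tok
    if current:
        segments.append(current)
    return segments


# Hypothetical sub-word tokens for part of the first transcript.
tokens = [" Never", " gon", "na", " give", " you", " up", ","]

print(wrap_tokens(tokens, max_len=1, split_on_word=True))
print(wrap_tokens(tokens, max_len=1, split_on_word=False))
```

With a very small `max_len` and `split_on_word` enabled, every segment in this model is one whole word, which is how per-word entries like those in word_timestamps.json can come out of the normal segment list.
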

thewh1teagle commented 3 months ago

I found another case of strange, wrong timestamps when word timestamps are enabled.

https://github.com/user-attachments/assets/02b99bf4-c6af-409f-a878-82771768ca39

In the transcript below, search for `"start": 764` and note that the segment right after it has a smaller start timestamp.
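Entries like this can be located programmatically. The following is a minimal sketch, assuming the JSON has been loaded into a list of dicts with `start`, `stop`, and `text` keys as in this thread; `find_timestamp_anomalies` is a hypothetical helper name. It flags entries whose `stop` precedes their own `start`, and entries whose `start` is earlier than the previous entry's `start`.

```python
def find_timestamp_anomalies(entries):
    """Return (index, reason) pairs for entries with inconsistent timestamps."""
    problems = []
    prev_start = None
    for i, entry in enumerate(entries):
        if entry["stop"] < entry["start"]:
            problems.append((i, "stop before start"))
        if prev_start is not None and entry["start"] < prev_start:
            problems.append((i, "start before previous start"))
        prev_start = entry["start"]
    return problems


# Three consecutive entries copied from the transcript in this comment.
sample = [
    {"start": 474, "stop": 494, "text": " AI"},
    {"start": 764, "stop": 514, "text": " He"},
    {"start": 514, "stop": 530, "text": " was"},
]

for index, reason in find_timestamp_anomalies(sample):
    print(index, repr(sample[index]["text"]), reason)
```

On these three entries, the check reports " He" (its stop is before its start) and " was" (it starts before the previous entry does).
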

transcript.json

```json
[
  { "start": 0, "stop": 19, "text": "" },
  { "start": 19, "stop": 34, "text": " What" },
  { "start": 34, "stop": 48, "text": " do" },
  { "start": 48, "stop": 72, "text": " you" },
  { "start": 72, "stop": 112, "text": " think" },
  { "start": 112, "stop": 151, "text": " about" },
  { "start": 151, "stop": 191, "text": " like" },
  { "start": 191, "stop": 216, "text": " when" },
  { "start": 216, "stop": 248, "text": " Elon" },
  { "start": 248, "stop": 272, "text": " was" },
  { "start": 272, "stop": 336, "text": " causing" },
  { "start": 336, "stop": 384, "text": " calling" },
  { "start": 384, "stop": 408, "text": " for" },
  { "start": 408, "stop": 416, "text": " a" },
  { "start": 416, "stop": 456, "text": " pause" },
  { "start": 456, "stop": 474, "text": " on" },
  { "start": 474, "stop": 494, "text": " AI" },
  { "start": 764, "stop": 514, "text": " He" },
  { "start": 514, "stop": 530, "text": " was" },
  { "start": 530, "stop": 587, "text": " like" },
  { "start": 587, "stop": 670, "text": " starting" },
  { "start": 670, "stop": 711, "text": " then" },
  { "start": 711, "stop": 721, "text": " a" },
  { "start": 721, "stop": 762, "text": " GI" },
  { "start": 762, "stop": 815, "text": " company" },
  { "start": 815, "stop": 867, "text": " while" },
  { "start": 867, "stop": 888, "text": " he" },
  { "start": 888, "stop": 919, "text": " was" },
  { "start": 919, "stop": 971, "text": " doing" },
  { "start": 971, "stop": 1018, "text": " that" },
  { "start": 1104, "stop": 1257, "text": " Yeah," },
  { "start": 1257, "stop": 1272, "text": " so" },
  { "start": 1272, "stop": 1310, "text": " didn't" },
  { "start": 1310, "stop": 1323, "text": " he" },
  { "start": 1323, "stop": 1357, "text": " start" },
  { "start": 1357, "stop": 1367, "text": " it" },
  { "start": 1367, "stop": 1414, "text": " like" },
  { "start": 1414, "stop": 1431, "text": " after" },
  { "start": 1431, "stop": 1446, "text": " he" },
  { "start": 1446, "stop": 1464, "text": " was" },
  { "start": 1464, "stop": 1512, "text": " calling" },
  { "start": 1512, "stop": 1532, "text": " for" },
  { "start": 1532, "stop": 1553, "text": " the" },
  { "start": 1553, "stop": 1605, "text": " pause." },
  { "start": 1605, "stop": 1628, "text": " I" },
  { "start": 1694, "stop": 1658, "text": " Think" },
  { "start": 1658, "stop": 1694, "text": " before" },
  { "start": 1694, "stop": 1712, "text": " but" },
  { "start": 1712, "stop": 1718, "text": " I" },
  { "start": 1718, "stop": 1748, "text": " don't" },
  { "start": 1748, "stop": 1772, "text": " know" },
  { "start": 1772, "stop": 1784, "text": " in" },
  { "start": 1784, "stop": 1803, "text": " any" },
  { "start": 1803, "stop": 1832, "text": " cases" },
  { "start": 1832, "stop": 1850, "text": " one" },
  { "start": 1850, "stop": 1866, "text": " of" },
  { "start": 1866, "stop": 1892, "text": " those" },
  { "start": 1892, "stop": 1910, "text": " you" },
  { "start": 1910, "stop": 1940, "text": " can't" },
  { "start": 1940, "stop": 1964, "text": " beat" },
  { "start": 1964, "stop": 1981, "text": " him" },
  { "start": 1981, "stop": 2006, "text": " join" },
  { "start": 2006, "stop": 2030, "text": " them" },
  { "start": 2030, "stop": 2084, "text": " things." },
  { "start": 2084, "stop": 2108, "text": " Um," },
  { "start": 2108, "stop": 2126, "text": " I" },
  { "start": 2410, "stop": 2185, "text": " Think" },
  { "start": 2185, "stop": 2220, "text": " the" },
  { "start": 2220, "stop": 2315, "text": " instinct" },
  { "start": 2315, "stop": 2338, "text": " of" },
  { "start": 2338, "stop": 2430, "text": " saying" },
  { "start": 2430, "stop": 2461, "text": " like" },
  { "start": 2461, "stop": 2518, "text": " we've" },
  { "start": 2518, "stop": 2585, "text": " really" },
  { "start": 2585, "stop": 2620, "text": " got" },
  { "start": 2620, "stop": 2643, "text": " to" },
  { "start": 2643, "stop": 2714, "text": " figure" },
  { "start": 2714, "stop": 2756, "text": " out" },
  { "start": 2756, "stop": 2784, "text": " how" },
  { "start": 2784, "stop": 2820, "text": " to" },
  { "start": 2840, "stop": 2872, "text": " Make" },
  { "start": 2872, "stop": 2920, "text": " this" },
  { "start": 2920, "stop": 2970, "text": " safe" },
  { "start": 2970, "stop": 3008, "text": " and" },
  { "start": 3008, "stop": 3058, "text": " good" },
  { "start": 3058, "stop": 3096, "text": " and" },
  { "start": 3096, "stop": 3164, "text": " like" },
  { "start": 3164, "stop": 3222, "text": " widely" },
  { "start": 3222, "stop": 3272, "text": " good" },
  { "start": 3272, "stop": 3306, "text": " is" },
  { "start": 3306, "stop": 3454, "text": " really" },
  { "start": 3454, "stop": 3486, "text": " important" },
  { "start": 3486, "stop": 3524, "text": " but" },
  { "start": 3524, "stop": 3535, "text": " I" },
  { "start": 3535, "stop": 3606, "text": " think" },
  { "start": 3816, "stop": 3845, "text": " Calling" },
  { "start": 3845, "stop": 3977, "text": " for" },
  { "start": 3977, "stop": 4016, "text": " a" },
  { "start": 4108, "stop": 4078, "text": " Pause" },
  { "start": 4078, "stop": 4091, "text": " is" },
  { "start": 4091, "stop": 4133, "text": " like" },
  { "start": 4133, "stop": 4188, "text": " naive" },
  { "start": 4188, "stop": 4209, "text": " it" },
  { "start": 4209, "stop": 4230, "text": " at" },
  { "start": 4230, "stop": 4273, "text": " best" },
  { "start": 4273, "stop": 4337, "text": " for" },
  { "start": 4337, "stop": 4337, "text": " the" },
  { "start": 4337, "stop": 4402, "text": " latest" },
  { "start": 4402, "stop": 4446, "text": " tech" },
  { "start": 4446, "stop": 4543, "text": " insights" },
  { "start": 4543, "stop": 4586, "text": " visit" },
  { "start": 4586, "stop": 4693, "text": " em" },
  { "start": 4693, "stop": 4704, "text": " 360" },
  { "start": 4704, "stop": 4750, "text": " tech" },
  { "start": 4750, "stop": 4800, "text": " calm" },
  { "start": 4800, "stop": 4800, "text": "" },
  { "start": 4800, "stop": 4843, "text": " visit" },
  { "start": 4843, "stop": 5054, "text": " EM360tech.com." },
  { "start": 5054, "stop": 5054, "text": "" },
  { "start": 5054, "stop": 6054, "text": " [BLANK_AUDIO]" }
]
```

@ggerganov

Is there a way we can 'tell' Whisper the segments instead of letting it do the segmentation itself? I'm trying to add diarization, but the timestamps from whisper.cpp are currently not entirely accurate. I already have accurate segmentation, but I'm not sure whether it would be efficient to run Whisper on each segment (speech turn) separately: many of them will be shorter than 30 s, which might make the whole transcription slower.

The diarization itself is actually pretty simple, and once I find an approach to use it along with whisper.cpp, I can add it to whisper.cpp or implement it in Rust.

https://github.com/thewh1teagle/ort-diarize/blob/main/main.py
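One common way to combine the two outputs without re-running Whisper per segment is to label each transcribed word with the diarization turn it overlaps most. The sketch below is only an illustration under stated assumptions: `assign_speakers` is a hypothetical helper, the turn boundaries and speaker names are invented for the example, and both inputs are assumed to use the same time unit. Its result is only as good as the word timestamps, so the inaccuracies reported above would still affect words near turn boundaries.

```python
def assign_speakers(words, turns):
    """Label each word with the speaker whose turn overlaps it the most.

    words: list of (start, stop, text)
    turns: list of (start, stop, speaker), e.g. from a diarization model
    Words that overlap no turn get the speaker None.
    """
    labeled = []
    for w_start, w_stop, text in words:
        best_speaker, best_overlap = None, 0
        for t_start, t_stop, speaker in turns:
            overlap = min(w_stop, t_stop) - max(w_start, t_start)
            if overlap > best_overlap:
                best_speaker, best_overlap = speaker, overlap
        labeled.append((best_speaker, text))
    return labeled


# Word entries copied from the second transcript in this thread.
words = [(19, 34, " What"), (971, 1018, " that"), (1104, 1257, " Yeah,")]
# Hypothetical diarization turns in the same units (assumption).
turns = [(0, 1050, "SPEAKER_A"), (1050, 2100, "SPEAKER_B")]

for speaker, text in assign_speakers(words, turns):
    print(speaker, text)
```
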

majisama commented 6 days ago

What if I want to use it through JNI? That is, what if I want to use it on Android?