ggerganov / whisper.cpp

Port of OpenAI's Whisper model in C/C++
MIT License

Wrong and Repeating Result With a Fixed Interval of 5s #1171

Open Sponge-bink opened 1 year ago

Sponge-bink commented 1 year ago

I am using macOS 12.6.8, and today I downloaded and built whisper.cpp with the WHISPER_COREML=1 option. When I tested it on an audio file, it produced incorrect, repetitive results. Within the first 3 minutes it accurately transcribed almost all of the text, including lyrics from music. However, once the music stopped, it could no longer transcribe what followed; instead it repeatedly reproduced one line of lyrics it had recognized earlier, at a fixed interval of 5 seconds.

[00:03:07.000 --> 00:03:09.000]  (♪ 欲しいこと)
[00:03:09.000 --> 00:03:13.000]  (♪ この想いの名前を教えて)
[00:03:13.000 --> 00:03:21.000]  (♪ その気持ちの名前を教えて)
[00:03:35.000 --> 00:03:40.000]  (♪ 君がくれるSOSは)
[00:03:40.000 --> 00:03:45.000]  (♪ 君がくれるSOSは)
[00:03:45.000 --> 00:03:50.000]  (♪ 君がくれるSOSは)
[00:03:50.000 --> 00:03:55.000]  (♪ 君がくれるSOSは)
[00:03:55.000 --> 00:04:00.000]  (♪ 君がくれるSOSは)
[00:04:00.000 --> 00:04:05.000]  (♪ 君がくれるSOSは)
[00:04:05.000 --> 00:04:10.000]  (♪ 君がくれるSOSは)
[00:04:10.000 --> 00:04:15.000]  (♪ 君がくれるSOSは)
[00:04:15.000 --> 00:04:20.000]  (♪ 君がくれるSOSは)
[00:04:20.000 --> 00:04:25.000]  (♪ 君がくれるSOSは)
[00:04:25.000 --> 00:04:30.000]  (♪ 君がくれるSOSは)
[00:04:30.000 --> 00:04:35.000]  (♪ 君がくれるSOSは)
[00:04:35.000 --> 00:04:40.000]  (♪ 君がくれるSOSは)
[00:04:40.000 --> 00:04:45.000]  (♪ 君がくれるSOSは)

But when I trimmed the audio to exclude the music, it correctly transcribed the parts it had gotten wrong the first time.

Has anyone seen this problem?
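Not confirmed as a fix for this particular file, but whisper.cpp's `main` example exposes decoding parameters that are commonly suggested for repetition loops. A hedged sketch; flag names are from the `main` example, so check `./main --help` on your build, since options and defaults vary across versions:

```shell
# Sketch only, not a verified fix for this issue.
# --max-context 0 stops feeding previously decoded text back into the
#   decoder, a common trigger for repeat loops on music/silence.
# --entropy-thold controls when the decoder falls back after producing
#   degenerate (highly repetitive, low-entropy) output.
./main -m models/ggml-medium.bin -f audio.wav --max-context 0 --entropy-thold 2.8
```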

mrfragger commented 1 year ago

Yes, with music and silence it sometimes hallucinates or repeats segments. I always run a script I call stucksubs on all VTT subs and correct any stuck time segments. Then I run another script I call repeatsubs to detect any repeating segments. There's not much I can do about repeating segments other than running the transcription again, but when only the time segments repeat I can fix those manually.
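The repeat-detection idea can be sketched in a few lines. This is a hypothetical stand-in, not mrfragger's actual repeatsubs script: it flags runs of consecutive VTT cues whose text is identical, which is the typical signature of the repetition loop shown above.

```python
import re

# Matches one VTT cue: a timestamp line followed by its text.
CUE_RE = re.compile(
    r"(\d{2}:\d{2}:\d{2}\.\d{3}) --> (\d{2}:\d{2}:\d{2}\.\d{3})\n(.+?)(?:\n\n|\Z)",
    re.S,
)

def find_repeats(vtt_text, min_run=3):
    """Return (start_index, run_length, text) for each run of identical cues."""
    cues = [(m.group(1), m.group(2), m.group(3).strip())
            for m in CUE_RE.finditer(vtt_text)]
    repeats = []
    i = 0
    while i < len(cues):
        j = i
        # Extend the run while the next cue has the same text.
        while j + 1 < len(cues) and cues[j + 1][2] == cues[i][2]:
            j += 1
        if j - i + 1 >= min_run:
            repeats.append((i, j - i + 1, cues[i][2]))
        i = j + 1
    return repeats
```

On the transcript above, the run of identical "君がくれるSOSは" cues would be reported as a single repeat block, which can then be checked against the audio.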

Before I started using the Core ML medium.en model, I believe I'd get about 3x real-time and could transcribe audio segments of up to 34 hours. With Core ML medium.en I get 4-6x, which is great, but every once in a while I get massive repeats. So now I just use the large model to lessen the chance of repeating subs; it seems better, and I get 1.7x with Core ML while keeping audio segments at a maximum of 25 hours (maybe this limit is because my Mac M1 only has 8 GB of RAM, who knows). So if I have, say, a 99-hour audiobook, I split it into five 20-hour segments and then merge the subtitles into one VTT with a script.
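The merge step can be sketched as follows. This is a hypothetical stand-in, not the commenter's actual merge script: it concatenates the cues of several VTT files into one, shifting each file's timestamps by the total duration of the files before it (the per-part durations must be supplied).

```python
import re

TS_RE = re.compile(r"(\d{2}):(\d{2}):(\d{2})\.(\d{3})")

def shift_ts(match, offset_ms):
    """Rewrite one HH:MM:SS.mmm timestamp, shifted forward by offset_ms."""
    h, m, s, ms = (int(g) for g in match.groups())
    total = ((h * 60 + m) * 60 + s) * 1000 + ms + offset_ms
    ms = total % 1000; total //= 1000
    s = total % 60; total //= 60
    m = total % 60; h = total // 60
    return f"{h:02d}:{m:02d}:{s:02d}.{ms:03d}"

def merge_vtts(vtt_texts, durations_ms):
    """vtt_texts: VTT bodies in order; durations_ms: audio length of each part."""
    out = ["WEBVTT"]
    offset = 0
    for text, dur in zip(vtt_texts, durations_ms):
        body = text.split("WEBVTT", 1)[-1].strip()
        out.append(TS_RE.sub(lambda m: shift_ts(m, offset), body))
        offset += dur
    return "\n\n".join(out) + "\n"
```

For the 99-hour audiobook case above, each 20-hour part's cues would be shifted by the combined length of the parts preceding it, so the merged VTT lines up with the original unsplit audio.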

When you first run the Core ML large model, it takes 11 hours before it even starts processing the audio file.