ggerganov / whisper.cpp

Port of OpenAI's Whisper model in C/C++

Data on pervasive hallucination problem with TV episode transcription #1511

Open senorfunes opened 11 months ago

senorfunes commented 11 months ago

Thanks to those who put Whisper.cpp together. I’m writing to share data on a hallucination problem that is pervasive in my use case, which is attempting to generate transcripts for episodes of TV drama in Turkish. I’ll be grateful if anyone has any suggestions for me now, but I also hope this can be useful for improving the project in the long-term.

System: 2023 MacBook Pro, M2 Max, 96 GB RAM, macOS 13.6. Running Whisper.cpp in line with the instructions here: https://medium.com/gimz/how-to-install-whisper-on-mac-openais-speech-to-text-recognition-system-1f6709db6010
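For reference, the setup in that guide boils down to roughly the following (paraphrasing from memory; the guide itself follows the project README):

```sh
# clone and build whisper.cpp (the standard build is enough on Apple Silicon)
git clone https://github.com/ggerganov/whisper.cpp
cd whisper.cpp
make
```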

After noticing the hallucination problem, I "made" the large-v2 model in line with the suggestion here: https://github.com/ggerganov/whisper.cpp/discussions/1490. This may have improved things, but the issue is still serious. NOTE: I only "made" the large-v2 model once. If that step has to be repeated before each run, then the data below would be for the default large model.
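If I have understood discussion #1490 correctly, "making" large-v2 is just a one-time fetch of the weights, something like:

```sh
# one-time download of the v2 weights into ./models/ggml-large-v2.bin
bash ./models/download-ggml-model.sh large-v2
```

so the file should persist between runs, and which model a given run actually used depends only on the path passed to -m.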

Further notes: I also installed the baseline version of Whisper, following the instructions at https://github.com/openai/whisper. It appears to be present and working insofar as it installed without error and I can bring up the help menu, but I have hit numerous errors whenever I actually try to transcribe with it. I am not a coder and do not know how to approach the error messages I've received. I also ran pip install openai-whisper==20230308, in line with the suggestions here: https://github.com/openai/whisper/discussions/1059. I assume that command affects the baseline version of Whisper and not Whisper.cpp, but I am not certain of that. Based on the forums and solutions linked below, I imagine that changing the "on previous text" condition might be helpful (sketched below), as could cleaning the non-dialog audio out of the input. I'm unclear on whether either of these is possible within Whisper.cpp:

https://github.com/openai/whisper/discussions/1059
https://github.com/openai/whisper/discussions/679
https://github.com/fleek/VADtransciber
https://github.com/EtienneAb3d/WhisperHallu
https://github.com/EtienneAb3d/WhisperTimeSync
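In case it helps, my understanding is that the baseline (Python) Whisper command-line tool exposes that condition directly, along these lines (untested on my end, since my baseline install is misbehaving):

```sh
# OpenAI's reference Whisper CLI: turn off conditioning on previously decoded text,
# which is the setting most often blamed for runaway repetition during silences
whisper episode.wav --model large-v2 --language tr \
    --condition_on_previous_text False --output_format srt
```

Here episode.wav is just a placeholder for one of my episode files.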

As far as I can tell, hallucinations are sometimes triggered by a lack of dialog, but this may not always be the case. They appear to come in at least two types: (1) "junk" lines that the model invents for silences, in keeping with what it saw in its training data (e.g. the Turkish equivalents of "subtitled by x", "thank you for watching", etc.); (2) dialog lines that simply get stuck on repeat. Both types of hallucination tend to continue for minutes on end. I only include data below for hallucinations that either (1) generate more than 3 lines of time-coded entries or (2) last for more than 30 seconds. I label them "H1" (hallucination type 1) or "H2" (hallucination type 2) in line with the above distinction, and use "T" to designate plausible transcription.

Data on hallucinations from four episodes (49-64 minutes each) of a single TV show, transcribed in Whisper.cpp with the following command: ./main -m models/ggml-large.bin -l tr -osrt -f filepath
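For reference, my understanding of the flags in that command:

```sh
# -m    : path to the ggml model file to load
# -l    : spoken language of the audio (tr = Turkish)
# -osrt : also write the result as an .srt subtitle file
# -f    : input audio (whisper.cpp expects 16 kHz WAV)
./main -m models/ggml-large.bin -l tr -osrt -f filepath
```

(filepath is a placeholder for the episode's WAV file.)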

E(pisode) 1 Total time: 01:03:46
H1: 00:00:00 - 00:01:29
T: 00:01:29 - 00:02:29
H2: 00:02:29 - 00:02:30 (14 lines of timecode)
T: 00:02:30 - 00:06:28
H1: 00:06:28 - 00:06:58 (1 line of timecode)
H1: 00:06:58 - 00:14:27 (different message, 15 lines of timecode)
H1: 00:14:27 - 00:43:07 (different message, many pages of timecode)
H2: 00:43:07 - 00:52:31
T: 00:52:31 - 00:53:41
H2: 00:53:41 - 00:53:49
H1: 00:53:49 - 00:54:18
H1: 00:54:18 - 00:54:48
T: 00:54:48 - 00:59:21
H2: 00:59:21 - 00:59:28
H1: 00:59:28 - 00:59:58
T: 00:59:58 - 01:02:52
H2: 01:02:52 - 01:03:42
H1: 01:03:42 - 01:03:46

E2 Total time: 00:57:36
H1: 00:00:00 - 00:00:29
T: 00:00:30 - 00:00:59
H2: 00:00:59 - 00:00:59 (6 lines of timecode)
T: 00:00:59 - 00:02:52
H2: 00:02:59 - 00:03:00 (8 lines of timecode)
H1: 00:03:00 - 00:04:00
T: 00:04:00 - 00:05:56
H1: 00:06:59 - 00:17:10
T: 00:17:10 - 00:17:52
H2: 00:17:52 - 00:19:50
T: 00:19:50 - 00:29:08
H2: 00:29:08 - 00:57:36

E3 Total time: 00:50:22
H1: 00:00:00 - 00:00:29
T: 00:00:29 - 00:03:27
H2: 00:03:27 - 00:03:27 (4 lines of timecode)
T: 00:03:27 - 00:04:42
H1: 00:04:42 - 00:12:42
T: 00:12:42 - 00:15:35
H2: 00:15:35 - 00:16:13
T: 00:16:13 - 00:18:37
H2: 00:18:37 - 00:18:45
T: 00:18:45 - 00:18:47
H2: 00:18:47 - 00:18:59
H2: 00:18:59 - 00:50:22 (different line than previous H2)

E4 Total time: 00:49:29
H1: 00:00:00 - 00:08:07
T: 00:08:07 - 00:12:37
H2: 00:12:37 - 00:17:58
T: 00:17:58 - 00:19:48
H2: 00:19:55 - 00:27:51
T: 00:27:51 - 00:27:53
T/H2: 00:27:53 - 00:29:39 (Most lines repeated 2-3 times, but dialog continues)
H2: 00:29:39 - 00:30:29
T: 00:30:29 - 00:31:01
H2: 00:31:01 - 00:31:13
T/H2: 00:31:13 - 00:31:23 (Most lines repeated 2-3 times, but dialog continues)
H2: 00:31:23 - 00:32:11
T/H2: 00:32:11 - 00:32:33
H2: 00:32:33 - 00:49:29

With my thanks and best wishes, -Josh

ggerganov commented 11 months ago

If you add -mc 0 to your command, do the results improve?
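For context: -mc is the short form of --max-context, the maximum number of text tokens from the previous segment that are kept as context for the next one. Setting it to 0 stops the decoder from being conditioned on its own earlier output, which is what usually feeds the repetition loops; it is roughly the whisper.cpp counterpart of condition_on_previous_text=False in the Python version. For example:

```sh
./main -m models/ggml-large.bin -l tr -mc 0 -osrt -f filepath
```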

senorfunes commented 11 months ago

Thanks for your reply! I ran the first episode above again with the following two variations on the command:

./main -m models/ggml-large.bin -l tr -mc 0 -osrt -f filepath
./main -mc 0 -m models/ggml-large.bin -l tr -osrt -f filepath

Both resulted in exactly the same output as the first trial. I had expected it to regenerate a response, perhaps with similar but varied hallucinations and transcriptions, but instead the output appears to be line-for-line the same. I welcome any further suggestions, and am happy to share terminal output records if those are of interest.

ggerganov commented 11 months ago

I assume you used large v2, correct? From the commands you posted, it is not clear since it says ggml-large.bin. Make sure to use v2.

senorfunes commented 11 months ago

Thanks again! I thought I was using large-v2 because I had "made" it yesterday, but you're right: the command I was using was likely just calling up the default large (presumably v3) model. The following command improved things considerably:

whisper.cpp-master % ./main -m ./models/ggml-large-v2.bin -l tr -mc 0 -osrt -f filepath

E1 further trial: Total time: 01:03:46
H1: 00:00:00 - 00:00:09
H1: 00:00:09 - 00:05:29
T: 00:05:29 - 00:05:49
H1: 00:05:49 - 00:06:21
T: 00:06:21 - 00:06:29
H1: 00:06:29 - 00:09:51
T: 00:09:51 - 00:09:53
H1: 00:09:53 - 00:10:03
T: 00:10:03 - 00:10:05
H1: 00:10:05 - 00:10:09
T: 00:10:09 - 00:10:31
H1: 00:10:31 - 00:26:03 *
T: 00:26:03 - 00:30:21
H1: 00:30:21 - 00:33:45
T: 00:33:45 - 00:33:47
H1: 00:33:47 - 00:37:03
T: 00:37:03 - 00:37:13
H1: 00:37:13 - 00:37:35
T: 00:37:35 - 00:38:01
H1: 00:38:01 - 00:38:41
T: 00:38:41 - 00:47:37
H1: 00:47:37 - 00:48:19
T: 00:48:19 - 00:49:28
H1: 00:49:28 - 00:51:07
T: 00:51:07 - 01:03:47

There are clearly still problem points, most notably the 16-minute section that I've highlighted with the asterisk, but this is definitely better than before.

Thanks again. I welcome any further suggestions, and will be happy to share data or try new tweaks if that would be of assistance to the project.

Best wishes,

Josh

senorfunes commented 11 months ago

A follow-up: the order of the arguments in the above command was preventing -mc 0 (and, for that matter, -osrt) from taking effect. I got much improved (comparatively excellent) output with the following:

whisper.cpp-master % ./main -osrt -mc 0 -l tr -m ./models/ggml-large-v2.bin -f filepath

Thanks again!