ggerganov / whisper.cpp

Port of OpenAI's Whisper model in C/C++
MIT License

How to solve the problem of hallucinations #2040

Open dfengpo opened 6 months ago

pprobst commented 6 months ago

Disabling timestamps helps a lot in my experience (#1724). You can also trim the silence at the end of the audio before starting the transcription, or use some form of voice activity detection (VAD) if you're streaming audio.
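To illustrate the "cut the silence at the end" suggestion, here is a minimal sketch in plain Python. It is not part of whisper.cpp. It assumes the audio is already decoded to mono float samples in [-1, 1] at 16 kHz, and it uses a simple RMS energy threshold rather than a real VAD. The function name, the threshold, and the padding values are all hypothetical and would need tuning for real recordings.

```python
import math


def frame_rms(frame):
    """Root-mean-square level of one frame of float samples."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def trim_trailing_silence(samples, sample_rate=16000, frame_ms=30,
                          threshold=0.01, keep_ms=200):
    """Drop low-energy audio after the last frame that looks voiced.

    samples   : list of floats in [-1, 1] (mono)
    threshold : RMS level below which a frame counts as silence (assumed value)
    keep_ms   : padding kept after the last voiced frame so words are not clipped
    """
    frame_len = max(1, sample_rate * frame_ms // 1000)
    last_voiced_end = 0
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        if frame_rms(frame) >= threshold:
            last_voiced_end = start + len(frame)
    if last_voiced_end == 0:
        # Nothing above the threshold: treat the whole clip as silence.
        return samples[:0]
    pad = sample_rate * keep_ms // 1000
    return samples[:min(len(samples), last_voiced_end + pad)]
```

The trimmed samples would then be written back to a 16 kHz WAV file or passed to the transcriber directly. A fixed energy threshold will misfire on noisy recordings, which is where a proper VAD becomes the better choice.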

bradmurray-dt commented 6 months ago

Additionally, avoid large-v3. If a smaller model works well for the language you are using, try that instead.

r0d0dendr0n commented 6 months ago

@bradmurray-dt can you please elaborate on why to avoid large-v3 in the context of avoiding hallucinations?

pprobst commented 6 months ago

> @bradmurray-dt can you please elaborate on why to avoid large-v3 in the context of avoiding hallucinations?

While I have not tested v3 myself, several people have reported hallucinations with it. Here's an article by Deepgram describing the problem.

bradmurray-dt commented 6 months ago

> @bradmurray-dt can you please elaborate on why to avoid large-v3 in the context of avoiding hallucinations?

I have run quite a few tests and noticed significantly more hallucinations with large-v3 than with other models. Even apart from v3, on dirty audio I see more hallucinations with medium than with small, and more with large than with medium. Others (including Deepgram) have come to similar conclusions. We pre-process audio with a combination of a VAD and a classifier to filter out most non-speech audio. This has greatly reduced both hallucinations and randomly missing pieces of transcripts.
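For readers who want to try the pre-filtering idea, here is a rough sketch of the general shape in plain Python. It is not the pipeline described above: a simple energy gate stands in for the VAD, and a minimum-duration rule stands in for the classifier. All names and thresholds are hypothetical, and the input is assumed to be mono float samples in [-1, 1] at 16 kHz.

```python
import math


def speech_regions(samples, sample_rate=16000, frame_ms=30,
                   threshold=0.01, min_speech_ms=90, hangover_ms=300):
    """Return (start, end) sample ranges that look like speech.

    Frames whose RMS is below `threshold` are treated as non-speech.
    Voiced frames separated by at most `hangover_ms` are merged into one
    region, and regions shorter than `min_speech_ms` (clicks, pops) are dropped.
    """
    frame_len = max(1, sample_rate * frame_ms // 1000)
    hangover = sample_rate * hangover_ms // 1000
    min_len = sample_rate * min_speech_ms // 1000
    regions = []
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        rms = math.sqrt(sum(s * s for s in frame) / len(frame))
        if rms < threshold:
            continue
        end = start + len(frame)
        if regions and start - regions[-1][1] <= hangover:
            regions[-1][1] = end            # close enough: extend the last region
        else:
            regions.append([start, end])    # otherwise open a new region
    return [(s, e) for s, e in regions if e - s >= min_len]


def keep_speech_only(samples, **kwargs):
    """Concatenate only the detected speech regions."""
    kept = []
    for s, e in speech_regions(samples, **kwargs):
        kept.extend(samples[s:e])
    return kept
```

Only the output of `keep_speech_only` (or each region separately) would be sent to the model, so long stretches of silence or noise never reach the decoder. Concatenating regions shifts timing, so keep the `(start, end)` offsets if you need to map the transcript back to the original audio.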