Trouble with large-v3 Model in Real-Time Speech-to-Text

ggerganov / whisper.cpp

Port of OpenAI's Whisper model in C/C++

MIT License


Trouble with large-v3 Model in Real-Time Speech-to-Text #1497

Open guillegarciam00 opened 9 months ago

guillegarciam00 commented 9 months ago

I'm experiencing issues with real-time transcription using larger models, such as large-v3. Smaller models work well, but the larger ones produce inaccurate results, often emitting placeholders like [silence] instead of the spoken words. This happens even on an Nvidia 4090 GPU, which suggests it's not a performance-related problem.

Environment:

OS: Ubuntu 22.04
GPU: Nvidia 4090

Has anyone encountered a similar problem with real-time transcription using large models? Any insights or solutions would be greatly appreciated.

Alumniminium commented 9 months ago

large-v3 is absolute sh*t, try v2, works fine for me

bobqianic commented 9 months ago

> Experiencing issues with real-time transcription using larger models, such as large-v3.

The Large-v3 model faces significant issues. Numerous reports indicate it exhibits a higher Word Error Rate (WER) and increased instances of hallucinations.
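For anyone unfamiliar with the metric, here is a minimal sketch of how WER is commonly computed: the word-level edit distance (substitutions + deletions + insertions) divided by the number of words in the reference transcript. The function name `word_error_rate` and the example sentences are made up for illustration; they are not whisper.cpp output, and real evaluations usually normalize punctuation and casing more carefully than this.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words processed
    # so far (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: a placeholder token replaces one word, another is dropped.
reference = "please turn on the kitchen lights"
hypothesis = "[silence] turn on the lights"
print(f"WER: {word_error_rate(reference, hypothesis):.2f}")  # prints "WER: 0.33"
```

In this made-up example, one substitution (`please` → `[silence]`) and one deletion (`kitchen`) against six reference words give 2/6. Note that hallucinated text counts as insertions, so WER can exceed 1.0.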

> While smaller models work effectively

Only the large models are consistently updated, while the smaller versions like tiny, base, small, and medium remain as the original v1. OpenAI has not released v2 or v3 for these smaller models.

pannous commented 9 months ago

> issues with larger models, such as large-v3 ...

> Only the large models are consistently updated

How is that not a contradiction?

bobqianic commented 9 months ago

> How is that not a contradiction?

The more it updates, the worse it gets. LOL. Honestly, V2 is better than V1, but V3 was trained on about 4 million hours of pseudo-labeled audio (labels generated by large-v2), which is probably why the performance has deteriorated.