dailydaniel opened 2 weeks ago
Added logging in stt.py and got repeated words in transcribe_file:
segments, transcription_info = whisper.transcribe(
    file.file,
    task=Task.TRANSCRIBE,
    language=language,
    initial_prompt=prompt,
    word_timestamps="word" in timestamp_granularities,
    temperature=temperature,
    vad_filter=vad_filter,
    hotwords=hotwords,
)
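If it helps anyone debugging this, here is a small stdlib-only helper I'd use to find where the looping starts in the logged output (my own sketch, not part of stt.py; assumes you join the `segment.text` values you collect while iterating):

```python
def find_repetition(texts, window=4, min_repeats=5):
    """Return the word index where the same `window`-word phrase starts
    repeating at least `min_repeats` times in a row, or None if the
    output never degenerates into a loop."""
    words = " ".join(texts).split()
    for i in range(len(words) - window * min_repeats + 1):
        phrase = words[i:i + window]
        # check that the next (min_repeats - 1) windows are identical
        if all(words[i + k * window:i + (k + 1) * window] == phrase
               for k in range(1, min_repeats)):
            return i
    return None
```

Running it over the concatenated segment texts gives the position of the first looped phrase, which makes it easy to line up with the segment timestamps in the logs.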
@fedirz any thoughts or updates about this issue?
Same issue here -- tried to transcribe an hour-long lecture and got about 4 words repeated a few hundred times until the end of the file.
This is likely an issue in faster-whisper / the models themselves. I can look further into this if someone provides an English audio sample that a medium or large model hallucinates on. https://github.com/openai/whisper/discussions/679
Does setting ?vad_filter=true help?
@fedirz I don’t think it’s faster-whisper, because I could not reproduce this issue with the faster-whisper framework alone. But while debugging faster-whisper-server I found that the issue comes from the whisper.transcribe(… call in stt.py,
so the issue may come from faster-whisper, but it’s not a model problem; maybe the file is somehow broken or split incorrectly.
Is there some kind of silence of more than 30s around the location of the repeated words?
No, I’ve tested it with different files
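For anyone else who wants to rule out the long-silence theory on their own files, this is a stdlib-only sketch that scans a 16-bit mono WAV for stretches where the peak amplitude stays below a threshold (the function name and thresholds are my own, nothing from faster-whisper):

```python
import wave
from array import array

def find_silences(path, threshold=500, min_seconds=30.0):
    """Yield (start_sec, end_sec) spans whose peak amplitude stays
    below `threshold` for at least `min_seconds`."""
    with wave.open(path, "rb") as w:
        rate = w.getframerate()
        pcm = array("h")
        pcm.frombytes(w.readframes(w.getnframes()))
    chunk = rate // 10                      # scan in 100 ms windows
    run_start = None
    for i in range(0, len(pcm), chunk):
        quiet = max((abs(s) for s in pcm[i:i + chunk]), default=0) < threshold
        if quiet and run_start is None:
            run_start = i                   # a quiet run begins
        elif not quiet and run_start is not None:
            if (i - run_start) / rate >= min_seconds:
                yield run_start / rate, i / rate
            run_start = None
    # handle a quiet run that extends to the end of the file
    if run_start is not None and (len(pcm) - run_start) / rate >= min_seconds:
        yield run_start / rate, len(pcm) / rate
```

If this prints a span of 30 s or more near where the repetition starts, that would support the silence theory; if not, it points elsewhere.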
The same issue occurs with several languages and on any model. I tried Russian and English; it is most often seen on files of 5 minutes and longer.
What temperature have you used? Have you tried lowering it?
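For context: Whisper's reference decoder retries at increasing temperatures (0.0, 0.2, ... 1.0) whenever the output looks degenerate, using a gzip compression-ratio threshold of 2.4, since highly repetitive text compresses extremely well. A simplified stdlib-only sketch of that idea (NOT the actual faster-whisper internals):

```python
import zlib

def looks_degenerate(text: str, max_ratio: float = 2.4) -> bool:
    """Crude stand-in for Whisper's compression-ratio check: looping
    output compresses far better than normal speech."""
    data = text.encode("utf-8")
    if not data:
        return False
    return len(data) / len(zlib.compress(data)) > max_ratio

def decode_with_fallback(decode, temperatures=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0)):
    """`decode` is any callable temperature -> text (a stand-in for one
    decoding pass). Return the first (temperature, text) that passes."""
    text = ""
    for t in temperatures:
        text = decode(t)
        if not looks_degenerate(text):
            return t, text
    return temperatures[-1], text           # give up, keep last attempt
```

If I remember the signature correctly, faster-whisper's transcribe() also accepts a list of temperatures to enable this fallback, so passing e.g. temperature=[0.0, 0.2, 0.4] instead of a single fixed value is worth trying.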
When running the Whisper model using the faster-whisper-server Docker container, I encounter a transcription issue where the output begins to “hallucinate” after a certain word. The model continuously repeats this word until the end of the transcription output, as shown below:
"Если не вакоеска, то паралляма сейчас обучаем, потому что, ну, это надо прямочень хорошо качать, чтобы шмуф, ну, как бы сейчас вот будет, если люди много нету, то, ну, как бы, ну, ну, ну, ну, ну, ну, ну, ну, ..." (the Russian filler word «ну», roughly "well", repeated until the end of the output)
This problem appears across all models, but its severity depends on model size and audio file length. For example, the hallucination begins with files around 1-2 MB in size when using the medium model, but only with larger files when using the small model. Tested with non-English audio.
This issue does not occur when I run the model directly via the faster-whisper Python library. Below are the details of how I am running the server and using the model in both contexts.
Start server:
Server client run:
Direct run with the faster-whisper framework on the same VM and GPU:
I guess the problem might be in how large files are split into pieces before being fed to the model, if the splitting tool is not taken from the framework itself.
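The splitting guess can at least be sanity-checked. A purely illustrative example (my own, not faster-whisper-server code): 16-bit PCM must be cut on 2-byte sample boundaries, and a splitter that slices the raw byte stream at arbitrary offsets corrupts every sample after the cut, which to the model would look exactly like a "broken" file:

```python
from array import array

samples = array("h", range(-5, 5))   # ten 16-bit samples
raw = samples.tobytes()              # 20 bytes of raw PCM

good = array("h")
good.frombytes(raw[:10])             # cut at a sample boundary: fine
assert good.tolist() == [-5, -4, -3, -2, -1]

bad = array("h")
try:
    bad.frombytes(raw[:9])           # cut mid-sample: misaligned data
except ValueError as err:
    print("misaligned cut:", err)
```

A real splitter would more likely shift the alignment than raise an error, garbling every later sample, so it could be worth checking what the server feeds into transcribe() for large uploads.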
Thank you!