fedirz / faster-whisper-server

https://hub.docker.com/r/fedirz/faster-whisper-server
MIT License
754 stars · 109 forks

Repeated Word Hallucination in Transcription Output #134

Open dailydaniel opened 2 weeks ago

dailydaniel commented 2 weeks ago

When running Whisper models using the faster-whisper-server Docker container, I encounter a transcription issue: after a certain word the output starts to "hallucinate", continuously repeating that word until the end of the transcription, as shown below:

"Если не вакоеска, то паралляма сейчас обучаем, потому что, ну, это надо прямочень хорошо качать, чтобы шмуф, ну, как бы сейчас вот будет, если люди много нету, то, ну, как бы, ну, ну, ну, ну, ну, ну, ну, ну, ну, ну, ну, ну, ну, ну, ну, ну, ну, ну, ну, ну, ну, ну, ну, ну, ну, ну, ну, ну, ну, ну, ну, ну, ну, ну, ну, ну, ну, ну, ну, ну, ну, ну, ну, ну, ну, ну, ну, ну, ну, ну, ну, ну, ну, ну, ну, ..."

This problem appears across all models, but its onset depends on model size and audio length: with the medium model the hallucination begins with files around 1–2 MB, while with the small model it only begins with larger files. Tested on non-English (Russian) audio; in the sample above the transcript degenerates into the filler word "ну" (roughly "well") repeated until the end.
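A quick way to flag transcripts that have degenerated into such a loop is to scan for long runs of the same word. This is a hypothetical helper for debugging, not part of faster-whisper-server:

```python
def has_repetition_loop(text: str, min_run: int = 10) -> bool:
    """Return True if any single word repeats min_run times in a row."""
    words = text.split()
    run = 1
    for prev, cur in zip(words, words[1:]):
        run = run + 1 if cur == prev else 1
        if run >= min_run:
            return True
    return False

print(has_repetition_loop("как бы " + "ну, " * 50))  # → True, like the sample above
print(has_repetition_loop("a normal transcript"))    # → False
```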

The issue does not occur when I run the model directly via the faster-whisper Python library. Below are the details of how I run the server and use the model in both contexts.

Start the server:

docker run -it -d --gpus "device=0" \
 -v ~/.cache/huggingface:/root/.cache/huggingface \
 -p 3004:8000 \
 --name faster-whisper \
 --restart unless-stopped \
 fedirz/faster-whisper-server:latest-cuda

Client call to the server:

from openai import OpenAI

# base_url matches the port mapping above; the api_key value is a placeholder
client = OpenAI(base_url="http://localhost:3004/v1", api_key="dummy")
# audio_file: an open binary file handle for the audio to transcribe

model = "Systran/faster-whisper-small"
transcript = client.audio.transcriptions.create(
    model=model, file=audio_file
)

Direct run with the faster-whisper library on the same VM and GPU:

import os

from faster_whisper import WhisperModel

os.environ["CUDA_VISIBLE_DEVICES"] = device_id
model = WhisperModel(model_size, device="cuda")
segments, info = model.transcribe(input_path)

I suspect the problem lies in how large files are split into chunks before being fed to the model, if the chunking logic is not the one provided by the framework.
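To illustrate why naive chunking could matter (a hypothetical sketch, not the server's actual code): if a container file such as WAV were split by raw bytes, only the first chunk would keep the format header, leaving every later chunk undecodable as a standalone file:

```python
def split_bytes(data: bytes, chunk_size: int) -> list[bytes]:
    """Naively split a byte string into fixed-size chunks."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

# stand-in for a real WAV file: a RIFF header followed by audio payload
data = b"RIFF....WAVEfmt " + b"\x00" * 24
chunks = split_bytes(data, 10)
# only chunks[0] starts with the RIFF magic; the rest are headerless bytes
```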

Thank you!

dailydaniel commented 1 week ago

I added logging in stt.py and found that the repeated words are already present in the output of transcribe_file:

segments, transcription_info = whisper.transcribe(
    file.file,
    task=Task.TRANSCRIBE,
    language=language,
    initial_prompt=prompt,
    word_timestamps="word" in timestamp_granularities,
    temperature=temperature,
    vad_filter=vad_filter,
    hotwords=hotwords,
)

@fedirz any thoughts or updates about this issue?

D3alWyth1T commented 3 days ago

Same issue here -- tried to transcribe an hour-long lecture and got the repetition of about 4 words a few hundred times until the end of the file.

fedirz commented 3 days ago

This is likely an issue in faster-whisper or the models themselves. I can look further into this if someone provides an English audio sample that a medium or large model hallucinates on. https://github.com/openai/whisper/discussions/679

Does setting ?vad_filter=true help?
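The server accepts vad_filter as a request parameter (it is forwarded to whisper.transcribe in stt.py, as the snippet above shows). A minimal sketch of enabling it via the query string, assuming the port mapping from the docker run command earlier (3004) and the OpenAI-style endpoint path:

```python
from urllib.parse import urlencode

# hypothetical base URL matching `-p 3004:8000` from the docker run above
base = "http://localhost:3004/v1/audio/transcriptions"
url = f"{base}?{urlencode({'vad_filter': 'true'})}"
# POST the audio file to this URL with the HTTP client of your choice
```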

dailydaniel commented 3 days ago

@fedirz I don't think it's faster-whisper, because I could not reproduce the issue with the faster-whisper library alone. But while debugging faster-whisper-server I found that the issue originates in stt.py, in the whisper.transcribe(… call.

So the issue may come from faster-whisper after all, but it's not a model problem; maybe the file is somehow corrupted or split incorrectly.

thiswillbeyourgithub commented 3 days ago

Is there some kind of silence of more than 30s around the location of the repeated words?

dailydaniel commented 3 days ago

No, I’ve tested it with different files

mkaskov commented 3 days ago

The same issue here with several languages. It happens on any model. I tried Russian and English; it is most often seen on files of 5 minutes or more.

thiswillbeyourgithub commented 3 days ago

What temperature have you used? Have you tried lowering it?
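For reference, faster-whisper's transcribe call accepts temperature along with other decoding knobs commonly tried against repetition loops, such as condition_on_previous_text. A hedged sketch of such settings (the parameter names are faster-whisper's; whether faster-whisper-server forwards all of them is not established in this thread):

```python
# Settings often tried against repetition loops in faster-whisper.
anti_loop_kwargs = dict(
    temperature=0.0,                   # greedy decoding, no sampling
    condition_on_previous_text=False,  # stop a loop from feeding on itself
    vad_filter=True,                   # skip long non-speech stretches
)
# usage with the direct-run model from earlier in the thread:
# segments, info = model.transcribe(input_path, **anti_loop_kwargs)
```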