SYSTRAN / faster-whisper

Faster Whisper transcription with CTranslate2
MIT License

Silero-VAD Meta Hallucinations #843

Open TedTimbrell opened 4 months ago

TedTimbrell commented 4 months ago

I noticed while transcribing some of my own audio that near-silence doesn't get removed during VAD. In fact, running noisereduce actually made the problem dramatically worse, making 10 seconds of falsely detected speech into a minute and a half of falsely detected speech.

Apologies if I'm referring to the wrong version of Silero, but it seems like this is a known issue / feature(tm). https://github.com/snakers4/silero-vad/issues/396

Performing a volume filter along with VAD might solve a fair number of hallucinations and might even remove the need to set condition_on_previous_text to False to prevent the hallucinations from ruining the rest (section) of the transcription.
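A rough sketch of the kind of volume filter I have in mind, applied after VAD. Everything here is an assumption for illustration: the helper names `rms_dbfs` and `drop_quiet_segments` are hypothetical, the -45 dBFS threshold is an arbitrary starting point, the audio is assumed to be float samples in [-1, 1], and the segments are assumed to be dicts with "start" and "end" sample indices. It is not how faster-whisper or Silero currently work.

```python
import math


def rms_dbfs(samples):
    """Return the RMS level of float samples (assumed range [-1, 1]) in dBFS."""
    if len(samples) == 0:
        return float("-inf")
    mean_square = sum(s * s for s in samples) / len(samples)
    if mean_square == 0.0:
        return float("-inf")
    # 10 * log10(mean square) equals 20 * log10(RMS).
    return 10.0 * math.log10(mean_square)


def drop_quiet_segments(audio, segments, threshold_dbfs=-45.0):
    """Keep only the VAD segments whose RMS level reaches threshold_dbfs.

    audio:    sequence of float samples (hypothetical input format)
    segments: list of {"start": int, "end": int} sample indices
              (assumed shape of the VAD output)
    """
    kept = []
    for seg in segments:
        chunk = audio[seg["start"]:seg["end"]]
        if rms_dbfs(chunk) >= threshold_dbfs:
            kept.append(seg)
    return kept


if __name__ == "__main__":
    # One near-silent segment followed by one clearly audible segment.
    audio = [0.001] * 1000 + [0.5] * 1000
    segments = [{"start": 0, "end": 1000}, {"start": 1000, "end": 2000}]
    print(drop_quiet_segments(audio, segments))
```

The pure-Python loop is only for readability; on real audio a vectorized version would be much faster. Since the surviving segments keep their original sample indices, no second round of timestamp adjustment should be needed.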

I'm down to try it out and open a PR if you're all open to it. Before I do though, I'm curious if this came up when adding the hallucination detection logic.

It'd be really nice to have in this library so that I don't have to perform a second layer of timestamp adjustments.

trungkienbkhn commented 4 months ago

@TedTimbrell, hello. Feel free to open a new PR. Could you also attach an example audio file?

ngcheeyuan commented 2 months ago

@TedTimbrell any follow up on this?

Petemir commented 2 months ago

FYI, the latest silero-vad version (v5) seems to have solved this (see the last comment on the linked issue). Perhaps the model could be updated in faster-whisper?

Edit: OK, it seems this was done in #884 :). @TedTimbrell, perhaps you could try your pipeline again...