ggerganov / whisper.cpp

Port of OpenAI's Whisper model in C/C++

[Feature] No speech detection #1026

Open · patrickmccallum opened this issue 1 year ago

patrickmccallum commented 1 year ago

I noticed in the whisper.h file that no_speech_thold is commented as not implemented. I've seen this feature in the Python version from OpenAI and found it very useful, or at the very least getting the no_speech value out per segment for further processing of the output. It was very helpful for discarding/preventing hallucinations caused by silence or background sounds.
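For context, a minimal sketch of how the per-segment value can be used with the openai-whisper Python package (the audio.wav filename and the 0.5 post-filter threshold are my own illustrative choices; 0.6 and -1.0 are the package's default no_speech_threshold and logprob_threshold):

```python
import whisper

model = whisper.load_model("base")
result = model.transcribe("audio.wav")

# transcribe() already drops a segment when no_speech_prob > 0.6 AND
# avg_logprob < -1.0; the values are also returned per segment, so a
# stricter post-filter can be applied for further processing.
for seg in result["segments"]:
    if seg["no_speech_prob"] > 0.5:  # stricter than the built-in 0.6
        continue  # likely silence / background noise
    print(f"[{seg['start']:7.2f} -> {seg['end']:7.2f}] {seg['text']}")
```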

I was just wondering what would be involved in getting this up and running and merged into the whisper.cpp project.

Thanks!

hbf731eF commented 1 year ago

A workaround using the WebRTC Voice Activity Detector (VAD) via py-webrtcvad:

```
pip install webrtcvad
```

example.py generates chunk files (chunk-00.wav, chunk-01.wav, chunk-02.wav, ...); an aggressiveness of 2 is a good default:

```
python ./example.py <aggressiveness> <your-wav-file>
```

Prepare input.txt for ffmpeg:

```
ls -v chunk*.wav > input.txt
sed -i 's/^/file /g' input.txt
```

Concatenate the audio files listed in input.txt, producing output.wav with only the voiced segments of <your-wav-file>:

```
ffmpeg -f concat -i input.txt -c copy output.wav
```
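For completeness, the per-frame decision that example.py builds on is only a few lines. A sketch assuming 16 kHz, 16-bit mono PCM input and a hypothetical your-wav-file.wav (py-webrtcvad accepts 10, 20, or 30 ms frames):

```python
import wave
import webrtcvad

SAMPLE_RATE = 16000
FRAME_MS = 30  # webrtcvad accepts 10, 20, or 30 ms frames
FRAME_SAMPLES = SAMPLE_RATE * FRAME_MS // 1000

vad = webrtcvad.Vad(2)  # aggressiveness 0-3; 2 is a good default

with wave.open("your-wav-file.wav", "rb") as wf:
    assert wf.getframerate() == SAMPLE_RATE
    assert wf.getnchannels() == 1 and wf.getsampwidth() == 2
    while True:
        frame = wf.readframes(FRAME_SAMPLES)
        if len(frame) < FRAME_SAMPLES * 2:  # 2 bytes per 16-bit sample
            break
        if vad.is_speech(frame, SAMPLE_RATE):
            pass  # voiced frame: keep it for the output chunks
```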

patrickmccallum commented 1 year ago

A great library, but sadly it doesn't work in my case. Our requirements include offline distribution, which makes Python a no-go (the freeze and packaging tools just aren't there).

vvvm23 commented 1 year ago

I would also really like to see this feature added. I currently get a lot of hallucinations in periods of silence in my inputs.

The relevant code in the original OpenAI implementation can be found here.

I am not an expert with the whisper.cpp codebase, but going by the upstream implementation, I think we would need to do the following (see the sketch after this comment):

- compute the probability of the <|nospeech|> token from the logits at the first decoding step of each segment;
- expose that value per segment through the API, alongside the existing segment data;
- when the no-speech probability exceeds no_speech_thold and the average log-probability of the decoded tokens falls below the log-probability threshold, treat the segment as silence and skip it.

However, I am not sure whether this would affect anything else; I just want to kick off the conversation.
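For reference, a minimal Python sketch of the corresponding upstream logic (the function names here are illustrative, but the computation and the 0.6 / -1.0 defaults follow whisper/decoding.py and whisper/transcribe.py):

```python
import torch

def no_speech_prob(sot_logits: torch.Tensor, no_speech_token: int) -> float:
    # At the first decoding step, softmax the logits at the SOT position
    # over the vocabulary and read off the <|nospeech|> token's probability.
    return sot_logits.float().softmax(dim=-1)[no_speech_token].item()

def should_skip_segment(no_speech: float, avg_logprob: float,
                        no_speech_thold: float = 0.6,
                        logprob_thold: float = -1.0) -> bool:
    # A segment counts as silence only when BOTH hold: the model puts high
    # probability on <|nospeech|> AND decoding was low-confidence overall.
    return no_speech > no_speech_thold and avg_logprob < logprob_thold
```

Note that the conjunction matters: a confidently decoded segment is kept even when the no-speech probability is high.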

aehlke commented 1 year ago

I'm now using libfvad (cross-platform; I'm running it on iOS/macOS) to detect speech before enabling transcription.