ggerganov / whisper.cpp

Port of OpenAI's Whisper model in C/C++
MIT License

[Feature] No speech detection #1026

Open patrickmccallum opened 1 year ago

patrickmccallum commented 1 year ago

I noticed in the whisper.h file that no_speech_thold is commented as not implemented. I've seen this in the Python version from OpenAI and found it very useful, or at least getting the no_speech value per segment for further processing of the output. It was very helpful for discarding/preventing hallucinations caused by silence or background sounds.

Was just wondering what would be involved in getting this up and running in the whisper.cpp project.

Thanks!

hbf731eF commented 1 year ago

A workaround using the WebRTC Voice Activity Detector (VAD), py-webrtcvad:

```shell
pip install webrtcvad
```

example.py generates chunk files (chunk-00.wav, chunk-01.wav, chunk-02.wav, ...); an aggressiveness of 2 is a good default:

```shell
python ./example.py <aggressiveness> <your-wav-file>
```

Prepare input.txt for ffmpeg:

```shell
ls -v chunk*.wav > input.txt
sed -i 's/^/file /g' input.txt
```

Concatenate the audio files listed in input.txt into output.wav, which contains only the voiced segments of <your-wav-file>:

```shell
ffmpeg -f concat -i input.txt -c copy output.wav
```
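For illustration, the frame-level idea behind the workaround above can be sketched without the webrtcvad dependency. Note this is a minimal, hypothetical energy-based VAD, not the GMM-based classifier webrtcvad actually uses (which operates on 10/20/30 ms frames of 16-bit mono PCM at 8/16/32/48 kHz); all names and the threshold value here are illustrative assumptions:

```python
import struct

# Assumed frame layout: 30 ms of 16-bit mono PCM at 16 kHz.
FRAME_MS = 30
SAMPLE_RATE = 16000
SAMPLES_PER_FRAME = SAMPLE_RATE * FRAME_MS // 1000

def frame_is_speech(frame, threshold=500.0):
    """Classify a 16-bit PCM frame as speech if its RMS energy
    exceeds an (illustrative) threshold. Real VADs use a trained
    model, not raw energy."""
    samples = struct.unpack(f"<{len(frame) // 2}h", frame)
    rms = (sum(s * s for s in samples) / max(len(samples), 1)) ** 0.5
    return rms > threshold

def voiced_frames(pcm_bytes):
    """Yield only the frames classified as speech, i.e. the
    'keep voiced segments' step of the workaround."""
    frame_bytes = SAMPLES_PER_FRAME * 2  # 2 bytes per 16-bit sample
    for off in range(0, len(pcm_bytes) - frame_bytes + 1, frame_bytes):
        frame = pcm_bytes[off:off + frame_bytes]
        if frame_is_speech(frame):
            yield frame
```

In the real pipeline, the kept frames would be written out as chunk-NN.wav files and concatenated with ffmpeg as shown above.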

patrickmccallum commented 1 year ago

A great library, but sadly it doesn't work in my case. Our requirements include offline distribution, which makes Python a no-go (the freeze and packaging tools just aren't there).

vvvm23 commented 1 year ago

I would also really like to see this feature added. I currently get a lot of hallucinations in periods of silence in my inputs.

The relevant code in the original OpenAI implementation can be found here

I am not an expert with the whisper.cpp codebase, but I think we would need to do the following:

However I am not sure if this would affect anything else, just wanting to kick off the conversation.

aehlke commented 11 months ago

I'm now using libfvad (cross-platform; I'm running it on iOS/macOS) to detect speech and gate transcription