patrickmccallum opened 1 year ago
A workaround using the WebRTC Voice Activity Detector (VAD), py-webrtcvad:

```bash
pip install webrtcvad

# example.py (shipped with py-webrtcvad) generates chunk files:
# chunk-00.wav chunk-01.wav chunk-02.wav ...
# aggressiveness ranges from 0 to 3, e.g. 2
python ./example.py <aggressiveness> <your-wav-file>

# prepare input.txt for ffmpeg
ls -v chunk*.wav > input.txt
sed -i 's/^/file /g' input.txt

# concatenate the audio files listed in input.txt into output.wav,
# which contains only the voiced segments of <your-wav-file>
ffmpeg -f concat -i input.txt -c copy output.wav
```
A great library, but sadly doesn't work in my case. Our requirements include offline distribution which makes python a no-go (freeze and packaging tools just aren't there).
I would also really like to see this feature added. I currently get a lot of hallucinations in periods of silence in my inputs.
The code in the original OpenAI implementation that handles this can be found here.
I am not an expert with the whisper.cpp codebase, but I think we would need to do the following:

- Add a `no_speech_prob` entry to the decoder class and populate this value during the decoder forward pass.
- Use that value in the `whisper_process_logits` function here.

However, I am not sure if this would affect anything else; just wanting to kick off the conversation.
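To make the proposal above concrete, here is a minimal sketch (not whisper.cpp's actual code) of how a no-speech probability could be derived from the decoder logits of the first decoding step and combined with a threshold, following the logic of the OpenAI reference implementation. The names `token_nosp`, `compute_no_speech_prob`, and `is_silent_segment` are illustrative only.

```cpp
// Illustrative sketch only -- not the actual whisper.cpp implementation.
// Assumes `logits` holds one value per vocabulary token from the first
// decoding step, and `token_nosp` is the id of the <|nospeech|> token
// (both names are placeholders for this example).
#include <algorithm>
#include <cmath>
#include <vector>

float compute_no_speech_prob(const std::vector<float> & logits, int token_nosp) {
    // numerically stable softmax over the full vocabulary
    const float max_logit = *std::max_element(logits.begin(), logits.end());

    double sum = 0.0;
    for (float l : logits) {
        sum += std::exp(l - max_logit);
    }

    // probability mass assigned to the no-speech token
    return std::exp(logits[token_nosp] - max_logit) / (float) sum;
}

// The OpenAI reference treats a segment as silent when the no-speech
// probability is high AND the average log-probability of the decoded
// tokens is low; something similar could gate segments in whisper.cpp.
bool is_silent_segment(float no_speech_prob, float avg_logprob,
                       float no_speech_thold, float logprob_thold) {
    return no_speech_prob > no_speech_thold && avg_logprob < logprob_thold;
}
```

For reference, the Python implementation's defaults are 0.6 for the no-speech threshold and -1.0 for the log-probability threshold.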
I'm now using libfvad (cross-platform, I'm running it on iOS/macOS) to detect speech for enabling transcription.
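For anyone taking the same route, below is a minimal sketch of gating audio with libfvad before passing it to transcription. It assumes 16 kHz mono 16-bit PCM and 30 ms frames; check fvad.h in libfvad for the authoritative API.

```cpp
// Minimal sketch (assumptions: 16 kHz, mono, 16-bit PCM, 30 ms frames).
// See fvad.h in libfvad for the authoritative API.
#include <fvad.h>

#include <cstdint>
#include <cstdio>
#include <vector>

int main() {
    Fvad * vad = fvad_new();
    if (!vad) {
        return 1;
    }

    fvad_set_sample_rate(vad, 16000); // libfvad supports 8/16/32/48 kHz
    fvad_set_mode(vad, 2);            // aggressiveness 0 (least) .. 3 (most)

    // 30 ms at 16 kHz = 480 samples; fill this from your capture pipeline
    std::vector<int16_t> frame(480, 0);

    // fvad_process returns 1 for voiced, 0 for unvoiced, -1 on error
    const int voiced = fvad_process(vad, frame.data(), frame.size());
    if (voiced == 1) {
        // only buffer/forward this audio to whisper_full() for transcription
        printf("speech detected\n");
    }

    fvad_free(vad);
    return 0;
}
```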
I noticed in the `whisper.h` file that `no_speech_thold` is commented as not implemented. I've seen this in the Python version from OpenAI and found it to be very useful, or at least getting the no_speech value out per segment for further processing of the output. It was very useful for helping to discard/prevent hallucinations caused by silence or background sounds. I was just wondering what would be involved in getting this up and running and into the whisper.cpp project.
Thanks!
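If a per-segment no-speech value were exposed, post-filtering the output could look roughly like the sketch below. Note that `whisper_full_get_segment_no_speech_prob` is a hypothetical getter here, not something the version under discussion provides; only `whisper_full_n_segments` and `whisper_full_get_segment_text` are existing whisper.cpp API.

```cpp
// Hypothetical sketch of filtering output segments by a no-speech value.
// The getter mentioned below does NOT exist in the whisper.cpp version
// discussed here; it only illustrates the kind of API being requested.
#include "whisper.h"

#include <cstdio>

void print_voiced_segments(struct whisper_context * ctx, float no_speech_thold) {
    const int n_segments = whisper_full_n_segments(ctx);

    for (int i = 0; i < n_segments; ++i) {
        // hypothetical: const float p = whisper_full_get_segment_no_speech_prob(ctx, i);
        const float p = 0.0f; // stand-in value until such a getter exists

        if (p > no_speech_thold) {
            continue; // likely silence or background noise -> drop the segment
        }

        printf("%s\n", whisper_full_get_segment_text(ctx, i));
    }
}
```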