I also hope so, thanks.
What's the best practice for VAD filtering?
I was thinking of adding Silero VAD and setting a relatively high value for min_silence_duration_ms
(e.g. 2 seconds) to remove the audio parts where there is clearly no speech.
Does that make sense? Do you have other approaches or parameters to recommend?
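Roughly what I had in mind, as a sketch based on the Silero VAD torch.hub example (the 2-second value and the file name are just placeholders):

import torch

model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
(get_speech_timestamps, save_audio, read_audio, vad_iterator, collect_chunks) = utils

wav = read_audio("audio.wav", sampling_rate=16000)

# Keep only the regions Silero marks as speech; gaps shorter than
# min_silence_duration_ms are not treated as breaks between speech segments.
speech_timestamps = get_speech_timestamps(
    wav,
    model,
    sampling_rate=16000,
    min_silence_duration_ms=2000,
)

# Concatenate the speech-only parts before handing them to Whisper.
speech_only = collect_chunks(speech_timestamps, wav)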
Yeah, Silero seems like an appropriate choice; it can run on CPU.
setting a relatively high value for min_speech_duration_ms (e.g. 2 seconds) to remove the audio parts where there is clearly no speech.
Yeah, I think it can be done like that. We can expose min_speech_duration_ms as a parameter/variable, so it can be adjusted according to the conditions.
Some references:
I opened a PR for this feature. Can you help test it and let me know if it works as you'd expect?
You can install the branch with:
pip install --force-reinstall "faster-whisper @ https://github.com/guillaumekln/faster-whisper/archive/refs/heads/vad.tar.gz"
Then it can be enabled with vad_filter=True:
model.transcribe(..., vad_filter=True, vad_min_silence_duration_ms=2000)
Audio segments without speech for at least vad_min_silence_duration_ms milliseconds will be ignored. The default value is 2000 (2 seconds).
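For testing, a minimal script could look something like this (assuming the usual faster-whisper usage pattern; the model size and file name are arbitrary):

from faster_whisper import WhisperModel

model = WhisperModel("medium")

# Transcribe with the VAD filter enabled; silences of at least 2 seconds are dropped.
segments, info = model.transcribe(
    "audio.mp3",
    vad_filter=True,
    vad_min_silence_duration_ms=2000,
)

for segment in segments:
    print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))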
Thanks, sure. Let me try it on different cases.
Seems to work for me. I'm not sure of the best way to test it, but with content that would usually run into errors due to music, those errors seem to be gone now.
I tested the VAD and found that the segment timestamps have a problem when vad_min_silence_duration_ms=20: the previous end time is greater than the next start time.
Voice source: https://www.youtube.com/shorts/GNicgvdoCpc
Model size: medium
model.transcribe(audio, language="zh", vad_filter=True, vad_min_silence_duration_ms=20)
result:
This is a general issue with Whisper, which can predict timestamps bigger than the audio duration (see for example https://github.com/openai/whisper/discussions/124). I'm not sure what the best solution is at this time.
Also, I don't think it is helpful to use a value this small for vad_min_silence_duration_ms. At the very least it will be very inefficient, because the audio will be segmented into small chunks that are then padded to 30 seconds as required by the Whisper model.
EDIT: a better approach seems to be to concatenate all the audio chunks and then restore the timestamps after transcription.
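A rough sketch of what restoring the timestamps could look like (illustrative only; the chunk structure and names are assumptions, not the actual implementation):

def restore_timestamp(t, chunks, sampling_rate=16000):
    """Map a time t (seconds, relative to the concatenated speech-only audio)
    back to a time in the original audio. chunks is the list of
    {"start": ..., "end": ...} sample ranges returned by the VAD."""
    offset = 0.0
    for chunk in chunks:
        duration = (chunk["end"] - chunk["start"]) / sampling_rate
        if t <= offset + duration:
            return chunk["start"] / sampling_rate + (t - offset)
        offset += duration
    # t falls past the last chunk (Whisper sometimes over-predicts): clamp it.
    return chunks[-1]["end"] / sampling_rate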
I'm getting this error occasionally:
Traceback (most recent call last):
  File "/usr/local/bin/whisper-ctranslate2", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.8/dist-packages/src/whisper_ctranslate2/whisper_ctranslate2.py", line 330, in main
    result = Transcribe().inference(
  File "/usr/local/lib/python3.8/dist-packages/src/whisper_ctranslate2/transcribe.py", line 129, in inference
    for segment in segments:
  File "/usr/local/lib/python3.8/dist-packages/faster_whisper/transcribe.py", line 329, in transcribe_chunks
    for segment in segments:
  File "/usr/local/lib/python3.8/dist-packages/faster_whisper/transcribe.py", line 489, in generate_segments
    self.add_word_timestamps(
  File "/usr/local/lib/python3.8/dist-packages/faster_whisper/transcribe.py", line 658, in add_word_timestamps
    alignment = self.find_alignment(
  File "/usr/local/lib/python3.8/dist-packages/faster_whisper/transcribe.py", line 723, in find_alignment
    words, word_tokens = tokenizer.split_to_word_tokens(
  File "/usr/local/lib/python3.8/dist-packages/faster_whisper/tokenizer.py", line 111, in split_to_word_tokens
    return self.split_tokens_on_spaces(tokens)
  File "/usr/local/lib/python3.8/dist-packages/faster_whisper/tokenizer.py", line 143, in split_tokens_on_spaces
    subwords, subword_tokens_list = self.split_tokens_on_unicode(tokens)
  File "/usr/local/lib/python3.8/dist-packages/faster_whisper/tokenizer.py", line 130, in split_tokens_on_unicode
    or decoded_full[unicode_offset + decoded.index(replacement_char)]
IndexError: string index out of range
Not sure if it's an existing bug with Whisper or if it's something from the new implementation.
This is a general issue with Whisper, which can predict timestamps bigger than the audio duration (see for example openai/whisper#124). I'm not sure what the best solution is at this time.
Also, I don't think it is helpful to use a value this small for vad_min_silence_duration_ms. At the very least it will be very inefficient, because the audio will be segmented into small chunks that are then padded to 30 seconds as required by the Whisper model.
Thanks. I resolved it by enabling word_timestamps. It looks like word_timestamps readjusts the start and end times.
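For reference, this is roughly the call that works for me now (same parameters as above, plus the word_timestamps flag):

model.transcribe(audio, language="zh", vad_filter=True, vad_min_silence_duration_ms=20, word_timestamps=True)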
Is there a plan to expose the other VAD parameters in transcribe()? E.g. threshold, min_speech_duration_ms, max_speech_duration_s, window_size_samples, speech_pad_ms.
Not sure if it's an existing bug with Whisper or if it's something from the new implementation.
I think I also saw this error in another Whisper repo but I can no longer find it. At least it is not related to the VAD development.
Is there a plan to expose the other VAD parameters in transcribe()?
In that case I will probably add a single dict parameter vad_parameters:
model.transcribe(..., vad_filter=True, vad_parameters=dict(min_silence_duration_ms=2000))
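Presumably the other Silero parameters listed above could then be passed the same way, for example (values are placeholders, not recommendations):

model.transcribe(
    ...,
    vad_filter=True,
    vad_parameters=dict(
        threshold=0.5,
        min_speech_duration_ms=250,
        max_speech_duration_s=30,
        min_silence_duration_ms=2000,
        window_size_samples=1024,
        speech_pad_ms=400,
    ),
)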
thanks
@guillaumekln I just saw this and wanted to share some experiences I collected when experimenting with Silero VAD as a preprocessor to Whisper. Maybe it is relevant here as well.
In general Silero VAD is very precise and reliable, BUT the results heavily depend on the context length! My plan was to stream chunks of audio, scan each chunk with Silero, and decide when there is a good spot to start processing the currently buffered chunks, ending one "context block" and continuing with the next. Basically I wanted to split the stream at the best possible position while keeping enough context length and accuracy.
What I learned is that accuracy drops significantly for short chunks (< 3 s maybe), depending a little on what happens within that period. If a chunk is basically just silence, Silero will try to find voices in the silence, probably because it has no reference yet for the real signal.
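To make the chunked approach concrete, here is a rough sketch of what I was doing (the stream source and the Whisper call are placeholders, and the 1-second silence threshold is just what I experimented with):

import torch

model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
get_speech_timestamps = utils[0]

SAMPLING_RATE = 16000

buffer = []
for chunk in audio_chunk_stream():  # placeholder: yields 1D float tensors at 16 kHz
    buffer.append(chunk)
    audio = torch.cat(buffer)
    speech = get_speech_timestamps(audio, model, sampling_rate=SAMPLING_RATE)
    # Split only when the buffered audio ends in at least 1 second of silence,
    # so each "context block" keeps enough context for both Silero and Whisper.
    if speech and speech[-1]["end"] < len(audio) - SAMPLING_RATE:
        transcribe_block(audio)  # placeholder for the Whisper call
        buffer = []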
Do you apply Silero VAD once on the full file?
Yes, Silero VAD is run once on the complete audio.
Thank you for releasing the code.
Since this implementation requires less memory than other implementations, adding VAD (voice activity detection) should be even more suitable. Voice activity detection makes Whisper more accurate, especially for non-English audio (https://github.com/openai/whisper/discussions/29#discussioncomment-3726710).
Would it be possible to add this? Thank you.