SYSTRAN / faster-whisper

Faster Whisper transcription with CTranslate2

Feature: Add support for VAD filter #39

Closed: acul3 closed this issue 1 year ago

acul3 commented 1 year ago

Thank you for releasing the code.

Since this implementation requires less memory than other implementations, adding VAD (voice activity detection) should be a good fit. Voice activity detection makes Whisper more accurate, especially for non-English audio.

(https://github.com/openai/whisper/discussions/29#discussioncomment-3726710)

Would it be possible to add this? Thank you.

anyshu commented 1 year ago

I'm also hoping for this, thanks.

guillaumekln commented 1 year ago

What's the best practice for VAD filtering?

I was thinking of adding Silero VAD and setting a relatively high value for min_silence_duration_ms (e.g. 2 seconds) to remove the audio parts where there is clearly no speech.
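
For reference, a rough sketch of what this could look like with the Silero VAD utilities from torch.hub (the audio path is a placeholder):

```python
import torch

# Load Silero VAD and its helper functions from torch.hub.
model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
get_speech_timestamps, _, read_audio, _, _ = utils

# Silero VAD expects 16 kHz mono audio.
wav = read_audio("audio.wav", sampling_rate=16000)

# A relatively high min_silence_duration_ms (2 s) so only parts with
# clearly no speech are dropped.
speech_timestamps = get_speech_timestamps(
    wav, model, sampling_rate=16000, min_silence_duration_ms=2000
)
print(speech_timestamps)  # [{'start': ..., 'end': ...}, ...] in samples
```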

Does that make sense? Do you have other approaches or parameters to recommend?

acul3 commented 1 year ago

Yeah, Silero seems like an appropriate choice; it can run on CPU.

> setting a relatively high value for min_speech_duration_ms (e.g. 2 seconds) to remove the audio parts where there is clearly no speech.

Yeah, I think it can be done like that. We could expose min_speech_duration_ms as a parameter/variable so it can be adjusted for different conditions.

Some references:

https://github.com/m-bain/whisperX/pull/103

https://github.com/jianfch/stable-ts/blob/d44d287cbf93d1ad703359023fcb3f4ebbe02d46/stable_whisper/stabilization.py#L245

guillaumekln commented 1 year ago

I opened a PR for this feature. Can you help test it and let me know if it works as you'd expect?

You can install the branch with:

pip install --force-reinstall "faster-whisper @ https://github.com/guillaumekln/faster-whisper/archive/refs/heads/vad.tar.gz"

Then it can be enabled with vad_filter=True:

model.transcribe(..., vad_filter=True, vad_min_silence_duration_ms=2000)

Audio segments without speech for at least vad_min_silence_duration_ms milliseconds will be ignored. The default value is 2000 (2 seconds).
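
For example, a quick end-to-end check (model size and audio file are placeholders):

```python
from faster_whisper import WhisperModel

model = WhisperModel("medium")

# With the VAD filter enabled, audio parts with at least 2 seconds of
# silence are dropped before transcription.
segments, info = model.transcribe(
    "audio.mp3", vad_filter=True, vad_min_silence_duration_ms=2000
)

for segment in segments:
    print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))
```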

acul3 commented 1 year ago

Thanks, sure. Let me try it on different cases.

mayeaux commented 1 year ago

Seems to work for me. I'm not sure of the best way to test it, but with content that would usually run into errors due to music, those errors seem to be gone now.

johnchienbronci commented 1 year ago

I tested the VAD and found that the segment timestamps have a problem when vad_min_silence_duration_ms=20.

Cause: a previous segment's end time is greater than the next segment's start time.

Voice source: https://www.youtube.com/shorts/GNicgvdoCpc
Model size: medium

model.transcribe(audio, language="zh", vad_filter=True, vad_min_silence_duration_ms=20)

Result:

(screenshot of the overlapping segment timestamps)

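A quick way to spot such overlaps in the output (a sketch, not part of the original report):

```python
# Flag segments whose start precedes the previous segment's end.
prev_end = 0.0
for segment in segments:
    if segment.start < prev_end:
        print(f"overlap: previous end {prev_end:.2f}s > next start {segment.start:.2f}s")
    prev_end = segment.end
```
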
guillaumekln commented 1 year ago

This is a general issue with Whisper, which can predict timestamps bigger than the audio duration (see for example https://github.com/openai/whisper/discussions/124). I'm not sure what the best solution is at this time.

Also, I don't think it is helpful to use a value this small for vad_min_silence_duration_ms. At the very least it will be inefficient, because the audio will be segmented into small chunks that are then padded to 30 seconds, as required by the Whisper model.

EDIT: a better approach seems to be to concatenate all speech chunks and then restore the original timestamps after transcription.
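
A sketch of that idea (the chunk list and helper are illustrative, not the actual implementation):

```python
# speech_chunks: VAD output as (start, end) pairs, in seconds, in the
# ORIGINAL audio. The chunks are concatenated and transcribed as one
# stream; a time t in the concatenated audio is mapped back like this:
def restore_timestamp(t, speech_chunks):
    offset = 0.0  # seconds of concatenated audio consumed so far
    for start, end in speech_chunks:
        duration = end - start
        if t <= offset + duration:
            return start + (t - offset)
        offset += duration
    return speech_chunks[-1][1]  # clamp past-the-end times

speech_chunks = [(0.5, 4.2), (9.8, 15.0)]
print(restore_timestamp(5.0, speech_chunks))  # 11.1 on the original timeline
```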

mayeaux commented 1 year ago

I'm getting this error occasionally:

1|api  | STDERR: Traceback (most recent call last):
1|api  |   File "/usr/local/bin/whisper-ctranslate2", line 8, in <module>
1|api  |     sys.exit(main())
1|api  |   File "/usr/local/lib/python3.8/dist-packages/src/whisper_ctranslate2/whisper_ctranslate2.py", line 330, in main
1|api  |     result = Transcribe().inference(
1|api  |   File "/usr/local/lib/python3.8/dist-packages/src/whisper_ctranslate2/transcribe.py", line 129, in inference
1|api  |     for segment in segments:
1|api  |   File "/usr/local/lib/python3.8/dist-packages/faster_whisper/transcribe.py", line 329, in transcribe_chunks
1|api  |     for segment in segments:
1|api  |   File "/usr/local/lib/python3.8/dist-packages/faster_whisper/transcribe.py", line 489, in generate_segments
1|api  |     self.add_word_timestamps(
1|api  |   File "/usr/local/lib/python3.8/dist-packages/faster_whisper/transcribe.py", line 658, in add_word_timestamps
1|api  |     alignment = self.find_alignment(
1|api  |   File "/usr/local/lib/python3.8/dist-packages/faster_whisper/transcribe.py", line 723, in find_alignment
1|api  |     words, word_tokens = tokenizer.split_to_word_tokens(
1|api  |   File "/usr/local/lib/python3.8/dist-packages/faster_whisper/tokenizer.py", line 111, in split_to_word_tokens
1|api  |     return self.split_tokens_on_spaces(tokens)
1|api  |   File "/usr/local/lib/python3.8/dist-packages/faster_whisper/tokenizer.py", line 143, in split_tokens_on_spaces
1|api  |     subwords, subword_tokens_list = self.split_tokens_on_unicode(tokens)
1|api  |   File "/usr/local/lib/python3.8/dist-packages/faster_whisper/tokenizer.py", line 130, in split_tokens_on_unicode
1|api  |     or decoded_full[unicode_offset + decoded.index(replacement_char)]
1|api  | IndexError: string index out of range

Not sure if it's an existing bug with Whisper or something from the new implementation.

johnchienbronci commented 1 year ago

> This is a general issue with Whisper, which can predict timestamps bigger than the audio duration (see for example openai/whisper#124). I'm not sure what the best solution is at this time.
>
> Also, I don't think it is helpful to use a value this small for vad_min_silence_duration_ms. At the very least it will be inefficient, because the audio will be segmented into small chunks that are then padded to 30 seconds, as required by the Whisper model.

Thanks. I resolved it by enabling word_timestamps. It looks like word_timestamps readjusts the start and end times.
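
For reference, roughly what this looks like (a sketch; model and audio are the same as in the call above):

```python
segments, info = model.transcribe(
    audio,
    language="zh",
    vad_filter=True,
    vad_min_silence_duration_ms=20,
    word_timestamps=True,  # re-aligns segment start/end times per word
)

for segment in segments:
    for word in segment.words:
        print("[%.2fs -> %.2fs] %s" % (word.start, word.end, word.word))
```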

(screenshot of the readjusted timestamps)

johnchienbronci commented 1 year ago

Is there a plan to expose the other VAD parameters in transcribe()? E.g. threshold, min_speech_duration_ms, max_speech_duration_s, window_size_samples, speech_pad_ms.

guillaumekln commented 1 year ago

> Not sure if it's an existing bug with Whisper or something from the new implementation.

I think I also saw this error in another Whisper repo but I can no longer find it. At least it is not related to the VAD development.

> Is there a plan to expose the other VAD parameters in transcribe()?

In that case I will probably add a single dict parameter vad_parameters:

model.transcribe(..., vad_filter=True, vad_parameters=dict(min_silence_duration_ms=2000))
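
The other Silero options could then be passed the same way, for example (parameter names taken from the Silero VAD API; the values are just illustrative):

model.transcribe(..., vad_filter=True, vad_parameters=dict(threshold=0.5, min_speech_duration_ms=250, speech_pad_ms=200))
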
johnchienbronci commented 1 year ago

> > Is there a plan to expose the other VAD parameters in transcribe()?
>
> In that case I will probably add a single dict parameter vad_parameters:
>
> model.transcribe(..., vad_filter=True, vad_parameters=dict(min_silence_duration_ms=2000))

Thanks!

fquirin commented 1 year ago

@guillaumekln I just saw this and wanted to share some experience I collected when experimenting with Silero VAD as a preprocessor for Whisper. Maybe it is relevant here as well.

In general, Silero VAD is very precise and reliable, BUT the results depend heavily on the context length! My plan was to stream chunks of audio, scan each chunk with Silero, and decide when there is a good spot to start processing the currently buffered chunks, ending one "context block" and continuing with the next. Basically, I wanted to split the stream at the best possible position while keeping enough context length and accuracy.

What I learned is that accuracy drops significantly for short chunks (maybe <3 s), depending a little on what happens within that period. If it is basically just silence, Silero will try to find voices in the silence, probably because it has no reference yet for the real signal.
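
For context, this is roughly the kind of streaming setup I was experimenting with (a sketch using Silero's VADIterator helper; the chunk size and file name are assumptions):

```python
import torch

model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
_, _, read_audio, VADIterator, _ = utils

vad_iterator = VADIterator(model)
wav = read_audio("stream.wav", sampling_rate=16000)

window = 512  # samples per VAD window at 16 kHz (model-dependent)
for i in range(0, len(wav), window):
    chunk = wav[i : i + window]
    if len(chunk) < window:
        break
    boundary = vad_iterator(chunk, return_seconds=True)
    if boundary:  # {'start': ...} or {'end': ...} at a speech boundary
        print(boundary)
vad_iterator.reset_states()
```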

Do you apply Silero VAD once on the full file?

guillaumekln commented 1 year ago

Yes, Silero VAD is run once on the complete audio.
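
For the curious, a rough sketch of that flow using the Silero helper functions (this mirrors the idea, not the exact faster-whisper code):

```python
import torch

model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
get_speech_timestamps, _, read_audio, _, collect_chunks = utils

# Run VAD once over the complete file...
wav = read_audio("audio.wav", sampling_rate=16000)
timestamps = get_speech_timestamps(wav, model, min_silence_duration_ms=2000)

# ...then keep only the speech and feed the concatenated audio to Whisper.
speech_only = collect_chunks(timestamps, wav)
```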