SYSTRAN / faster-whisper

Faster Whisper transcription with CTranslate2
MIT License

VAD is relatively slow #364

Open AlexandderGorodetski opened 1 year ago

AlexandderGorodetski commented 1 year ago

Hello guys,

I am using the VAD of faster-whisper with the commands below. On the TED-LIUM benchmark I found that VAD takes 8% of the time and transcription takes 92%. I would like to reduce the VAD time so that it takes no more than 1%. Is it possible to optimize the VAD procedure in terms of real time? Maybe it is possible to run VAD on several CPUs? BTW, I see that VAD runs on the CPU; is it possible to run it on the GPU somehow?

# Imports for the helpers used below (from faster-whisper's audio and vad modules).
from faster_whisper.audio import decode_audio
from faster_whisper.vad import (
    collect_chunks,
    get_speech_timestamps,
    restore_speech_timestamps,
)

# VAD
audio_buffer = decode_audio(audio_filename, sampling_rate=whisper_sampling_rate)

# Get the speech chunks in the given audio buffer, and create a reduced
# audio buffer that contains only speech.
speech_chunks = get_speech_timestamps(audio_buffer)
vad_audio_buffer = collect_chunks(audio_buffer, speech_chunks)

# Transcribe the reduced audio buffer.
init_segments, _ = whisper_model.transcribe(vad_audio_buffer, language=language_code, beam_size=beam_size)

# Restore the true timestamps for the segments.
segments = restore_speech_timestamps(init_segments, speech_chunks, whisper_sampling_rate)
hoonlight commented 1 year ago

Lowering the window_size_samples value may help. In faster-whisper, the default is 1024, and you can choose between 512, 1024, and 1536.

https://github.com/snakers4/silero-vad/issues/322#issuecomment-1519015503
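For reference, this is roughly how the window size can be passed through transcribe(); the vad_parameters dict maps onto faster-whisper's VadOptions, and window_size_samples was an accepted field in the versions discussed in this thread (the model name and file name are placeholders):

```python
from faster_whisper import WhisperModel

model = WhisperModel("large-v2", device="cuda", compute_type="float16")

# vad_filter enables the Silero VAD pre-pass; vad_parameters overrides
# individual VadOptions fields such as window_size_samples.
segments, info = model.transcribe(
    "audio.wav",
    vad_filter=True,
    vad_parameters=dict(window_size_samples=512),  # default is 1024
)
```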

guillaumekln commented 1 year ago

The VAD model is also run on a single CPU core:

https://github.com/guillaumekln/faster-whisper/blob/e786e26f75f49b7d638412f3bf2b2b75a9c3c9e8/faster_whisper/vad.py#L254-L255

Can you try changing these values and see how they impact the performance?
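For context, the referenced lines configure the ONNX Runtime session for the Silero VAD model with single-threaded options, roughly like this (a sketch of the relevant part of vad.py at that commit; the surrounding code may differ):

```python
import onnxruntime

opts = onnxruntime.SessionOptions()
opts.inter_op_num_threads = 1  # raise these two values to let the
opts.intra_op_num_threads = 1  # VAD session use more CPU threads
opts.log_severity_level = 4

session = onnxruntime.InferenceSession(
    path,  # path to the bundled Silero VAD ONNX file
    providers=["CPUExecutionProvider"],
    sess_options=opts,
)
```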

phineas-pta commented 1 year ago

You can make VAD run on GPU:

  1. Install the dependencies:

pip uninstall onnxruntime
pip install onnxruntime-gpu

  2. Edit the code:

In vad.py, replace lines 253-262 with:

        opts = onnxruntime.SessionOptions()
        opts.log_severity_level = 4
        opts.graph_optimization_level = onnxruntime.GraphOptimizationLevel.ORT_ENABLE_BASIC
        # https://github.com/microsoft/onnxruntime/issues/11548#issuecomment-1158314424

        self.session = onnxruntime.InferenceSession(
            path,
            providers=["CUDAExecutionProvider"],
            sess_options=opts,
        )
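(If you try this, it may be worth confirming that the CUDA provider is actually active, e.g. by checking `self.session.get_providers()` after creating the session, since onnxruntime-gpu typically falls back to the CPU provider when the matching CUDA libraries are missing.)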
Purfview commented 1 year ago

Lowering the window_size_samples value may help.

I get faster speed with a higher value; is lower faster for you?

 512: VAD speed  58 audio seconds/s - removed 01:37.831 of audio
1024: VAD speed 107 audio seconds/s - removed 01:36.495 of audio
1536: VAD speed 134 audio seconds/s - removed 01:45.383 of audio

Not sure about precision either. 1024 included insignificantly more non-voice area than 1536, but 1536 excluded one voice line in a music/song section.

Can you try changing these values and see how they impact the performance?

No impact for me.

You can make VAD run on GPU

Could you benchmark VAD, CPU vs GPU?

phineas-pta commented 1 year ago

Do you have any benchmark code & data?

Purfview commented 1 year ago

No.

hoonlight commented 1 year ago

I get faster speed with a higher value; is lower faster for you?

After seeing your results, I tested it too, and it took longer for lower values of window_size_samples.

512: 23.8 seconds - 296 speech chunks
1024: 12.7 seconds - 288 speech chunks
1536: 10.9 seconds - 298 speech chunks

Not sure about precision either. 1024 included insignificantly more non-voice area than 1536, but 1536 excluded one voice line in a music/song section.

I'm not sure about the precision; I'll check it later.

benchmark code:

```python
import time
from typing import NamedTuple

from faster_whisper import vad, audio


class VadOptions(NamedTuple):
    threshold: float = 0.5
    min_speech_duration_ms: int = 250
    max_speech_duration_s: float = float("inf")
    min_silence_duration_ms: int = 2000
    window_size_samples: int = 1024
    speech_pad_ms: int = 400


decoded_audio = audio.decode_audio("test.mp4")

start = time.time()
speech_chunks_512 = vad.get_speech_timestamps(
    decoded_audio, vad_options=VadOptions(window_size_samples=512)
)
end = time.time()
duration_512 = end - start

start = time.time()
speech_chunks_1024 = vad.get_speech_timestamps(
    decoded_audio, vad_options=VadOptions(window_size_samples=1024)
)
end = time.time()
duration_1024 = end - start

start = time.time()
speech_chunks_1536 = vad.get_speech_timestamps(
    decoded_audio, vad_options=VadOptions(window_size_samples=1536)
)
end = time.time()
duration_1536 = end - start

print(f"512: {duration_512}", len(speech_chunks_512))
print(f"1024: {duration_1024}", len(speech_chunks_1024))
print(f"1536: {duration_1536}", len(speech_chunks_1536))
```
Purfview commented 1 year ago

I ran tests on various samples to see the effects of "1536" on transcriptions. I see fewer fallbacks, much better timestamps in some cases, and very positive effects on Demucs'ed files.

I made it default in r139.2.

iorilu commented 11 months ago

I ran tests on various samples to see the effects of "1536" on transcriptions. I see fewer fallbacks, much better timestamps in some cases, and very positive effects on Demucs'ed files.

I made it default in r139.2.

Does your application use Demucs now?

How do I use Demucs to preprocess audio?

Purfview commented 11 months ago

Does your application use Demucs now?

No. And I won't include it, since it uses PyTorch and that means gigabytes of additional files... EDIT: Or maybe I could, if PyInstaller can do hybrid onefile/onedir builds; then I could make torch an optional separate download...

How do I use Demucs to preprocess audio?

Read and ask there: https://github.com/facebookresearch/demucs

iorilu commented 11 months ago

Does your application use Demucs now?

No. And I won't include it, since it uses PyTorch and that means gigabytes of additional files... EDIT: Or maybe I could, if PyInstaller can do hybrid onefile/onedir builds; then I could make torch an optional separate download...

How do I use Demucs to preprocess audio?

Read and ask there: https://github.com/facebookresearch/demucs

I just checked Demucs; it can run on CPU, so you could make it run on CPU by default.

Purfview commented 11 months ago

Still, CPU-only torch would increase the current 70 MB .exe about 6 times... And while Demucs can have positive effects on accuracy, it can also have negative effects, like missing punctuation and wrong sentence separation on Demucs'ed files.

Currently I'm not interested in bundling it in.

ozancaglayan commented 10 months ago

A couple of comments based on personal experience:

guillaumekln commented 10 months ago

I think it's not very useful to measure the % of time used by the VAD. You should instead compare the total execution time with and without VAD.

The VAD can remove non-speech sections that would otherwise trigger the slow temperature fallback in Whisper. In that case, the total execution time is reduced even though the VAD takes X% of it.
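A minimal sketch of that comparison (model name and audio path are placeholders; note that transcribe() returns a lazy generator, so the segments have to be consumed for the timing to cover the actual transcription work):

```python
import time

from faster_whisper import WhisperModel

model = WhisperModel("large-v2", device="cuda", compute_type="float16")

def timed_transcribe(path, **kwargs):
    start = time.time()
    segments, _ = model.transcribe(path, **kwargs)
    list(segments)  # consume the generator; transcription happens lazily
    return time.time() - start

print("without VAD:", timed_transcribe("audio.wav"))
print("with VAD:   ", timed_transcribe("audio.wav", vad_filter=True))
```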

AvivSham commented 8 months ago

Hi all, we also see a performance degradation when using the vad_filter=True flag. Like others, we tried playing with the number of threads, without improvement. Is there any progress on enabling GPU support for the VAD model? Maybe you could add a different VAD model that is equally robust but more lightweight than the current one?

Thanks @guillaumekln!

Purfview commented 8 months ago

Maybe you could add a different VAD model that is equally robust but more lightweight than the current one?

But it's already lightweight and superfast.

Is there any progress on enabling GPU support for the VAD model?

People reported that there is no significant performance increase when running it on GPU.

AvivSham commented 8 months ago

Hi @Purfview, thank you for the fast response. When running the following code, it seems the overhead of adding VAD is not negligible.

import time

from faster_whisper import WhisperModel

files_list = [
    "/home/ec2-user/datasets/vad_debug/no_speech_1.wav",
    "/home/ec2-user/datasets/vad_debug/no_speech_2.wav",
    "/home/ec2-user/datasets/vad_debug/no_speech_3.wav",
    "/home/ec2-user/datasets/vad_debug/no_speech_4.wav",
]

model_size = "large-v2"

model = WhisperModel(model_size, device="cuda", compute_type="float16")

for f in files_list:
    t_i = time.time()
    segments, _ = model.transcribe(f, beam_size=5, language="fr")
    t_i = time.time() - t_i
    time.sleep(20)
    t_j = time.time()
    segments_vad, _ = model.transcribe(
        f,
        beam_size=5,
        vad_filter=True,
        vad_parameters=dict(min_silence_duration_ms=2000),
        language="fr",
    )
    t_j = time.time() - t_j
    print(t_j / t_i)

These are the prints of the above script:

With min_silence_duration_ms=2000:
File 1: 0.5270593472686265
File 2: 1.0318930571300973
File 3: 1.0178552937839627
File 4: 2.4939251070712145

When reducing min_silence_duration_ms to 200:
File 1: 0.5422778267655759
File 2: 1.0773890526952445
File 3: 1.083032817349901
File 4: 2.499190581616007

Note that the first 3 files are ~1 second long and the 4th is ~38 seconds long.

Any suggestions on how to make it faster for long files? @guillaumekln

Purfview commented 8 months ago

the overhead of adding VAD is not negligible

Obviously; why would anyone expect it to be negligible?

AvivSham commented 8 months ago

@Purfview let me clarify.

  1. Of course there will be overhead, but not one that more than doubles the runtime for a ~38-second file.
  2. In addition to (1): Whisper large-v2 has ~1.5B parameters, while Silero VAD has roughly 100K parameters.

Given the two points above, how can we make it run faster? And if there is such a difference in parameter count, why does it add so much overhead to the runtime?

@guillaumekln

Purfview commented 8 months ago

From the benchmarks posted in this thread you can see that VAD runs at 134 audio seconds per second, and that's on an ancient CPU.

You can use window_size_samples=1536 to make VAD faster.

...doubles the runtime for a ~38-second file.

But you don't measure the whole runtime in your code example. Btw, print(t_j / t_i) doesn't make sense; print(t_j - t_i) would give a meaningful measurement of the VAD overhead.

In addition to (1): Whisper large-v2 has...

You're not measuring large-v2's performance there.

AvivSham commented 8 months ago

We want to measure the performance as a percentage, which is why t_j / t_i is calculated.

You're not measuring large-v2's performance there.

What do you mean? Can you please suggest how to measure it correctly?

Purfview commented 8 months ago

We want to measure the performance as a percentage, which is why t_j / t_i is calculated.

Right now it shows something like a car's speed as a percentage of the coolant's flow rate. ;)

What do you mean? Can you please suggest how to measure it correctly?

You were told how to do it there -> https://github.com/guillaumekln/faster-whisper/issues/271

AvivSham commented 8 months ago

I forgot about that ;). Final question: is it possible to make the transcribe call faster besides providing the language? Did you benchmark the performance w.r.t. CPU threads? If running on GPU is insignificant, I think we can close this issue.

Purfview commented 8 months ago

Did you benchmark the performance w.r.t. CPU threads?

I didn't notice any impact when adjusting thread-related options.
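For reference, these are the thread-related knobs being referred to, sketched on the WhisperModel constructor (the values are illustrative; defaults may differ between versions):

```python
from faster_whisper import WhisperModel

# cpu_threads controls how many CPU threads CTranslate2 uses per worker;
# num_workers controls how many transcriptions can run in parallel.
model = WhisperModel(
    "large-v2",
    device="cpu",
    compute_type="int8",
    cpu_threads=8,
    num_workers=1,
)
```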