AlexandderGorodetski opened this issue 1 year ago
Lowering the `window_size_samples` value may help.
In faster-whisper, the default is 1024, and you can choose between 512, 1024, and 1536.
https://github.com/snakers4/silero-vad/issues/322#issuecomment-1519015503
The VAD model is also run on a single CPU core:
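(The snippet referenced here is presumably the session setup in vad.py; a sketch from memory of the code at the time, not an exact copy:)

```python
opts = onnxruntime.SessionOptions()
opts.inter_op_num_threads = 1   # pin the ONNX session to a single thread
opts.intra_op_num_threads = 1
opts.log_severity_level = 4

self.session = onnxruntime.InferenceSession(
    path,
    providers=["CPUExecutionProvider"],
    sess_options=opts,
)
```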
Can you try changing these values and see how they impact the performance?
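A minimal sketch of how these values can be changed, assuming the `vad_parameters` dict of this faster-whisper version accepts `window_size_samples`:

```python
from faster_whisper import WhisperModel

model = WhisperModel("large-v2", device="cuda", compute_type="float16")

# Try 512 / 1024 / 1536 and compare runtimes
segments, info = model.transcribe(
    "audio.wav",
    vad_filter=True,
    vad_parameters=dict(window_size_samples=1536),
)
for segment in segments:  # transcribe() is lazy; iterating runs the decode
    print(segment.start, segment.end, segment.text)
```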
You can make VAD run on GPU:

```
pip uninstall onnxruntime
pip install onnxruntime-gpu
```

Then in vad.py, replace lines 253-262 with:
```python
opts = onnxruntime.SessionOptions()
opts.log_severity_level = 4
opts.graph_optimization_level = onnxruntime.GraphOptimizationLevel.ORT_ENABLE_BASIC
# https://github.com/microsoft/onnxruntime/issues/11548#issuecomment-1158314424
self.session = onnxruntime.InferenceSession(
    path,
    providers=["CUDAExecutionProvider"],
    sess_options=opts,
)
```
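To verify the GPU provider is actually available, the standard onnxruntime check can be used:

```python
import onnxruntime

# Should include "CUDAExecutionProvider" after installing onnxruntime-gpu
print(onnxruntime.get_available_providers())
```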
> Lowering the `window_size_samples` value may help.
I get faster speed with a higher value; is lower faster for you?
512: VAD speed 58 audio seconds/s - removed 01:37.831 of audio
1024: VAD speed 107 audio seconds/s - removed 01:36.495 of audio
1536: VAD speed 134 audio seconds/s - removed 01:45.383 of audio
I'm not sure about precision either. 1024 included insignificantly more non-voice area than 1536, but 1536 excluded one voice line in a music/song area.
> Can you try changing these values and see how they impact the performance?

No impact for me.
> You can make VAD run on GPU

Could you benchmark VAD, CPU vs GPU?

Do you have any benchmark code & data?

No.
> I get faster speed with a higher value; is lower faster for you?

After seeing your results, I tested it too, and it took longer for lower values of `window_size_samples`.
512: 23.8 seconds - 296 speech chunks
1024: 12.7 seconds - 288 speech chunks
1536: 10.9 seconds - 298 speech chunks
> I'm not sure about precision either. 1024 included insignificantly more non-voice area than 1536, but 1536 excluded one voice line in a music/song area.

I'm not sure about the precision; I'll check it later.
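For reference, a rough sketch of how numbers like these can be reproduced, assuming the `VadOptions`/`get_speech_timestamps` helpers in `faster_whisper.vad` of this era:

```python
import time

from faster_whisper.audio import decode_audio
from faster_whisper.vad import VadOptions, get_speech_timestamps

audio = decode_audio("audio.wav", sampling_rate=16000)

for window in (512, 1024, 1536):
    start = time.perf_counter()
    chunks = get_speech_timestamps(audio, VadOptions(window_size_samples=window))
    elapsed = time.perf_counter() - start
    print(f"{window}: {elapsed:.1f} seconds - {len(chunks)} speech chunks")
```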
I did tests on various samples to see the effects of "1536" on transcriptions. I see fewer fallbacks, much better timestamps in some cases, and very positive effects on Demucs'ed files.
I made it the default in r139.2.
> I did tests on various samples to see the effects of "1536" on transcriptions. I see fewer fallbacks, much better timestamps in some cases, and very positive effects on Demucs'ed files.
> I made it the default in r139.2.

Does your application use Demucs now?

How do I use Demucs to preprocess audio?
> Does your application use Demucs now?

No. And I won't include it, as it uses PyTorch; that's gigabytes of additional files... EDIT: Or maybe I could, if PyInstaller can do hybrid onefile/onedir compiles; then I could make torch an optional separate download...

> How do I use Demucs to preprocess audio?

Read and ask there: https://github.com/facebookresearch/demucs
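For a quick idea of the preprocessing flow, Demucs' documented CLI can split out vocals, which you would then feed to faster-whisper (the output path depends on the model name, `htdemucs` by default):

```
pip install demucs
demucs --two-stems=vocals input.wav
# vocals land in separated/htdemucs/input/vocals.wav
```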
I just checked Demucs; it can run on CPU, so you could make it run on CPU by default.
Still, CPU-only torch would increase the current 70 MB .exe about 6 times... And while Demucs has positive effects on accuracy, it can have negative effects too, like missing punctuation and wrong separation of sentences on Demucs'ed files.
Currently I'm not interested in bundling it in.
A couple of comments from personal experience:

- The effect of `intra_op_num_threads` on CPU inference is limited. I get a slightly better runtime with 4 threads compared to 1, but more than 4 is basically useless in my case/CPU. It's not even a 2x speed-up with 4 threads set.
- `window_size_samples` is the easiest way to improve the speed, as there are fewer windows to process and forward-pass through the model.

I think it's not very useful to measure the % of time used by the VAD. You should instead compare the total execution time with and without VAD. The VAD can remove non-speech sections which would trigger the slow temperature fallback in Whisper; in that case, the total execution time is reduced even though the VAD took X% of this time.
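A minimal sketch of that comparison (note that the `segments` generator must be consumed, otherwise the decoding never runs):

```python
import time

from faster_whisper import WhisperModel

model = WhisperModel("large-v2", device="cuda", compute_type="float16")

for vad in (False, True):
    start = time.perf_counter()
    segments, _ = model.transcribe("audio.wav", vad_filter=vad)
    list(segments)  # transcribe() is lazy; this forces the full decode
    print(f"vad_filter={vad}: {time.perf_counter() - start:.1f} s total")
```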
Hi all,
We also see a degradation in performance when using the `vad_filter=True` flag. Like others, we also tried playing with the number of threads, without improvement. Is there any progress on enabling GPU support for the VAD model? Maybe you can add a different VAD model which is equally robust but more lightweight than the current one?
Thanks @guillaumekln!
> Maybe you can add a different VAD model which is equally robust but more lightweight than the current one?

But it's already lightweight and super fast.
> Is there any progress on enabling GPU support for the VAD model?

People reported no significant performance increase when running it on GPU.
Hi @Purfview, thank you for your fast response. When running the following code, it seems the overhead of adding VAD is not negligible:
```python
import time

from faster_whisper import WhisperModel

files_list = [
    "/home/ec2-user/datasets/vad_debug/no_speech_1.wav",
    "/home/ec2-user/datasets/vad_debug/no_speech_2.wav",
    "/home/ec2-user/datasets/vad_debug/no_speech_3.wav",
    "/home/ec2-user/datasets/vad_debug/no_speech_4.wav",
]

model_size = "large-v2"
model = WhisperModel(model_size, device="cuda", compute_type="float16")

for f in files_list:
    # Time transcription without VAD
    t_i = time.time()
    segments, _ = model.transcribe(f, beam_size=5, language="fr")
    t_i = time.time() - t_i

    time.sleep(20)

    # Time transcription with VAD
    t_j = time.time()
    segments_vad, _ = model.transcribe(
        f,
        beam_size=5,
        vad_filter=True,
        vad_parameters=dict(min_silence_duration_ms=2000),
        language="fr",
    )
    t_j = time.time() - t_j

    print(t_j / t_i)
```
These are the prints of the above script:

File 1: 0.5270593472686265
File 2: 1.0318930571300973
File 3: 1.0178552937839627
File 4: 2.4939251070712145

When reducing `min_silence_duration_ms` to 200:
File 1: 0.5422778267655759
File 2: 1.0773890526952445
File 3: 1.083032817349901
File 4: 2.499190581616007
Note that the first 3 files are ~1 second long and the 4th is ~38 seconds long.
Any suggestions on how to make it faster for long files? @guillaumekln
> the overhead of adding VAD is not negligible

Obviously. Why would anyone expect it to be negligible?
@Purfview, let me clarify:

1. The overhead of adding VAD is not negligible.
2. Whisper `large-v2` has ~1.5B parameters, while Silero VAD has roughly 100K parameters.

Given the two points above, how can we make it run faster? And if there is such a difference in parameter count, why does it add such overhead to the runtime?
@guillaumekln
From the benchmarks posted in this thread you can see that VAD runs at 134 audio seconds/s, and that's on an ancient CPU.
You can use `window_size_samples=1536` to make VAD faster.
> ...doubles the runtime for ~38 sec long file.

But you don't measure the whole runtime in your code example.

Btw, `print(t_j / t_i)` doesn't make sense; this -> `print(t_j - t_i)` will give a meaningful measurement of VAD performance.
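(The likely reason, per the issue linked further down: `transcribe()` returns a lazy generator, so the timed block above mostly measures the eager VAD pass rather than the decoding. A sketch of the fix:)

```python
t_j = time.time()
segments_vad, _ = model.transcribe(f, beam_size=5, vad_filter=True, language="fr")
_ = list(segments_vad)  # consume the generator so decoding is included in the timing
t_j = time.time() - t_j
```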
In addition to (1) -

> Whisper `large-v2` has ~1.5B parameters...

You don't measure `large-v2`'s performance there.
We want to measure the performance in percentage, therefore `t_j / t_i` is calculated.
> You don't measure `large-v2`'s performance there.

What do you mean? Can you please suggest how to measure it correctly?
> We want to measure the performance in percentage, therefore `t_j / t_i` is calculated.

Now it shows something like a car's speed as a percentage of the coolant's flow speed. ;)
> What do you mean? Can you please suggest how to measure it correctly?

You were told how to do it there -> https://github.com/guillaumekln/faster-whisper/issues/271
I forgot about that ;).
Final question: is it possible to make the `transcribe` call faster besides providing the `language`? Did you benchmark the performance w.r.t. CPU threads?
If running it on GPU is insignificant, I think we can close this issue.
> Did you benchmark the performance w.r.t. CPU threads?

I didn't notice any impact when adjusting options related to threads.
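(For reference, the thread-related knobs in question; a sketch assuming the `WhisperModel` constructor options of this version:)

```python
from faster_whisper import WhisperModel

model = WhisperModel(
    "large-v2",
    device="cpu",
    cpu_threads=4,   # intra-op threads used by CTranslate2
    num_workers=1,   # parallel workers for concurrent transcribe() calls
)
```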
Hello guys,
I am using VAD of faster whisper using following commands. I found that on TedLium benchmark transcribing VAD takes 8% of time and 92% takes transcribing. I would prefer to decrease time of VAD so that it will not take more than 1%. Is it somehow possible to optimize VAD procedure in terms of real time?? Maybe it is possible to run VAD on several CPU's? BTW, I see that VAD is running on CPU, is it possible to run it somehow on GPU?