First, thanks for creating a fantastic project! I was looking for a way to run Whisper or some other speech-to-text model in realtime. I found several potential solutions but this one is clearly the best, especially for implementing custom applications on top.
I noticed that faster-whisper supports quantized models but RealtimeSTT currently doesn't expose that option. With `int8` quantization, models take up much less VRAM (or RAM, if run on CPU only). The quality of model output may suffer a little bit, but I think it's still a worthwhile optimization when memory is tight.
I have a laptop with an integrated NVIDIA GeForce MX150 GPU that only has 2GB VRAM. I was able to run the `small` model without problems (with `tiny` as the realtime model), but the `medium` and larger models gave a CUDA out of memory error.
To enable quantization, I tweaked the initialization of WhisperModel here https://github.com/KoljaB/RealtimeSTT/blob/f2c60530572dea97b6826423039e788fc7e72638/RealtimeSTT/audio_recorder.py#L367-L370 and here https://github.com/KoljaB/RealtimeSTT/blob/f2c60530572dea97b6826423039e788fc7e72638/RealtimeSTT/audio_recorder.py#L517-L520 by adding the parameter `compute_type='int8'`. This resulted in quantized models, and the `medium` model now fits on my feeble GPU; sadly, the `large-v2` model is still too big.
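For illustration, here is a minimal sketch of the tweak using faster-whisper's `WhisperModel` constructor (the model names and device are just examples, not the exact values used in audio_recorder.py):

```python
from faster_whisper import WhisperModel

# Main transcription model, quantized to int8 so it fits in 2GB of VRAM
main_model = WhisperModel(
    "medium",
    device="cuda",
    compute_type="int8",  # instead of the default precision
)

# Realtime model, quantized the same way
realtime_model = WhisperModel(
    "tiny",
    device="cuda",
    compute_type="int8",
)
```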
GPU VRAM requirements as reported by `nvidia-smi`, with and without quantization of the main model (the realtime model is always `tiny`, with the same quantization applied as for the main model):
| model | default | `int8` |
| --- | --- | --- |
| `tiny` | 542MiB | 246MiB |
| `base` | 914MiB | 278MiB |
| `small` | 1386MiB | 532MiB |
| `medium` | out of memory | 980MiB |
| `large-v2` | out of memory | out of memory |
This could be exposed as an additional parameter `compute_type` for `AudioToTextRecorder`, or possibly as two separate parameters, one for the realtime model and another for the main model. This parameter would then simply be passed as `compute_type` to the `WhisperModel`(s).
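To make the suggestion concrete, a hypothetical sketch of the two-parameter variant (the names `compute_type` and `realtime_compute_type` are just suggestions, not part of the current API):

```python
from RealtimeSTT import AudioToTextRecorder

# Hypothetical parameters; both would simply be forwarded to the
# corresponding WhisperModel constructors inside AudioToTextRecorder.
recorder = AudioToTextRecorder(
    model="medium",
    compute_type="int8",           # for the main transcription model
    realtime_compute_type="int8",  # for the realtime model
)
```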