The Whisper model processes the audio in chunks of 30 seconds - this is a hard constraint of the architecture.
However, what seems to work is that you can take, for example, 5 seconds of audio and pad it with 25 seconds of silence. This way you can process shorter chunks.
Given that, an obvious strategy for realtime audio transcription is the following:
T - [data]
-------------------
1 - [1 audio, 29 silence pad] -> transcribe -> "He"
2 - [2 audio, 28 silence pad] -> transcribe -> "Hello"
3 - [3 audio, 27 silence pad] -> transcribe -> "Hello, my"
...
29 - [29 audio, 1 silence pad] -> transcribe -> "Hello, my name is John ..."
The problem with that is you need to do the same amount of computation for 1 second audio as you would do for 2, 3, ... , 30 seconds of audio. So if your audio input step is 1 second (as shown in the example above), you will effectively do 30 times the computation that you would normally do to process the full 30 seconds.
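For reference, here is a minimal sketch of the padding idea against the whisper.h C API (a sketch only - it assumes whisper_init_from_file / whisper_full are available in your checkout, and the model path and the 5-second capture are placeholders):

#include "whisper.h"
#include <cstdio>
#include <vector>

int main() {
    // illustrative model path
    struct whisper_context * ctx = whisper_init_from_file("models/ggml-base.en.bin");
    if (ctx == nullptr) return 1;

    // suppose we captured 5 seconds of 16 kHz mono float PCM from the mic
    std::vector<float> pcm(5*WHISPER_SAMPLE_RATE, 0.0f);

    // pad with silence up to the 30-second window the model expects
    pcm.resize(30*WHISPER_SAMPLE_RATE, 0.0f);

    whisper_full_params params = whisper_full_default_params(WHISPER_SAMPLING_GREEDY);
    params.print_progress = false;

    if (whisper_full(ctx, params, pcm.data(), (int) pcm.size()) == 0) {
        for (int i = 0; i < whisper_full_n_segments(ctx); ++i) {
            printf("%s\n", whisper_full_get_segment_text(ctx, i));
        }
    }

    whisper_free(ctx);
    return 0;
}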
I plan to add a basic example of real-time audio transcription using the above strategy.
I was a bit afraid that would be the answer, but I'll definitely check out that basic example when it's ready!
Just added a very naive implementation of the idea above. To run it, simply do:
# install sdl2 (Ubuntu)
$ sudo apt-get install libsdl2-dev
# install sdl2 (Mac OS)
$ brew install sdl2
# download a model if you don't have one
$ ./download-ggml-model.sh base.en
# run the real-time audio transcription
$ make stream
$ ./stream -m models/ggml-base.en.bin
This example continuously captures audio from the mic and runs whisper on the captured audio. The time step is currently hardcoded at 3 seconds.
The results are not great because the current implementation can chop the audio in the middle of words. Also, the text context is reset for every new iteration.
However, all these things can be significantly improved. Probably we need to add some sort of simple VAD as a preprocessing step.
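One way to reduce the mid-word chopping (a rough sketch, not what the current example does - similar in spirit to the --keep option the stream example exposes later in this thread; sizes are illustrative):

#include <vector>

// Carry the tail of the previous chunk into the next one so that words cut at
// a chunk boundary get a second chance. 200 ms at 16 kHz is an arbitrary choice.
static const int kSampleRate = 16000;
static const int kKeepMs     = 200;

std::vector<float> make_next_chunk(const std::vector<float> & prev_chunk,
                                   const std::vector<float> & new_audio) {
    const size_t n_keep = (size_t) kSampleRate * kKeepMs / 1000;

    std::vector<float> chunk;
    if (prev_chunk.size() > n_keep) {
        chunk.assign(prev_chunk.end() - n_keep, prev_chunk.end());
    }
    chunk.insert(chunk.end(), new_audio.begin(), new_audio.end());
    return chunk;
}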
Here is a short video demonstration of near real-time transcription from the microphone:
https://user-images.githubusercontent.com/1991296/193465125-c163d304-64f6-4f5d-83e5-72239c9a203e.mp4
Nice, but somehow I can't kill it (stream); I had to do a killall -9 stream. On an AMD 2-thread 3 GHz processor with 16 GB RAM there is significant delay. However, I found that I get 2x realtime with the usual transcription of an audio file. Great work, I love this.
Thanks for the feedback. I just pushed a fix that should handle Ctrl+C correctly (it can take a few seconds to respond though).
Regarding the performance - I hope it can be improved with a better strategy to decide when to perform the inference. Currently, it is done every X seconds, regardless of the data. If we add voice detection, we should be able to run it less often. But overall, it seems that real-time transcription will always be slower compared to the original 30-second chunk transcription.
Thanks for the quick fix. I have some suggestions/ideas for faster voice transcription. Give me half an hour to an hour, and I'll update here with new content.
Edit / Updated:
Here are some ideas to speed up offline non real time transcription:
Removing silence helps a lot in reducing total time of audio (not yet tried but obvious):
http://andrewslotnick.com/posts/speeding-up-a-speech.html#Remove-Silence
Things that I tried with good results:
First I ran a half-hour audio file through the https://github.com/xiph/rnnoise code. Then I increased the tempo to 1.5x with sox (tempo preserves pitch). After that I got good results with tiny.en, but base.en seemed to be less accurate. The overall process is much faster - really fast transcription except for the initial delay.
cd /tmp
./rnnoise_demo elon16.wav elon16.raw
sox -c 1 -r 16000 -b 16 --encoding signed-integer elon16.raw elon16_denoised.wav
sox elon16_denoised.wav elonT3.wav tempo 1.5
./main -m models/ggml-tiny.en.bin -f /tmp/elonT3.wav
Here are some ideas for faster real time transcription:
I noticed that when I ran this on a 5 sec clip, I got this result:
./main -m models/ggml-tiny.en.bin -f /tmp/rec.wav
log_mel_spectrogram: recording length: 5.015500 s
...
main: processing 80248 samples (5.0 sec), 2 threads, lang = english, task = transcribe, timestamps = 1 ...
[00:00.000 --> 00:05.000] Okay, this is a test. I think this will work out nicely.
[00:05.000 --> 00:10.000] [no audio]
...
main: total time = 18525.62 ms
Now if we could apply this:
1. VAD / silence detection (like you mentioned), splitting into chunks. The result is variable-length audio chunks in memory or temp files.
2. Remove noise with rnnoise on the chunks.
3. Speed up each chunk by 1.5x, preserving pitch (the speed-up should just be an option; I learned that anything above 1.5x gives bad results unless the voice is loud, clear and slow to start with - 1.5x is safe, ideal is 1.1-1.5x, max 2x).
4. Since we know exactly how long the sped-up chunk is, we won't need to wait for transcription to finish...
Example:
[00:00.000 --> 00:05.000] Okay, this is a test. I think this will work out nicely. <--- We could kill it right here (because this is the total length of the file / chunk I had as an example)
[00:05.000 --> 00:10.000] [no audio] <-- This is processing on an empty buffer; when killed, it would not waste processing
VAD: https://github.com/cirosilvano/easyvad or maybe use webrtc vad?
I guess experimentation is needed to figure out the best strategy / approach to real time, considering the 30-seconds-at-once issue.
Some improvement on the real-time transcription:
https://user-images.githubusercontent.com/1991296/194935793-76afede7-cfa8-48d8-a80f-28ba83be7d09.mp4
I'll check this out and give you feedback here tomorrow. Awesome work! Brilliant.
Hello @ggerganov, thanks for sharing! The offline main mode tested here on Windows worked fine.
~Any small tip on including SDL to make the real-time app work?~
On resource-constrained machines it doesn't seem to be better. The previous version worked for transcribing; this one is choking the CPU with no or only intermittent output. The same kill issue persists - I think it's because processes are spawned and that makes the system laggy.
@ggerganov I also caught a Floating point exception (core dumped) playing around with the options -t 2 --step 5000 --length 5000:
./stream -m ./models/ggml-tiny.en.bin -t 2 --step 5000 --length 5000
audio_sdl_init: found 2 capture devices:
audio_sdl_init: - Capture device #0: 'Built-in Audio'
audio_sdl_init: - Capture device #1: 'Built-in Audio Analog Stereo'
audio_sdl_init: attempt to open default capture device ...
audio_sdl_init: obtained spec for input device (SDL Id = 2):
audio_sdl_init: - sample rate: 16000
audio_sdl_init: - format: 33056 (required: 33056)
audio_sdl_init: - channels: 1 (required: 1)
audio_sdl_init: - samples per frame: 1024
whisper_model_load: loading model from './models/ggml-tiny.en.bin'
whisper_model_load: n_vocab = 51864
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 384
whisper_model_load: n_audio_head = 6
whisper_model_load: n_audio_layer = 4
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 384
whisper_model_load: n_text_head = 6
whisper_model_load: n_text_layer = 4
whisper_model_load: n_mels = 80
whisper_model_load: f16 = 1
whisper_model_load: type = 1
whisper_model_load: mem_required = 244.00 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: ggml ctx size = 84.99 MB
whisper_model_load: memory size = 11.41 MB
whisper_model_load: model size = 73.54 MB
main: processing 80000 samples (step = 5.0 sec / len = 5.0 sec), 2 threads, lang = en, task = transcribe, timestamps = 0 ...
Floating point exception (core dumped)
But I think this not working on resource-constrained devices should not be a blocker for you. If it works for everyone else, please feel free to close.
I think that floating point exception might be related to #39 as well, which was running on a 4-core AMD64 Linux server, not too resource-constrained.
The stream example should be updated to detect whether it is able to process the incoming audio stream in real time and print a warning or error if it is not. Otherwise, it will behave in an undefined way.
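A rough sketch of what such a check could look like (illustrative only, not the actual stream.cpp code): time one inference pass and compare it against the audio step size.

#include <chrono>
#include <cstdio>

// run_inference stands in for whatever the stream example actually calls for one step.
template <typename F>
void run_step_with_watchdog(F run_inference, double step_ms) {
    const auto t0 = std::chrono::steady_clock::now();
    run_inference();
    const auto t1 = std::chrono::steady_clock::now();

    const double t_ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
    if (t_ms > step_ms) {
        fprintf(stderr,
                "WARNING: processing took %.0f ms for a %.0f ms audio step - "
                "cannot keep up with real time\n", t_ms, step_ms);
    }
}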
Also mentioning this here since it would be a super cool feature to have: any way to register a callback or call a script once user speech is completed and silence/non-speech is detected? I've been trying to hack on the C++ code, but my C++ skills are rusty :(
@pachacamac Will think about adding this option. Silence/non-speech detection is not trivial in general, but a simple thresholding approach that works in a quiet environment should not be too difficult to implement.
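For illustration, a naive energy-threshold detector along those lines, with a callback fired once speech is followed by enough silent windows (the threshold and window counts are made up and would need tuning for a real microphone):

#include <functional>
#include <vector>

// Treat a window as silent when its mean energy falls below a fixed threshold.
bool is_silence(const std::vector<float> & pcm, float energy_thold = 0.001f) {
    if (pcm.empty()) return true;
    double energy = 0.0;
    for (float s : pcm) energy += (double) s*s;
    return energy/pcm.size() < energy_thold;
}

// Fire a user-supplied callback after speech has been heard and a few
// consecutive windows of silence follow it.
struct speech_end_detector {
    int  silent_windows   = 0;
    bool heard_speech     = false;
    int  windows_required = 3; // e.g. 3 x 500 ms windows of silence

    void feed(const std::vector<float> & window, const std::function<void()> & on_speech_end) {
        if (!is_silence(window)) {
            heard_speech   = true;
            silent_windows = 0;
        } else if (heard_speech && ++silent_windows >= windows_required) {
            on_speech_end();
            heard_speech   = false;
            silent_windows = 0;
        }
    }
};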
Hi, I am trying to run real-time transcription on the Raspberry Pi 4B, with a ReSpeaker Mic array. Is there any way to specify the audio input device when running ./stream?
I was able to specify the default input device through /etc/asound.conf:
pcm.!default {
type asym
playback.pcm {
type plug
slave.pcm "hw:0,0"
}
capture.pcm {
type plug
slave.pcm "hw:1,0"
}
}
Curious if you have any luck getting real-time transcription to work on a Pi 4. Mine seems to run just a little too slow to give useful results, even with the tiny.en model.
Hi @alexose and @RyanSelesnik:
Have you had any success using the ReSpeaker 4 Mic Array (UAC1.0) to run the stream example on a Raspberry Pi? My system config is:
ubuntu@ubuntu:~/usrlib/whisper.cpp$ ./stream -m ./models/ggml-tiny.en.bin -t 8 --step 500 --length 5000 -c 0
audio_sdl_init: found 1 capture devices:
audio_sdl_init: - Capture device #0: 'ReSpeaker 4 Mic Array (UAC1.0), USB Audio'
audio_sdl_init: attempt to open capture device 0 : 'ReSpeaker 4 Mic Array (UAC1.0), USB Audio' ...
audio_sdl_init: obtained spec for input device (SDL Id = 2):
audio_sdl_init: - sample rate: 16000
audio_sdl_init: - format: 33056 (required: 33056)
audio_sdl_init: - channels: 1 (required: 1)
audio_sdl_init: - samples per frame: 1024
whisper_model_load: loading model from './models/ggml-tiny.en.bin'
whisper_model_load: n_vocab = 51864
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 384
whisper_model_load: n_audio_head = 6
whisper_model_load: n_audio_layer = 4
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 384
whisper_model_load: n_text_head = 6
whisper_model_load: n_text_layer = 4
whisper_model_load: n_mels = 80
whisper_model_load: f16 = 1
whisper_model_load: type = 1
whisper_model_load: mem_required = 390.00 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: ggml ctx size = 73.58 MB
whisper_model_load: memory size = 11.41 MB
whisper_model_load: model size = 73.54 MB
main: processing 8000 samples (step = 0.5 sec / len = 5.0 sec), 8 threads, lang = en, task = transcribe, timestamps = 0 ...
main: n_new_line = 9
[BLANK_AUDIO]
main: WARNING: cannot process audio fast enough, dropping audio ...
but using whisper on prerecorded audio with the same ReSpeaker device, whisper worked well:
sudo arecord -f S16_LE -d 10 -r 16000 --device="hw:1,0" /tmp/test-mic.wav
./main -m models/ggml-tiny.en.bin -f /tmp/test-mic.wav
Any suggestions to test ./stream properly? Cheers, AR
@andres-ramirez-duque I haven't had any luck in getting the streaming functionality to run fast enough on aarch64 with a Pi 4B 2GB. I've tried compiling with various flags (-Ofast) and trying various step lengths, thread counts, etc.
I'm not good enough with C++ to know where to start optimizing, but I suspect the comment in PR #23 sheds some light on the issue:
On Arm platforms without __ARM_FEATURE_FP16_VECTOR_ARITHMETIC we convert to 32-bit floats. There might be a more efficient way, but this is good for now.
As well as the notes on optimization from @trholding and @ggerganov above.
So the rule-of-thumb for using the stream example is to first run the bench tool using the model that you want to try. For example:
$ make bench
$ ./bench models/ggml-tiny.en
whisper_model_load: loading model from 'models/ggml-tiny.en.bin'
whisper_model_load: n_vocab = 51864
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 384
whisper_model_load: n_audio_head = 6
whisper_model_load: n_audio_layer = 4
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 384
whisper_model_load: n_text_head = 6
whisper_model_load: n_text_layer = 4
whisper_model_load: n_mels = 80
whisper_model_load: f16 = 1
whisper_model_load: type = 1
whisper_model_load: mem_required = 390.00 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: ggml ctx size = 73.58 MB
whisper_model_load: memory size = 11.41 MB
whisper_model_load: model size = 73.54 MB
system_info: n_threads = 4 / 10 | AVX2 = 0 | AVX512 = 0 | NEON = 1 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 |
whisper_print_timings: load time = 103.94 ms
whisper_print_timings: mel time = 0.00 ms
whisper_print_timings: sample time = 0.00 ms
whisper_print_timings: encode time = 174.70 ms / 43.67 ms per layer
whisper_print_timings: decode time = 0.00 ms / 0.00 ms per layer
whisper_print_timings: total time = 278.77 ms
Note down the encode time. In this case, it is 174 ms.
Your --step parameter for the stream tool should be at least 2x the encode time. So in this case: --step 350.
If the step is smaller, then it is very likely that the processing will be slower compared to the audio capture and it won't work.
Overall, streaming on Raspberries is a long shot. Maybe when they start supporting FP16 arithmetic (i.e. the ARMv8.2 instruction set) it could make sense.
Yes, makes sense.
For those of us trying to make this work on a cheap single-board computer, we'll probably want to use something like a Banana Pi BPI M5 (which is form-factor compatible with the Pi 4 but ships with a Cortex A55).
Hi, when I use the stream feature, it throws an error: found 0 capture devices. The whole error output is the following:
audio_sdl_init: found 0 capture devices:
audio_sdl_init: attempt to open default capture device ...
ALSA lib confmisc.c:767:(parse_card) cannot find card '0'
ALSA lib conf.c:4732:(_snd_config_evaluate) function snd_func_card_driver returned error: No such file or directory
ALSA lib confmisc.c:392:(snd_func_concat) error evaluating strings
ALSA lib conf.c:4732:(_snd_config_evaluate) function snd_func_concat returned error: No such file or directory
ALSA lib confmisc.c:1246:(snd_func_refer) error evaluating name
ALSA lib conf.c:4732:(_snd_config_evaluate) function snd_func_refer returned error: No such file or directory
ALSA lib conf.c:5220:(snd_config_expand) Evaluate error: No such file or directory
ALSA lib pcm.c:2642:(snd_pcm_open_noupdate) Unknown PCM default
audio_sdl_init: couldn't open an audio device for capture: ALSA: Couldn't open audio device: No such file or directory!
whisper_model_load: loading model from './models/ggml-base.en.bin'
whisper_model_load: n_vocab = 51864
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 512
whisper_model_load: n_text_head = 8
whisper_model_load: n_text_layer = 6
whisper_model_load: n_mels = 80
whisper_model_load: f16 = 1
whisper_model_load: type = 2
whisper_model_load: mem_required = 506.00 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: ggml ctx size = 140.60 MB
whisper_model_load: memory size = 22.83 MB
whisper_model_load: model size = 140.54 MB
main: processing 8000 samples (step = 0.5 sec / len = 5.0 sec), 8 threads, lang = en, task = transcribe, timestamps = 0 ...
main: n_new_line = 9
[Start speaking]
Any idea how to solve this problem? Thanks in advance!
I saw the same problem when I used an AWS cloud server.
I'm guessing this was a whisper quirk, and not an issue with the stream script combining the output?
y propiedades se clasifican en vitamina A, B1, B2, B3, B4, B5, B6, B7, B8, B9, B10, B11, B12, B13, B14, B15, B16, B17, B17, B18, B19, B20, B20, B20, B20, B20, (...)
Thank you, my dear Dodi, my dear, my dear, my dear, my dear, my dear, my dear, my dear, my dear, my dear, (...)
I haven't noticed this behaviour with non-streaming usage.
stream looks for a capture device (i.e. a microphone) and didn't find one, hence the "found 0 capture devices" message.
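To check what SDL itself sees on your machine, here is a small standalone sketch using the SDL2 capture-device enumeration calls (separate from whisper.cpp): if this also prints zero devices, the problem is the ALSA/PulseAudio setup or a headless server, not the stream example.

#include <SDL.h>
#include <cstdio>

int main() {
    if (SDL_Init(SDL_INIT_AUDIO) != 0) {
        fprintf(stderr, "SDL_Init failed: %s\n", SDL_GetError());
        return 1;
    }

    // second argument = 1 selects capture (recording) devices
    const int n = SDL_GetNumAudioDevices(1);
    printf("found %d capture devices:\n", n);
    for (int i = 0; i < n; ++i) {
        printf("  - #%d: '%s'\n", i, SDL_GetAudioDeviceName(i, 1));
    }

    SDL_Quit();
    return 0;
}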
When I build the stream file, I get the following error - how can I fix it?
I entered "make clean" and that resolved my problem.
Hi! Were you able to use WebRTC stream as an input to stream.cpp? I was trying to hack a webRTC input stream into the stream.cpp code but it's not clear to me how I should go about buffering it before passing it on to Whisper. The buffering logic you used in common-sdl.cpp seems to be very intertwined with SDL. Any help would be appreciated!
There is a bug, I think:
-l auto is rejected: whisper_lang_id: unknown language 'auto'
However, it works with main, so it is implemented! I need this since, as German speakers, we often mix English and German.
How can I use the real-time function on the Android platform?
Would it also be possible to allow input from playback devices instead of recording devices? I've tried fiddling around with the code myself but I couldn't get it to work. Would be very cool if this was supported natively though!
@SammyWhamy I don't think this is possible, you need to send the audio to 2 playback devices using something like BlackHole or another audio loopback driver.
@johtso Oh, does SDL not support capturing audio from a playback device?
Life Hack: Just use an audio cable and put one end in the audio output port of your computer and the other end in the audio input port of your computer ;-)
I've been trying to get this working with the Arch Linux AUR whisper.cpp installation, currently getting this error:
error: failed to open 'stream' as WAV file
error: failed to read WAV file 'stream'
Hi @ggerganov, which config did you use in your demo video? It seems that with a MacBook Pro (M1 Pro) I can't reproduce anything close in speed.
I used models/ggml-base.bin though, keeping English as the spoken language - that may explain why?
Side question: if I want the fastest real-time transcription (lowest RTF) for any language, which (reasonably cheap) configuration would you advise? I don't mind offloading the computational part to a dedicated external server and sending the data packets back to the client.
Is it possible to run this stream.cpp tool on iOS?
Hello, I'm trying to run this on an M2.
I tried everything I can think of and it doesn't work.
First of all doing
(py310-whisper) ➜ whisper.cpp git:(master) ✗ make stream
make: *** No rule to make target stream'. Stop.
It doesn't work. If I cd into examples/stream:
(py310-whisper) ➜ stream git:(master) ✗ make
(py310-whisper) ➜ stream git:(master) ✗ ./stream -m ../../models/ggml-large.bin
zsh: no such file or directory: ./stream
(py310-whisper) ➜ stream git:(master) ✗ pwd
/Users/adrian/Developer/whisper.cpp/examples/stream
I can't make it work either.
Is running the large model unreasonable? What alternative is there if I speak 2 languages, English and Spanish?
Should I use WHISPER_COREML=1 make -j?
I got your first error when I tried to run make stream in the wrong folder. Maybe you did pick the right folder, but somehow something was missing.
Dear all experts here,
I am able to run the whisper.cpp live stream on my Raspberry Pi 4 using the following commands:
git clone https://github.com/ggerganov/whisper.cpp
./models/download-ggml-model.sh tiny.en
make -j stream && ./stream -m models/ggml-tiny.en.bin --step 4000 --length 8000 -c 0 -t 4 -ac 512
The question is: is it possible to get those words into text format in Python?
I'd appreciate it if someone could help!
You can maybe use the -f option (if necessary, see ./stream --help) to output the transcription to a file.
From there you can use a script to do whatever you want with the output from the file.
If you want to do this continuously with the stream, you could consider a while loop that applies some function to the file and checks it for updates. I'm not an expert, but ChatGPT told me that on Linux, inotify can watch a file for updates.
I've managed to compile and run this with the OpenVINO backend. Only required adding 1 line to initialize the OpenVINO encoder and some additional flags during compilation to build both OpenVINO and stream.
quick and dirty setup:
Add the following line to stream.cpp after initializing the whisper context:
whisper_ctx_init_openvino_encoder(ctx, nullptr, "GPU", nullptr);
Then make sure you follow all the steps to install OpenVINO, then to build:
mkdir build
cd build
cmake -DWHISPER_OPENVINO=1 -DWHISPER_BUILD_EXAMPLES=1 -DWHISPER_SDL2=1 ..
The -ac flag no longer works with this method, but I am now able to transcribe my meetings with the small.en model running on a laptop with just Intel integrated graphics.
I found this Swift implementation of streaming: https://github.com/leetcode-mafia/cheetah/blob/b7e301c0ae16df5c597b564b2126e10e532871b2/LibWhisper/stream.cpp, with a Swift file inside a Swift project. It's CC0-licensed.
I couldn't tell whether it uses the right configuration to benefit from the latest Metal/Core ML performance improvements.
usage: ./stream [options]
options:
  -h,       --help           [default] show this help message and exit
  -t N,     --threads N      [4      ] number of threads to use during computation
            --step N         [3000   ] audio step size in milliseconds
            --length N       [10000  ] audio length in milliseconds
            --keep N         [200    ] audio to keep from previous step in ms
  -c ID,    --capture ID     [-1     ] capture device ID
  -mt N,    --max-tokens N   [32     ] maximum number of tokens per audio chunk
  -ac N,    --audio-ctx N    [0      ] audio context size (0 - all)
  -vth N,   --vad-thold N    [0.60   ] voice activity detection threshold
  -fth N,   --freq-thold N   [100.00 ] high-pass frequency cutoff
  -su,      --speed-up       [false  ] speed up audio by x2 (reduced accuracy)
  -tr,      --translate      [false  ] translate from source language to english
  -nf,      --no-fallback    [false  ] do not use temperature fallback while decoding
  -ps,      --print-special  [false  ] print special tokens
  -kc,      --keep-context   [false  ] keep context between audio chunks
  -l LANG,  --language LANG  [en     ] spoken language
  -m FNAME, --model FNAME    [models/ggml-base.en.bin] model path
  -f FNAME, --file FNAME     [       ] text output file name
  -tdrz,    --tinydiarize    [false  ] enable tinydiarize (requires a tdrz model)
Hi, I am sometimes confused about what each argument does. Can someone who understands explain what these mean:
-kc, --keep-context [false ] keep context between audio chunks
--keep N [200 ] audio to keep from previous step in ms
-ac N, --audio-ctx N [0 ] audio context size (0 - all)
Or is there a link or file I can read myself to understand those parameters better? Thank you so much!
Were you able to implement any of these ideas (the denoising / tempo / VAD suggestions from earlier in the thread)? Are there significant performance improvements?
Hi, I just tried the following command:
make stream
./stream -m ./models/ggml-base.en.bin -t 8 --step 500 --length 5000
And it works, but the result I get is far behind the demo video: it just gets stuck on the first sentence and keeps updating it instead of adding new sentences.
silero-vad looks best for VAD, but I don't know how to port it to Swift yet - it ships as an ONNX model and a Python notebook.
Edit: found a Swift port: https://github.com/tangfuhao/Silero-VAD-for-iOS
After making my model using Core ML, I got this error while trying to build stream:
$ make stream
I whisper.cpp build info:
I UNAME_S: Darwin
I UNAME_P: arm
I UNAME_M: arm64
I CFLAGS: -I. -O3 -DNDEBUG -std=c11 -fPIC -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -pthread -DGGML_USE_ACCELERATE -DGGML_USE_METAL
I CXXFLAGS: -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -pthread -DGGML_USE_METAL
I LDFLAGS: -framework Accelerate -framework Foundation -framework Metal -framework MetalKit
I CC: Apple clang version 15.0.0 (clang-1500.0.40.1)
I CXX: Apple clang version 15.0.0 (clang-1500.0.40.1)
c++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -pthread -DGGML_USE_METAL examples/stream/stream.cpp examples/common.cpp examples/common-ggml.cpp examples/common-sdl.cpp ggml.o ggml-alloc.o ggml-backend.o ggml-quants.o whisper.o ggml-metal.o -o stream `sdl2-config --cflags --libs` -framework Accelerate -framework Foundation -framework Metal -framework MetalKit
ld: Undefined symbols:
_whisper_coreml_encode, referenced from:
whisper_build_graph_conv(whisper_context&, whisper_state&, int) in whisper.o
_whisper_coreml_free, referenced from:
_whisper_free_state in whisper.o
_whisper_coreml_init, referenced from:
_whisper_init_state in whisper.o
clang: error: linker command failed with exit code 1 (use -v to see invocation)
make: *** [stream] Error 1
Noting that the processing time is considerably shorter than the length of the speech, is it possible to feed the models real-time microphone output? Or does the inference run on the complete audio stream, instead of sample by sample?
This would greatly reduce the latency for voice assistants and the like, since the audio would not need to be fully captured before being fed to the models. Basically the same as I did here with SODA: https://github.com/biemster/gasr, but with an open source and multilang model.