ggerganov / whisper.cpp

Port of OpenAI's Whisper model in C/C++
MIT License

Can't get streaming example to work? #416

Closed asmith26 closed 1 year ago

asmith26 commented 1 year ago

Hi there,

I'm trying to get the streaming example to work, but I can't seem to (the quickstart works fine for me). I'm running:

$  make clean  # seems to work fine

$ make stream
I whisper.cpp build info: 
I UNAME_S:  Linux
I UNAME_P:  x86_64
I UNAME_M:  x86_64
I CFLAGS:   -I.              -O3 -std=c11   -fPIC -pthread -mavx -mavx2 -mfma -mf16c -msse3
I CXXFLAGS: -I. -I./examples -O3 -std=c++11 -fPIC -pthread
I LDFLAGS:  
I CC:       cc (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0
I CXX:      g++ (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0
cc  -I.              -O3 -std=c11   -fPIC -pthread -mavx -mavx2 -mfma -mf16c -msse3   -c ggml.c -o ggml.o
g++ -I. -I./examples -O3 -std=c++11 -fPIC -pthread -c whisper.cpp -o whisper.o
g++ -I. -I./examples -O3 -std=c++11 -fPIC -pthread examples/stream/stream.cpp ggml.o whisper.o -o stream `sdl2-config --cflags --libs`

$ ./stream -m ./models/ggml-base.en.bin -t 8 --step 500 --length 5000
init: found 1 capture devices:
init:    - Capture device #0: 'Built-in Audio Analogue Stereo'
init: attempt to open default capture device ...
init: obtained spec for input device (SDL Id = 2):
init:     - sample rate:       16000
init:     - format:            33056 (required: 33056)
init:     - channels:          1 (required: 1)
init:     - samples per frame: 1024
whisper_init_from_file: loading model from './models/ggml-base.en.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head  = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 512
whisper_model_load: n_text_head   = 8
whisper_model_load: n_text_layer  = 6
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 2
whisper_model_load: mem required  =  500.00 MB (+    6.00 MB per decoder)
whisper_model_load: kv self size  =    5.25 MB
whisper_model_load: kv cross size =   17.58 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: model ctx     =  140.60 MB
whisper_model_load: model size    =  140.54 MB

main: processing 8000 samples (step = 0.5 sec / len = 5.0 sec / keep = 0.2 sec), 8 threads, lang = en, task = transcribe, timestamps = 0 ...
main: n_new_line = 9, no_context = 1

 Hello.

 Hello
whisper_print_timings:     load time =   654.54 ms
whisper_print_timings:      mel time =   216.83 ms
whisper_print_timings:   sample time =    13.52 ms
whisper_print_timings:   encode time = 43760.50 ms / 7293.42 ms per layer
whisper_print_timings:   decode time = 143609.77 ms / 23934.96 ms per layer
whisper_print_timings:    total time = 189220.02 ms

Many thanks for any help, and for this awesome lib! :)

ggerganov commented 1 year ago

Does it work if you increase the --step argument to 3000 or 5000? Also, make sure your mic is enabled and capturing.
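For reference, a larger `--step` would just mean re-running the same command from above with the step increased, e.g. (model path and thread count taken from the original invocation):

```shell
# Re-run the stream example with a 3-second step instead of 0.5 s.
# A longer step gives each inference more audio context and avoids
# re-processing very small chunks.
./stream -m ./models/ggml-base.en.bin -t 8 --step 3000 --length 5000
```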

asmith26 commented 1 year ago

Thanks very much for your help @ggerganov. I can confirm that using --step 3000 (as well as 5000) seems to help, although now I seem to be getting a floating point exception:

$ ./stream -m ./models/ggml-tiny.en.bin -t 8 --step 5000 --length 5000
 Thank you for the music.
zsh: floating point exception (core dumped)  ./stream -m ./models/ggml-tiny.en.bin -t 8 --step 5000 --length 5000

Not sure if you've seen this before or know how to fix it? Thanks again for any help! :)

ggerganov commented 1 year ago

Should be fixed now - give it another try with make clean + make stream

asmith26 commented 1 year ago

Thanks very much for the very quick fix! Can confirm it works.

The only other problem I have is that the speed seems to be quite slow, but this is possibly due to my hardware (I welcome any tips that might help make it faster though). Thanks again!

ggerganov commented 1 year ago

You can try adding the argument -ac 768 and decreasing --step by a factor of 2. It should be about 2 times faster, but the quality may become worse.
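Putting the two suggestions together, the invocation might look like this (a sketch assuming the same base.en model and the --step 3000 value that worked above, halved to 1500):

```shell
# Restrict the encoder to the first 768 audio-context states (the full
# context is 1500) and halve the step from 3000 ms to 1500 ms: roughly
# 2x faster encoding, at some cost in transcription quality.
./stream -m ./models/ggml-base.en.bin -t 8 -ac 768 --step 1500 --length 5000
```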

joshoreefe commented 8 months ago

Streaming seems to get stuck after a few lines of transcription. What switches should I use to get short, simple stream chunks of just a handful of words at a time?