ggerganov / whisper.cpp

Port of OpenAI's Whisper model in C/C++
MIT License

is it possible to run openai-whisper ggml model on raspberry pi hardware? #7

Closed nyadla-sys closed 1 year ago

nyadla-sys commented 1 year ago

Is it possible to run this ggml model on Raspberry Pi hardware?

nyadla-sys commented 1 year ago

@ggerganov could you please help with this?

ggerganov commented 1 year ago

It will probably work - why don't you give it a try?

ggerganov commented 1 year ago

Good news!

I just tried it on a Raspberry Pi 4 Model B from 2018 and it works!

The tiny.en model takes 140 sec to transcribe a 30 sec audio, but I think this can be improved, because I disabled all SIMD instructions to make it compile. I will improve this in the following days.

If you want to try it, use the raspberry branch:

git clone https://github.com/ggerganov/whisper.cpp
cd whisper.cpp
git checkout raspberry
make tiny.en
nyadla-sys commented 1 year ago

I don't currently have a Raspberry Pi board, but I will run as soon as I get one.

ggerganov commented 1 year ago

You can try running it on whatever Raspberry you have - use the same instructions.

nyadla-sys commented 1 year ago

@ggerganov Thanks, and I appreciate your quick response

ggerganov commented 1 year ago

Some more experiments - enabling NEON instructions reduces the time all the way down to just ~15 seconds to process a 30 second audio.

nyadla-sys commented 1 year ago

this is awesome

nyadla-sys commented 1 year ago

@ggerganov is it possible to do the audio streaming on Raspberry Pi and live convert it to captions?

WilliamTambellini commented 1 year ago

@ggerganov perhaps have a look at https://github.com/oneapi-src/oneDNN - it would resolve all the compilation issues by letting oneDNN do the optimizations for the local CPU at runtime, whatever the CPU model/brand.

nyadla-sys commented 1 year ago

@ggerganov On a Linux computer, I tried the following commands, and streaming performed as expected.

$ git clone https://github.com/ggerganov/whisper.cpp.git
$ bash ./download-ggml-model.sh tiny.en
$ sudo apt-get install libsdl2-dev
$ make
$ make stream -lSDL2
$ ./stream -m models/ggml-tiny.en.bin

nyadla-sys commented 1 year ago

@ggerganov Used the following command to run a stream on a Raspberry Pi 4, but its decoding speed is slow (performance is poor). Perhaps further improvements are needed (currently, inference on each 30 seconds of audio takes around 15 seconds to execute).

./stream -m models/ggml-tiny.en.bin

ggerganov commented 1 year ago

@nyadla-sys The performance can be improved if the CPU supports the ARMv8.2 architecture - it provides 16-bit floating-point vector arithmetic. The whisper.cpp implementation already supports this, so you just need the right hardware.

Based on this table, you need a device with a Cortex-A75 CPU:

https://en.wikipedia.org/wiki/Comparison_of_Armv8-A_processors

From a quick Google search, none of the existing Raspberry Pi products come with this processor.

There are rumours that Raspberry Pi 5 will use ARM Cortex-A75 or ARM Cortex-A76 so if that is the case, you should definitely give it a try. I expect the performance to be much better.
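
If you want to check whether a given CPU exposes the ARMv8.2 half-precision extensions, the kernel feature flags are a quick indicator - a rough sketch, assuming a 64-bit Linux kernel, where the fphp / asimdhp flags are the ones that matter:

# print the FP16-related feature flags, if present (assumes a 64-bit kernel)
grep -o -E 'fphp|asimdhp' /proc/cpuinfo | sort -u
# FP16 vector arithmetic can then be enabled at compile time with, e.g.:
# gcc -march=armv8.2-a+fp16 ...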

nyadla-sys commented 1 year ago

@ggerganov is it possible to convert the ggml model from fp16 to int8 activations and int8 weights?

ggerganov commented 1 year ago

8-bit is not supported yet - maybe in the future

trholding commented 1 year ago

Do you need me to test this on a raspi-zero? I bet it would be very very slow.

ggerganov commented 1 year ago

It will be very slow - yes. But still interesting to see how long it would take to process jfk.wav.

trholding commented 1 year ago

No cigar

I have the old Raspi Zero W, and it was not connected to the internet to update the clock.

Short Log: make: warning: Clock skew detected. Your build may be incomplete. whisper_model_load: ggml ctx size = 84.99 MB Illegal instruction

Full Log:

pi@zero:~/X/whisper.pi $ time make tiny.en
make: Warning: File 'Makefile' has modification time 6366223 s in the future
cc  -O3 -std=c11   -Wall -Wextra -Wno-unused-parameter -Wno-unused-function -pthread -mfpu=neon-fp-armv8 -mfp16-format=ieee -mno-unaligned-access   -c ggml.c
g++ -O3 -std=c++11 -Wall -Wextra -Wno-unused-parameter -Wno-unused-function -pthread -c whisper.cpp
In file included from /usr/include/c++/10/bits/stl_algo.h:61,
                 from /usr/include/c++/10/algorithm:62,
                 from whisper.cpp:5:
/usr/include/c++/10/bits/stl_heap.h: In function ‘void std::__adjust_heap(_RandomAccessIterator, _Distance, _Distance, _Tp, _Compare) [with _RandomAccessIterator = __gnu_cxx::__normal_iterator<std::pair<double, int>*, std::vector<std::pair<double, int> > >; _Distance = int; _Tp = std::pair<double, int>; _Compare = __gnu_cxx::__ops::_Iter_comp_iter<whisper_sample_best(const whisper_vocab&, const float*, bool)::<lambda(const std::pair<double, int>&, const std::pair<double, int>&)> >]’:
/usr/include/c++/10/bits/stl_heap.h:223:5: note: parameter passing for argument of type ‘__gnu_cxx::__normal_iterator<std::pair<double, int>*, std::vector<std::pair<double, int> > >’ changed in GCC 7.1
  223 |     __adjust_heap(_RandomAccessIterator __first, _Distance __holeIndex,
      |     ^~~~~~~~~~~~~
/usr/include/c++/10/bits/stl_heap.h: In function ‘void std::__adjust_heap(_RandomAccessIterator, _Distance, _Distance, _Tp, _Compare) [with _RandomAccessIterator = __gnu_cxx::__normal_iterator<std::pair<double, int>*, std::vector<std::pair<double, int> > >; _Distance = int; _Tp = std::pair<double, int>; _Compare = __gnu_cxx::__ops::_Iter_comp_iter<whisper_sample_timestamp(const whisper_vocab&, const float*)::<lambda(const std::pair<double, int>&, const std::pair<double, int>&)> >]’:
/usr/include/c++/10/bits/stl_heap.h:223:5: note: parameter passing for argument of type ‘__gnu_cxx::__normal_iterator<std::pair<double, int>*, std::vector<std::pair<double, int> > >’ changed in GCC 7.1
In file included from /usr/include/c++/10/vector:72,
                 from whisper.cpp:15:
/usr/include/c++/10/bits/vector.tcc: In member function ‘void std::vector<_Tp, _Alloc>::_M_realloc_insert(std::vector<_Tp, _Alloc>::iterator, _Args&& ...) [with _Args = {std::pair<double, int>}; _Tp = std::pair<double, int>; _Alloc = std::allocator<std::pair<double, int> >]’:
/usr/include/c++/10/bits/vector.tcc:426:7: note: parameter passing for argument of type ‘std::vector<std::pair<double, int> >::iterator’ changed in GCC 7.1
  426 |       vector<_Tp, _Alloc>::
      |       ^~~~~~~~~~~~~~~~~~~
/usr/include/c++/10/bits/vector.tcc: In function ‘whisper_vocab::id whisper_sample_best(const whisper_vocab&, const float*, bool)’:
/usr/include/c++/10/bits/vector.tcc:121:21: note: parameter passing for argument of type ‘__gnu_cxx::__normal_iterator<std::pair<double, int>*, std::vector<std::pair<double, int> > >’ changed in GCC 7.1
  121 |    _M_realloc_insert(end(), std::forward<_Args>(__args)...);
      |    ~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/usr/include/c++/10/bits/vector.tcc: In function ‘whisper_vocab::id whisper_sample_timestamp(const whisper_vocab&, const float*)’:
/usr/include/c++/10/bits/vector.tcc:121:21: note: parameter passing for argument of type ‘__gnu_cxx::__normal_iterator<std::pair<double, int>*, std::vector<std::pair<double, int> > >’ changed in GCC 7.1
  121 |    _M_realloc_insert(end(), std::forward<_Args>(__args)...);
      |    ~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/usr/include/c++/10/bits/vector.tcc: In member function ‘void std::vector<_Tp, _Alloc>::_M_realloc_insert(std::vector<_Tp, _Alloc>::iterator, _Args&& ...) [with _Args = {whisper_result}; _Tp = whisper_result; _Alloc = std::allocator<whisper_result>]’:
/usr/include/c++/10/bits/vector.tcc:426:7: note: parameter passing for argument of type ‘std::vector<whisper_result>::iterator’ changed in GCC 7.1
  426 |       vector<_Tp, _Alloc>::
      |       ^~~~~~~~~~~~~~~~~~~
/usr/include/c++/10/bits/vector.tcc: In member function ‘void std::vector<_Tp, _Alloc>::_M_realloc_insert(std::vector<_Tp, _Alloc>::iterator, _Args&& ...) [with _Args = {whisper_segment}; _Tp = whisper_segment; _Alloc = std::allocator<whisper_segment>]’:
/usr/include/c++/10/bits/vector.tcc:426:7: note: parameter passing for argument of type ‘std::vector<whisper_segment>::iterator’ changed in GCC 7.1
/usr/include/c++/10/bits/vector.tcc: In function ‘int whisper_full(whisper_context*, whisper_full_params, const float*, int)’:
/usr/include/c++/10/bits/vector.tcc:121:21: note: parameter passing for argument of type ‘__gnu_cxx::__normal_iterator<whisper_result*, std::vector<whisper_result> >’ changed in GCC 7.1
  121 |    _M_realloc_insert(end(), std::forward<_Args>(__args)...);
      |    ~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/usr/include/c++/10/bits/vector.tcc:121:21: note: parameter passing for argument of type ‘__gnu_cxx::__normal_iterator<whisper_segment*, std::vector<whisper_segment> >’ changed in GCC 7.1
  121 |    _M_realloc_insert(end(), std::forward<_Args>(__args)...);
      |    ~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/usr/include/c++/10/bits/vector.tcc:121:21: note: parameter passing for argument of type ‘__gnu_cxx::__normal_iterator<whisper_segment*, std::vector<whisper_segment> >’ changed in GCC 7.1
  121 |    _M_realloc_insert(end(), std::forward<_Args>(__args)...);
      |    ~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
g++ -O3 -std=c++11 -Wall -Wextra -Wno-unused-parameter -Wno-unused-function -pthread main.cpp whisper.o ggml.o -o main
./main -h

usage: ./main [options] file0.wav file1.wav ...

options:
  -h,       --help           show this help message and exit
  -s SEED,  --seed SEED      RNG seed (default: -1)
  -t N,     --threads N      number of threads to use during computation (default: 1)
  -o N,     --offset N       offset in milliseconds (default: 0)
  -v,       --verbose        verbose output
            --translate      translate from source language to english
  -otxt,    --output-txt     output result in a text file
  -ovtt,    --output-vtt     output result in a vtt file
  -osrt,    --output-srt     output result in a srt file
  -ps,      --print_special  print special tokens
  -nt,      --no_timestamps  do not print timestamps
  -l LANG,  --language LANG  spoken language (default: en)
  -m FNAME, --model FNAME    model path (default: models/ggml-base.en.bin)
  -f FNAME, --file FNAME     input WAV file path

bash ./download-ggml-model.sh tiny.en
Downloading ggml model tiny.en ...
Model tiny.en already exists. Skipping download.

===============================================
Running tiny.en on all samples in ./samples ...
===============================================

----------------------------------------------
[+] Running base.en on samples/jfk.wav ... (run 'ffplay samples/jfk.wav' to listen)
----------------------------------------------

whisper_model_load: loading model from 'models/ggml-tiny.en.bin'
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 384
whisper_model_load: n_audio_head  = 6
whisper_model_load: n_audio_layer = 4
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 384
whisper_model_load: n_text_head   = 6
whisper_model_load: n_text_layer  = 4
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 1
whisper_model_load: mem_required  = 244.00 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: ggml ctx size =  84.99 MB
Illegal instruction

make: warning:  Clock skew detected.  Your build may be incomplete.

real    5m28.556s
user    5m21.719s
sys     0m4.545s

pi@zero:~/X/whisper.pi $ cat /proc/cpuinfo 
processor       : 0
model name      : ARMv6-compatible processor rev 7 (v6l)
BogoMIPS        : 996.14
Features        : half thumb fastmult vfp edsp java tls 
CPU implementer : 0x41
CPU architecture: 7
CPU variant     : 0x0
CPU part        : 0xb76
CPU revision    : 7

Hardware        : BCM2835
Revision        : 9000c1
Serial          : XXXXXX (Serial Removed)
Model           : Raspberry Pi Zero W Rev 1.1

I think it's this flag: -mfpu=neon-fp-armv8, as we are on ARMv6...

Extremely unwell. Will continue experiments soon. I hope this will help you.

ggerganov commented 1 year ago

I think it's this flag: -mfpu=neon-fp-armv8, as we are on ARMv6...

Yes - you are probably right. What is the output of: gcc -c -Q -mcpu=native --help=target and cat /proc/cpuinfo

trholding commented 1 year ago

GCC info:

pi@zero:~ $ gcc -c -Q -mcpu=native --help=target 
The following options are target specific:
  -mabi=                                aapcs-linux
  -mabort-on-noreturn                   [disabled]
  -mandroid                             [disabled]
  -mapcs                                [disabled]
  -mapcs-frame                          [disabled]
  -mapcs-reentrant                      [disabled]
  -mapcs-stack-check                    [disabled]
  -march=                               armv6kz+fp
  -marm                                 [enabled]
  -masm-syntax-unified                  [disabled]
  -mbe32                                [enabled]
  -mbe8                                 [disabled]
  -mbig-endian                          [disabled]
  -mbionic                              [disabled]
  -mbranch-cost=                        -1
  -mcallee-super-interworking           [disabled]
  -mcaller-super-interworking           [disabled]
  -mcmse                                [disabled]
  -mcpu=                                arm1176jzf-s
  -mfdpic                               [disabled]
  -mfix-cortex-m3-ldrd                  [disabled]
  -mflip-thumb                          [disabled]
  -mfloat-abi=                          hard
  -mfp16-format=                        none
  -mfpu=                                vfp
  -mgeneral-regs-only                   [disabled]
  -mglibc                               [enabled]
  -mhard-float                          -mfloat-abi=hard
  -mlittle-endian                       [enabled]
  -mlong-calls                          [disabled]
  -mmusl                                [disabled]
  -mneon-for-64bits                     [disabled]
  -mpic-data-is-text-relative           [enabled]
  -mpic-register=             
  -mpoke-function-name                  [disabled]
  -mprint-tune-info                     [disabled]
  -mpure-code                           [disabled]
  -mrestrict-it                         [disabled]
  -msched-prolog                        [enabled]
  -msingle-pic-base                     [disabled]
  -mslow-flash-data                     [disabled]
  -msoft-float                          -mfloat-abi=soft
  -mstructure-size-boundary=            8
  -mthumb                               [disabled]
  -mthumb-interwork                     [disabled]
  -mtls-dialect=                        gnu
  -mtp=                                 cp15
  -mtpcs-frame                          [disabled]
  -mtpcs-leaf-frame                     [disabled]
  -mtune=                     
  -muclibc                              [disabled]
  -munaligned-access                    [enabled]
  -mvectorize-with-neon-double          [disabled]
  -mvectorize-with-neon-quad            [enabled]
  -mword-relocations                    [disabled]

  Known ARM ABIs (for use with the -mabi= option):
    aapcs aapcs-linux apcs-gnu atpcs iwmmxt

  Known __fp16 formats (for use with the -mfp16-format= option):
    alternative ieee none

  Known ARM FPUs (for use with the -mfpu= option):
    auto crypto-neon-fp-armv8 fp-armv8 fpv4-sp-d16 fpv5-d16 fpv5-sp-d16 neon neon-fp-armv8 neon-fp16 neon-vfpv3 neon-vfpv4 vfp
    vfp3 vfpv2 vfpv3 vfpv3-d16 vfpv3-d16-fp16 vfpv3-fp16 vfpv3xd vfpv3xd-fp16 vfpv4 vfpv4-d16

  Valid arguments to -mtp=:
    auto cp15 soft

  Known floating-point ABIs (for use with the -mfloat-abi= option):
    hard soft softfp

  TLS dialect to use:
    gnu gnu2

CPU Info:

pi@zero:~ $ cat /proc/cpuinfo
processor       : 0
model name      : ARMv6-compatible processor rev 7 (v6l)
BogoMIPS        : 996.14
Features        : half thumb fastmult vfp edsp java tls 
CPU implementer : 0x41
CPU architecture: 7
CPU variant     : 0x0
CPU part        : 0xb76
CPU revision    : 7

Hardware        : BCM2835
Revision        : 9000c1
Serial          : XXXXXXXXXXX (Removed)
Model           : Raspberry Pi Zero W Rev 1.1

Info: https://gist.github.com/fm4dd/c663217935dc17f0fc73c9c81b0aa845

ggerganov commented 1 year ago

Yeah, I'm not an expert when it comes to Arm architectures and compile flags. Maybe try replacing -mfpu=neon-fp-armv8 with -mfpu=vfp and see if it helps. But likely some of the SIMD intrinsics that I use are not supported on this chip. Anyway, thanks for giving it a try.

trholding commented 1 year ago

Got Cigar after 35 Minutes!

But it was damn slow and I could take a screenshot only because I was using mosh.

First, it did not work with the Makefile compiler flag change alone - I also had to comment out a line in ggml.c.

Makefile line 38: removed -mfpu=neon-fp-armv8 and added -mfpu=vfp:

ifneq ($(filter armv6%,$(UNAME_M)),)
        # Raspberry Pi 0, 1, 2, 3 
        CFLAGS += -mfpu=vfp -mfp16-format=ieee -mno-unaligned-access
endif

ggml.c line 70: comment out (or ifdef) the following for 32-bit / Raspberry Pi 0, 1, 2, 3:

// #include <immintrin.h>

[screenshot: whisperPI0]

trholding commented 1 year ago

I have a Kindle which is jailbroken and has an Alpine distro and X; I'll try it on that too when I am well.

It has an i.MX 6ULL with a Cortex-A7 @ 528 MHz. I had already booted an x86_64 custom Linux with X on it via QEMU in Alpine ARM. It sort of worked well, to my surprise.

I think for whisper it may be twice as fast as the Raspi 0.

How hard would it be for you to support OpenCL as an additional backend for ggml? It would be a great use case, as OpenCL could help accelerate it even on Raspberry Pis, AMD systems, Android phones and other low-power devices that have a GPU.

trholding commented 1 year ago

@nyadla-sys I think the answer is yes and probably this could be closed. :)

ggerganov commented 1 year ago

No plans to support OpenCL in the near future - it's quite a lot of effort.

I think the answer is yes and probably this could be closed. :)

Agreed - will close this issue for now to reduce some clutter.

StuartIanNaylor commented 1 year ago

ROCK 5B Rockchip RK3588 ARM Cortex-A76

rock@rock-5b:~/whisper.cpp$ ./main -m models/ggml-base.en.bin -f samples/jfk.wav
whisper_model_load: loading model from 'models/ggml-base.en.bin'
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head  = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 512
whisper_model_load: n_text_head   = 8
whisper_model_load: n_text_layer  = 6
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 2
whisper_model_load: mem_required  = 505.00 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: ggml ctx size = 163.43 MB
whisper_model_load: memory size =    22.83 MB
whisper_model_load: model size  =   140.54 MB

main: processing 'samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, lang = en, task = transcribe, timestamps = 1 ...

[00:00:00.000 --> 00:00:11.000]   And so my fellow Americans, ask not what your country can do for you, ask what you can do for your country.

whisper_print_timings:     load time =   318.74 ms
whisper_print_timings:      mel time =   123.62 ms
whisper_print_timings:   sample time =     0.00 ms
whisper_print_timings:   encode time =  6228.12 ms / 1038.02 ms per layer
whisper_print_timings:   decode time =   758.88 ms / 126.48 ms per layer
whisper_print_timings:    total time =  7442.09 ms

ggerganov commented 1 year ago

@StuartIanNaylor That's awesome - the performance is kind of what I expected. Based on the jfk.wav example that you showed, I think you should be able to process a full 30 sec audio in about 10 seconds using the base model.

Do you have any applications in mind?

StuartIanNaylor commented 1 year ago

Just so you can see, I also ran from NVMe this time on the RK3588:

rock@rock-5b:~/nvme/whisper.cpp$ ./main -m models/ggml-base.en.bin -f samples/gb1.wav -t 8
whisper_model_load: loading model from 'models/ggml-base.en.bin'
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head  = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 512
whisper_model_load: n_text_head   = 8
whisper_model_load: n_text_layer  = 6
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 2
whisper_model_load: mem_required  = 505.00 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: ggml ctx size = 163.43 MB
whisper_model_load: memory size =    22.83 MB
whisper_model_load: model size  =   140.54 MB

main: processing 'samples/gb1.wav' (3179927 samples, 198.7 sec), 8 threads, lang = en, task = transcribe, timestamps = 1 ...

[00:00:00.000 --> 00:00:09.640]   My fellow Americans, this day has brought terrible news and great sadness to our country.
[00:00:09.640 --> 00:00:15.920]   At nine o'clock this morning, Mission Control in Houston lost contact with our Space Shuttle
[00:00:15.920 --> 00:00:17.440]   Columbia.
[00:00:17.440 --> 00:00:24.640]   A short time later, debris was seen falling from the skies above Texas.
[00:00:24.640 --> 00:00:27.200]   The Columbia's lost.
[00:00:27.200 --> 00:00:29.880]   There are no survivors.
[00:00:00.000 --> 00:00:03.040]   One board was a crew of seven.
[00:00:03.040 --> 00:00:10.200]   Colonel Rick Husband, Lieutenant Colonel Michael Anderson, Commander Laurel Clark, Captain David
[00:00:10.200 --> 00:00:20.640]   Brown, Commander William McCool, Dr. Kultna Shavla, and Elon Ramon, a colonel in the Israeli
[00:00:20.640 --> 00:00:22.920]   Air Force.
[00:00:22.920 --> 00:00:27.560]   These men and women assumed great risk in the service to all humanity.
[00:00:00.000 --> 00:00:06.680]   In an age when spaceflight has come to seem almost routine, it is easy to overlook the
[00:00:06.680 --> 00:00:12.960]   dangers of travel by rocket and the difficulties of navigating the fierce outer atmosphere of
[00:00:12.960 --> 00:00:15.160]   the Earth.
[00:00:15.160 --> 00:00:21.840]   These astronauts knew the dangers, and they faced them willingly, knowing they had a high
[00:00:21.840 --> 00:00:25.520]   and noble purpose in life.
[00:00:00.000 --> 00:00:07.760]   Because of their courage and daring and idealism, we will miss them all the more.
[00:00:07.760 --> 00:00:13.560]   All Americans today are thinking as well of the families of these men and women who have
[00:00:13.560 --> 00:00:17.440]   been given this sudden shock in grief.
[00:00:17.440 --> 00:00:19.320]   You're not alone.
[00:00:19.320 --> 00:00:25.360]   Our entire nation agrees with you, and those you loved will always have the respect and
[00:00:25.360 --> 00:00:29.320]   gratitude of this country.
[00:00:00.000 --> 00:00:04.240]   The cause in which they died will continue.
[00:00:04.240 --> 00:00:11.840]   Mankind has led into the darkness beyond our world by the inspiration of discovery and
[00:00:11.840 --> 00:00:14.720]   the longing to understand.
[00:00:14.720 --> 00:00:18.840]   Our journey into space will go on.
[00:00:18.840 --> 00:00:24.160]   In the skies today, we saw destruction and tragedy.
[00:00:24.160 --> 00:00:29.760]   As farther than we can see, there is comfort and hope.
[00:00:00.000 --> 00:00:07.720]   In the words of the prophet Isaiah, "Lift your eyes and look to the heavens who created
[00:00:07.720 --> 00:00:17.240]   all these, he who brings out the starry hosts one by one and calls them each by name."
[00:00:17.240 --> 00:00:24.360]   Because of his great power and mighty strength, not one of them is missing.
[00:00:00.000 --> 00:00:07.320]   The same creator who names the stars also knows the names of the seven souls we mourn
[00:00:07.320 --> 00:00:09.160]   today.
[00:00:09.160 --> 00:00:16.840]   The crew of the shuttle Columbia did not return safely to Earth yet we can pray that all are
[00:00:16.840 --> 00:00:19.440]   safely home.
[00:00:19.440 --> 00:00:26.240]   May God bless the grieving families and may God continue to bless America.
[00:00:00.000 --> 00:00:10.000]   [BLANK_AUDIO]

whisper_print_timings:     load time =   385.00 ms
whisper_print_timings:      mel time =  1659.99 ms
whisper_print_timings:   sample time =     0.00 ms
whisper_print_timings:   encode time = 49284.12 ms / 8214.02 ms per layer
whisper_print_timings:   decode time = 22127.64 ms / 3687.94 ms per layer
whisper_print_timings:    total time = 73697.95 ms

My now-aging Xeon(R) CPU E3-1245:

./main -m models/ggml-base.en.bin -f samples/jfk.wav
whisper_model_load: loading model from 'models/ggml-base.en.bin'
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head  = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 512
whisper_model_load: n_text_head   = 8
whisper_model_load: n_text_layer  = 6
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 2
whisper_model_load: mem_required  = 505.00 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: ggml ctx size = 163.43 MB
whisper_model_load: memory size =    22.83 MB 
whisper_model_load: model size  =   140.54 MB

main: processing 'samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, lang = en, task = transcribe, timestamps = 1 ...

[00:00:00.000 --> 00:00:11.000]   And so my fellow Americans, ask not what your country can do for you, ask what you can do for your country.

whisper_print_timings:     load time =   221.60 ms
whisper_print_timings:      mel time =    85.55 ms
whisper_print_timings:   sample time =     0.00 ms
whisper_print_timings:   encode time =  1707.26 ms / 284.54 ms per layer
whisper_print_timings:   decode time =   183.90 ms / 30.65 ms per layer
whisper_print_timings:    total time =  2211.89 ms

Not too sure about the streaming input, as it was just full of timeouts, and strangely Whisper seems to think I am blowing kisses - maybe at your code, but certainly not in dictation :)

I was just browsing and saw the mentions of trying to run on a Pi 4, and that a Pi 5 might be coming soon, so I cloned the repo to benchmark on the Rock5b as I have one. It also has a 6 TOPS NPU and a Mali G610, but that is still at the BSP-image stage whilst submissions trickle into mainline; still, it is new and was available to benchmark.

As for an app, not really, because for a user like me I think we are missing some key elements, whilst some of the more mainstream open source seems wide of the mark to me and is still absent some essentials. I think we have some really good frameworks such as SpeechBrain and ESPnet and the like, but when it comes to a user-friendly all-in-one such as Mycroft or Rhasspy, I think they are really poor, with a lot of appropriation of loosely licenced open source rebranded as their own IP for IP's sake. They also have extremely complex framework protocols, whilst I am a bit old school and believe in Linux as a file system and simplicity.

The biggest thing we are missing is the DSP audio algorithms: there isn't even a decent beamformer or realtime BSS on Linux, so unless you are recording broadcast-mic style, the input audio is generally of poor quality, and that affects everything upstream.

I did hack together a realtime delay-sum beamformer out of frustration, but cannot find anything for BSS (blind source separation); I also thought I could run two KWS instances and select the stream whose KWS returns the best argmax. There have been some recent talks, but not much code, on target speaker separation like Google's VoiceFilter-Lite, which they are keeping close to their chest. https://github.com/BUTSpeechFIT/speakerbeam

There are some really great ASR / TTS / NLU-NLP packages and repos, but on Linux we are really short on solutions for the initial audio stream processing, unless you are recording in a broadcast-like scenario. Even the open-source KWS options are extremely flaky, apart perhaps from https://github.com/google-research/google-research/tree/master/kws_streaming

P.S. If you want shock and horror at my introduction and first C program, the delay-sum beamformer is here: https://github.com/StuartIanNaylor/ProjectEars/tree/main/ds

I think what you have done is amazing, but there are certain key elements I feel are missing earlier in the audio stream processing that stop me bothering; I have been banging this drum for a couple of years now.

StuartIanNaylor commented 1 year ago

@ggerganov If you have them, at some point would you post the Pi 4 outputs as a comparison? I don't have a Pi 4 anymore but would like to compare.

ggerganov commented 1 year ago

@StuartIanNaylor Here is the output on my RPi4:

pi@raspberrypi:~/whisper.cpp $ ./main -m models/ggml-base.en.bin -f samples/jfk.wav -t 4
whisper_model_load: loading model from 'models/ggml-base.en.bin'
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head  = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 512
whisper_model_load: n_text_head   = 8
whisper_model_load: n_text_layer  = 6
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 2
whisper_model_load: mem_required  = 505.00 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: ggml ctx size = 163.43 MB
whisper_model_load: memory size =    22.83 MB 
whisper_model_load: model size  =   140.54 MB

main: processing 'samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, lang = en, task = transcribe, timestamps = 1 ...

[00:00:00.000 --> 00:00:11.000]   And so my fellow Americans, ask not what your country can do for you, ask what you can do for your country.

whisper_print_timings:     load time =  1851.33 ms
whisper_print_timings:      mel time =   270.67 ms
whisper_print_timings:   sample time =     0.00 ms
whisper_print_timings:   encode time = 33790.07 ms / 5631.68 ms per layer
whisper_print_timings:   decode time =  1287.69 ms / 214.61 ms per layer
whisper_print_timings:    total time = 37281.19 ms

Btw, thanks for the earlier detailed information - I'm new to the speech-recognition field, so I am not familiar with most of the terminology, but still - appreciate the information!

RyanSelesnik commented 1 year ago

Some more experiments - enabling NEON instructions reduces the time all the way down to just ~15 seconds to process a 30 second audio.

@ggerganov Hi, how do you enable NEON? Or do you know of any other methods to speed up inference? I'm on a Raspberry Pi 4B (Armv8):

yan@raspberrypi:~/Desktop/whisper.cpp$ cat /proc/cpuinfo
processor   : 0
BogoMIPS    : 108.00
Features    : fp asimd evtstrm crc32 cpuid
CPU implementer : 0x41
CPU architecture: 8
CPU variant : 0x0
CPU part    : 0xd08
CPU revision    : 3

This is what I get for jfk.wav

===============================================
Running tiny.en on all samples in ./samples ...
===============================================

----------------------------------------------
[+] Running base.en on samples/jfk.wav ... (run 'ffplay samples/jfk.wav' to listen)
----------------------------------------------

whisper_model_load: loading model from 'models/ggml-tiny.en.bin'
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 384
whisper_model_load: n_audio_head  = 6
whisper_model_load: n_audio_layer = 4
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 384
whisper_model_load: n_text_head   = 6
whisper_model_load: n_text_layer  = 4
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 1
whisper_model_load: mem_required  = 476.00 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: ggml ctx size =  73.58 MB
whisper_model_load: memory size =    11.41 MB
whisper_model_load: model size  =    73.54 MB

system_info: n_threads = 4 / 4 | AVX2 = 0 | AVX512 = 0 | NEON = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 |

main: processing 'samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...

[00:00:00.000 --> 00:00:07.740]   And so my fellow Americans ask not what your country can do for you
[00:00:07.740 --> 00:00:10.740]   ask what you can do for your country

whisper_print_timings:     load time =   920.53 ms
whisper_print_timings:      mel time =   282.52 ms
whisper_print_timings:   sample time =    19.86 ms
whisper_print_timings:   encode time =  8879.22 ms / 2219.81 ms per layer
whisper_print_timings:   decode time =   671.52 ms / 167.88 ms per layer
whisper_print_timings:    total time = 10789.05 ms

This is what I get for a 4 sec wav:

ryan@raspberrypi:~/Desktop/whisper.cpp$ ./main -m models/ggml-tiny.en.bin -f test.wav
whisper_model_load: loading model from 'models/ggml-tiny.en.bin'
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 384
whisper_model_load: n_audio_head  = 6
whisper_model_load: n_audio_layer = 4
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 384
whisper_model_load: n_text_head   = 6
whisper_model_load: n_text_layer  = 4
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 1
whisper_model_load: mem_required  = 476.00 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: ggml ctx size =  73.58 MB
whisper_model_load: memory size =    11.41 MB
whisper_model_load: model size  =    73.54 MB

system_info: n_threads = 4 / 4 | AVX2 = 0 | AVX512 = 0 | NEON = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 |

main: processing 'test.wav' (80000 samples, 5.0 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...

[00:00:00.000 --> 00:00:04.000]   Hello, my name is Ryan.

whisper_print_timings:     load time =   937.94 ms
whisper_print_timings:      mel time =   110.76 ms
whisper_print_timings:   sample time =     9.65 ms
whisper_print_timings:   encode time =  8791.28 ms / 2197.82 ms per layer
whisper_print_timings:   decode time =   248.15 ms / 62.04 ms per layer
whisper_print_timings:    total time = 10099.49 ms

Any help would be greatly appreciated.

ggerganov commented 1 year ago

@RyanSelesnik I think the Makefile is missing flags for Armv8.

You can compile manually like this:

gcc -I.              -O3 -std=c11   -pthread -mfpu=neon-fp-armv8 -mfp16-format=ieee -mno-unaligned-access -funsafe-math-optimizations   -c ggml.c
g++ -I. -I./examples -O3 -std=c++11 -pthread -c whisper.cpp
g++ -I. -I./examples -O3 -std=c++11 -pthread examples/main/main.cpp whisper.o ggml.o -o main

It will be interesting to see what performance you get. It's already pretty good even without NEON. What model is this, by the way? I have an RPi4 B from 2018 and it has an ARMv7 processor, not ARMv8.

StuartIanNaylor commented 1 year ago

@RyanSelesnik I think the Makefile is missing flags for Armv8.

You can compile manually like this:

gcc -I.              -O3 -std=c11   -pthread -mfpu=neon-fp-armv8 -mfp16-format=ieee -mno-unaligned-access -funsafe-math-optimizations   -c ggml.c
g++ -I. -I./examples -O3 -std=c++11 -pthread -c whisper.cpp
g++ -I. -I./examples -O3 -std=c++11 -pthread examples/main/main.cpp whisper.o ggml.o -o main

It will be interesting to see what performance you get. It's already pretty good even without NEON. What model is this, by the way? I have an RPi4 B from 2018 and it has an ARMv7 processor, not ARMv8.

It just depends what OS you are running, as the BCM2711 will run both instruction sets depending on whether it is a 32-bit or 64-bit OS.
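
For reference, on a 64-bit OS the aarch64 compiler does not accept -mfpu or -mfp16-format at all (NEON is part of the baseline there), so a manual build along the lines of the commands above would look roughly like this instead - just a sketch, not something tested on every image:

gcc -I.              -O3 -std=c11   -pthread -march=armv8-a -funsafe-math-optimizations   -c ggml.c
g++ -I. -I./examples -O3 -std=c++11 -pthread -c whisper.cpp
g++ -I. -I./examples -O3 -std=c++11 -pthread examples/main/main.cpp whisper.o ggml.o -o main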

/home/ryan/Desktop/whisper/venv/lib/python3.9/site-packages/whisper/transcribe.py:78: UserWarning: FP16 is not supported on CPU; using FP32 instead warnings.warn("FP16 is not supported on CPU; using FP32 instead") Illegal instruction

Why you are running OpenAI's Python version of Whisper and posting the issues on ggerganov's C++ port of Whisper is curious, though.

andres-ramirez-duque commented 1 year ago

Some more experiments - enabling NEON instructions reduces the time all the way down to just ~15 seconds to process a 30 second audio.

@ggerganov Hi, how do you enable NEON? Or do you know of any other methods to speed up inference? I'm on a Raspberry Pi 4B (Armv8):

Hi @RyanSelesnik, did you manage to compile with NEON enabled? Or what have you done to improve performance?

RyanSelesnik commented 1 year ago

@ggerganov It seems as though Aarch64 has compiler optimisations enabled by default:

Arm C/C++ Compiler automatically vectorizes your code at the -O2, -O3, and -Ofast higher optimization levels.

See here and here.

However, this is not indicated when running the code

AVX2 = 0 | AVX512 = 0 | NEON = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 |

I am quite a newb, but do the flags above mean that optimisations are not enabled, even if they are enabled by default?
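
One way to see which SIMD paths the compiler will actually take is to dump its predefined macros - ggml selects its NEON code based on __ARM_NEON. A quick check, assuming GCC:

# list the relevant predefined macros for the default target
gcc -dM -E - < /dev/null | grep -E '__ARM_NEON|__aarch64__|__ARM_FEATURE_FP16'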

ggerganov commented 1 year ago

@RyanSelesnik Should be fixed now - ggml didn't correctly report that NEON is used

andres-ramirez-duque commented 1 year ago

Hi @ggerganov, thanks for your work - below is what I got:

ubuntu@ubuntu:~/usrlib/whisper.cpp$ ./main -m models/ggml-tiny.en.bin -f samples/jfk.wav
whisper_model_load: loading model from 'models/ggml-tiny.en.bin'
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 384
whisper_model_load: n_audio_head  = 6
whisper_model_load: n_audio_layer = 4
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 384
whisper_model_load: n_text_head   = 6
whisper_model_load: n_text_layer  = 4
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 1
whisper_model_load: mem_required  = 390.00 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: ggml ctx size =  73.58 MB
whisper_model_load: memory size =    11.41 MB
whisper_model_load: model size  =    73.54 MB

system_info: n_threads = 4 / 4 | AVX2 = 0 | AVX512 = 0 | NEON = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 |

main: processing 'samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...

[00:00:00.000 --> 00:00:07.740]   And so my fellow Americans ask not what your country can do for you
[00:00:07.740 --> 00:00:10.740]   ask what you can do for your country

whisper_print_timings:     load time =  2345.03 ms
whisper_print_timings:      mel time =   206.80 ms
whisper_print_timings:   sample time =    22.16 ms
whisper_print_timings:   encode time =  8368.69 ms / 2092.17 ms per layer
whisper_print_timings:   decode time =   805.82 ms / 201.46 ms per layer
whisper_print_timings:    total time = 11752.22 ms

Are there some other improvements that you have in mind? Maybe I can contribute. Cheers

ggerganov commented 1 year ago

@andres-ramirez-duque There are some minor optimizations pending in ggml but these would bring a few per cent improvement at best (if any). The next big optimization might be going to 8-bit weights, but I am not sure how to implement this exactly.

I am more interested in finding a way to somehow evaluate the transformer "partially" in order to gain speed and reduce memory usage, even if it costs a lot of accuracy. I tried to "downsample" the feed-forward layers by a factor of 2 by merging neighbouring weights, but then it stops working completely. The idea of "downsampling" the layers does not feel completely unreasonable to me, but either I have a bug somewhere or I am missing something and this cannot actually work. This idea is in the model-compression branch, last 2 commits.

Other than that, I don't have any other good ideas at the moment.

StuartIanNaylor commented 1 year ago

Is it possible to split the layers or the encoder/decoder, so that maybe you could run 50/50 CPU/GPU with Metal? Or other Arm devices with OpenCL, with the likes of Arm NN? Still amazed at your own tensor lib, so who knows, maybe you can provide a Vulkan version?

ggerganov commented 1 year ago

@andres-ramirez-duque and all,

If you are still interested in real-time audio transcription on the Raspberry, please give the stream branch a try:

git fetch --all
git checkout stream
git reset --hard origin/stream

make clean
make stream

./stream -m ./models/ggml-tiny.en.bin -t 4 --step 4000 --length 8000

I am not sure if it will work because I don't have a USB microphone to test with. Let me know if you give it a try. Thanks

nyadla-sys commented 1 year ago

@ggerganov my RPi4 system crashes while running ./stream -m ./models/ggml-tiny.en.bin -t 4 --step 4000 --length 8000

StuartIanNaylor commented 1 year ago

@ggerganov my RPi4 system crashes while running ./stream -m ./models/ggml-tiny.en.bin -t 4 --step 4000 --length 8000

Probably because you just don't have enough CPU to run in realtime - the streaming mode adds considerable load onto even the tiny model, which it is already struggling with. I think streaming mode might be better on lesser CPUs if it were fed and chunked via some form of VAD (based on a pause of probably 200-300 ms) that can queue.

ggerganov commented 1 year ago

@nyadla-sys Thanks for giving it a try. Can you provide logs from the make and stream commands? Also, you can retry downloading the tiny.en model, just in case you have it partially downloaded somehow.

@StuartIanNaylor Even if the CPU is not enough for realtime, the program does not crash. There must be something wrong in the environment. The hope is that with the partial encoder inference, it would be enough for realtime even without adding VAD.

StuartIanNaylor commented 1 year ago

Maybe there is something wrong with the make or stream commands, as I think it was only the tiny.en model that would run on my Xeon E3-1245. On the RK3588 it was really strange gibberish, where it would repeat and freeze with extremely low accuracy, as if the input was looping or something. I just didn't think there was a chance - and if it is not faster than realtime, what and how is it going to stream?

ggerganov commented 1 year ago

So I bought a cheap mic today and tested this using my Raspberry Pi 4:

https://user-images.githubusercontent.com/1991296/201538687-65ebf070-e821-48b9-a80c-bc66249d2f26.mp4

This is using the tiny.en model with a time step of 4 seconds and 3 threads. The transcription quality is quite low due to the change in #137 and having no context across the time steps, but at least it is running in real-time.

Not sure if this is useful, or how it compares to other voice-to-text solutions for Raspberry. What do you guys think?

Edit: If you increase the time step to about 7.5 seconds, the accuracy improves significantly:

https://user-images.githubusercontent.com/1991296/201540378-0e56be2e-e809-4b4c-a4ae-ba392a2d041a.mp4
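
For reference, the longer-step run above corresponds to a command along these lines (the exact values are an assumption; --step and --length are in milliseconds):

./stream -m ./models/ggml-tiny.en.bin -t 3 --step 7500 --length 15000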

StuartIanNaylor commented 1 year ago

Looks good - I will have to try it, as it seems vastly different from when I tried; those results are amazingly good compared to the ones I got.

andres-ramirez-duque commented 1 year ago

Wow! @ggerganov it looks like impressive work, such a big difference from the previous week!

The quality can improve, but it is a great first step. I leave the results of my transcriptions below; I used 4 threads instead of 3. Feel free to ask me for quick tests or benchmark runs on the Raspberry.

ubuntu@ubuntu:~/usrlib/whisper.cpp$ ./stream -m ./models/ggml-tiny.en.bin -t 4 --step 4000 --length 8000 -c 0
audio_sdl_init: found 1 capture devices:
audio_sdl_init: - Capture device #0: 'Sennheiser USB headset, USB Audio'
audio_sdl_init: attempt to open capture device 0 : 'Sennheiser USB headset, USB Audio' ...
audio_sdl_init: obtained spec for input device (SDL Id = 2):
audio_sdl_init: - sample rate: 16000
audio_sdl_init: - format: 33056 (required: 33056)
audio_sdl_init: - channels: 1 (required: 1)
audio_sdl_init: - samples per frame: 1024
whisper_model_load: loading model from './models/ggml-tiny.en.bin'
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 384
whisper_model_load: n_audio_head  = 6
whisper_model_load: n_audio_layer = 4
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 384
whisper_model_load: n_text_head   = 6
whisper_model_load: n_text_layer  = 4
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 1
whisper_model_load: mem_required  = 390.00 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: ggml ctx size =  73.58 MB
whisper_model_load: memory size =    11.41 MB
whisper_model_load: model size  =    73.54 MB

main: processing 64000 samples (step = 4.0 sec / len = 8.0 sec), 4 threads, lang = en, task = transcribe, timestamps = 0 ...
main: n_new_line = 1

we don't have the density of all the society and I think that doesn't seem to me. me intractable. It's just something that we have to deal with. It seems weird that the Twitter... but like really crappy Taylor bots are so numerous. I guess he said, so I presume that the engineers of Twitter are... very good so it seems like what I would infer from that. Is it seem like a hard problem? It did the problem. catching or if I were to sort of steal that in the case. It's a hard problem and there's a huge cost too. False positive two to removing a post by somebody that's not a part. That's a crazy very bad user experience, so they're very cautious about it. the maybe the bathroom maybe the bathroom maybe the bathroom really good at learning what gets removed and not, especially if they can stay. they added the removal process very quickly. Mind pressure of it honestly. There's a lot of one for it. I mean, just that's what I... It's not my impression of if it's not but you have to be Yeah, that's my impression as well, but it feels like maybe... maybe you're seeing the tip of the iceberg, maybe the number one. A couple of boxes in like the trillions, and you have to like... Just, it's a constant assault of the body. You're dead enough. I mean you have to steal many of the keys because it's about time. I'm seeing a pretty obvious I can write a few lines of code that counts this way. spots. I mean definitely there's a lot of blue in front but I will say I agree that if you If you are a sophisticated natur, you could probably create a pretty good bot right now. you know, using tools like GPDs because it's a language model you can... and generate faces that look quite good now. And you can... do this as bail and so I think it's quite plausible and it's good. going to be hard to defend. There was a Google engineer that claimed that the... Lemptos, essentially. Do you think there's any in the inkling of truth to what he felt. and more importantly to me at least, dease.

ubuntu@ubuntu:~/usrlib/whisper.cpp$ ./stream -m ./models/ggml-tiny.en.bin -t 4 --step 7680 --length 15360 -c 0
audio_sdl_init: found 1 capture devices:
audio_sdl_init: - Capture device #0: 'Sennheiser USB headset, USB Audio'
audio_sdl_init: attempt to open capture device 0 : 'Sennheiser USB headset, USB Audio' ...
audio_sdl_init: obtained spec for input device (SDL Id = 2):
audio_sdl_init: - sample rate: 16000
audio_sdl_init: - format: 33056 (required: 33056)
audio_sdl_init: - channels: 1 (required: 1)
audio_sdl_init: - samples per frame: 1024
whisper_model_load: loading model from './models/ggml-tiny.en.bin'
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 384
whisper_model_load: n_audio_head  = 6
whisper_model_load: n_audio_layer = 4
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 384
whisper_model_load: n_text_head   = 6
whisper_model_load: n_text_layer  = 4
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 1
whisper_model_load: mem_required  = 390.00 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: ggml ctx size =  73.58 MB
whisper_model_load: memory size =    11.41 MB
whisper_model_load: model size  =    73.54 MB

main: processing 122880 samples (step = 7.7 sec / len = 15.4 sec), 4 threads, lang = en, task = transcribe, timestamps = 0 ...
main: n_new_line = 1

I think it's possible that the success exploits are we should be trying to find them arranging. some kind of a crazy quantum mechanical system that somehow gives you buffer overflow, somehow gives you browning air in the floating point, synthetic. Intelligences are kind of like the next stage of development and I don't know where it leads to like at some point I suspect The universe is some kind of a puzzle. These synthetic eyes will uncover that puzzle end. Solving. The following is a conversation with Patrick and Pothi, previously the director of AI. and before that, it opened aye and Stanford. He is one of the greatest scientists. engineers and educators in the history of artificial intelligence. This is the the support that we check out our sponsors. Now, dear friends, here's Andre, kapati. What is in your own network? And what does it seem to do such as a prize in the good job of learning? What is in your It's a mathematical abstraction of the brain. I would say that's how it was originally developed. At the end of the day... the data in some mathematical expression and some fairly simple mathematical expression when you get down to it. It's basically a sequence of a metrosumote. which are fully cut products mathematically and some non-linearity is thrown in. So it's a very simple mathematical expression and it's got no... and it many nubs many nubs and these nubs are loosely related to being in the synapses in your brain they're trainable and modifiable the idea is like we need to find the setting of the knobs that makes the neural mat do whatever you want it to do like classify them just and

RYucel commented 1 year ago

Wow! @ggerganov it looks like impressive work, such a big difference from the previous week!

The quality can improve, but it is a great first step, I leave below the result of my transcriptions, I used 4 threads instead of 3. Feel free to ask me for quick tests or benchmark runs on the raspberry.

ubuntu@ubuntu:~/usrlib/whisper.cpp$ ./stream -m ./models/ggml-tiny.en.bin -t 4 --step 4000 --length 8000 -c 0 audio_sdl_init: found 1 capture devices: audio_sdl_init: - Capture device #0: 'Sennheiser USB headset, USB Audio' audio_sdl_init: attempt to open capture device 0 : 'Sennheiser USB headset, USB Audio' ... audio_sdl_init: obtained spec for input device (SDL Id = 2): audio_sdl_init: - sample rate: 16000 audio_sdl_init: - format: 33056 (required: 33056) audio_sdl_init: - channels: 1 (required: 1) audio_sdl_init: - samples per frame: 1024 whisper_model_load: loading model from './models/ggml-tiny.en.bin' whisper_model_load: n_vocab = 51864 whisper_model_load: n_audio_ctx = 1500 whisper_model_load: n_audio_state = 384 whisper_model_load: n_audio_head = 6 whisper_model_load: n_audio_layer = 4 whisper_model_load: n_text_ctx = 448 whisper_model_load: n_text_state = 384 whisper_model_load: n_text_head = 6 whisper_model_load: n_text_layer = 4 whisper_model_load: n_mels = 80 whisper_model_load: f16 = 1 whisper_model_load: type = 1 whisper_model_load: mem_required = 390.00 MB whisper_model_load: adding 1607 extra tokens whisper_model_load: ggml ctx size = 73.58 MB whisper_model_load: memory size = 11.41 MB whisper_model_load: model size = 73.54 MB

main: processing 64000 samples (step = 4.0 sec / len = 8.0 sec), 4 threads, lang = en, task = transcribe, timestamps = 0 ... main: n_new_line = 1

we don't have the density of all the society and I think that doesn't seem to me. me intractable. It's just something that we have to deal with. It seems weird that the Twitter... but like really crappy Taylor bots are so numerous. I guess he said, so I presume that the engineers of Twitter are... very good so it seems like what I would infer from that. Is it seem like a hard problem? It did the problem. catching or if I were to sort of steal that in the case. It's a hard problem and there's a huge cost too. False positive two to removing a post by somebody that's not a part. That's a crazy very bad user experience, so they're very cautious about it. the maybe the bathroom maybe the bathroom maybe the bathroom really good at learning what gets removed and not, especially if they can stay. they added the removal process very quickly. Mind pressure of it honestly. There's a lot of one for it. I mean, just that's what I... It's not my impression of if it's not but you have to be Yeah, that's my impression as well, but it feels like maybe... maybe you're seeing the tip of the iceberg, maybe the number one. A couple of boxes in like the trillions, and you have to like... Just, it's a constant assault of the body. You're dead enough. I mean you have to steal many of the keys because it's about time. I'm seeing a pretty obvious I can write a few lines of code that counts this way. spots. I mean definitely there's a lot of blue in front but I will say I agree that if you If you are a sophisticated natur, you could probably create a pretty good bot right now. you know, using tools like GPDs because it's a language model you can... and generate faces that look quite good now. And you can... do this as bail and so I think it's quite plausible and it's good. going to be hard to defend. There was a Google engineer that claimed that the... Lemptos, essentially. Do you think there's any in the inkling of truth to what he felt. and more importantly to me at least, dease.

ubuntu@ubuntu:~/usrlib/whisper.cpp$ ./stream -m ./models/ggml-tiny.en.bin -t 4 --step 7680 --length 15360 -c 0
audio_sdl_init: found 1 capture devices:
audio_sdl_init:    - Capture device #0: 'Sennheiser USB headset, USB Audio'
audio_sdl_init: attempt to open capture device 0 : 'Sennheiser USB headset, USB Audio' ...
audio_sdl_init: obtained spec for input device (SDL Id = 2):
audio_sdl_init:     - sample rate:       16000
audio_sdl_init:     - format:            33056 (required: 33056)
audio_sdl_init:     - channels:          1 (required: 1)
audio_sdl_init:     - samples per frame: 1024
whisper_model_load: loading model from './models/ggml-tiny.en.bin'
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 384
whisper_model_load: n_audio_head  = 6
whisper_model_load: n_audio_layer = 4
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 384
whisper_model_load: n_text_head   = 6
whisper_model_load: n_text_layer  = 4
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 1
whisper_model_load: mem_required  = 390.00 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: ggml ctx size =  73.58 MB
whisper_model_load: memory size =    11.41 MB
whisper_model_load: model size  =    73.54 MB

main: processing 122880 samples (step = 7.7 sec / len = 15.4 sec), 4 threads, lang = en, task = transcribe, timestamps = 0 ... main: n_new_line = 1

I think it's possible that the success exploits are we should be trying to find them arranging. some kind of a crazy quantum mechanical system that somehow gives you buffer overflow, somehow gives you browning air in the floating point, synthetic. Intelligences are kind of like the next stage of development and I don't know where it leads to like at some point I suspect The universe is some kind of a puzzle. These synthetic eyes will uncover that puzzle end. Solving. The following is a conversation with Patrick and Pothi, previously the director of AI. and before that, it opened aye and Stanford. He is one of the greatest scientists. engineers and educators in the history of artificial intelligence. This is the the support that we check out our sponsors. Now, dear friends, here's Andre, kapati. What is in your own network? And what does it seem to do such as a prize in the good job of learning? What is in your It's a mathematical abstraction of the brain. I would say that's how it was originally developed. At the end of the day... the data in some mathematical expression and some fairly simple mathematical expression when you get down to it. It's basically a sequence of a metrosumote. which are fully cut products mathematically and some non-linearity is thrown in. So it's a very simple mathematical expression and it's got no... and it many nubs many nubs and these nubs are loosely related to being in the synapses in your brain they're trainable and modifiable the idea is like we need to find the setting of the knobs that makes the neural mat do whatever you want it to do like classify them just and

Could you provide a simple tutorial or some step-by-step instructions for implementing this on a Raspberry Pi? It would be a very good solution for our little home project.

StuartIanNaylor commented 1 year ago

I think maybe I hadn't checked out the stream branch before compiling, but the results are much better now. On an RK3588 I still get occasional word repetition, as it seems to prefer dialogue over stop/pause/start command-style sentences. I presume that is the chunking of the audio to 'ctc' length. It works great though, the best ASR I have seen on an Arm-based SBC.
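For context on what that chunking looks like, below is a minimal C sketch of the kind of sliding-window loop the stream example appears to use; it is an illustration only, not the actual stream.cpp code. capture_read() is a hypothetical stand-in for whatever fills a buffer with new microphone samples (the real example uses SDL), and the whisper.h calls (whisper_full, whisper_full_n_segments, whisper_full_get_segment_text, WHISPER_SAMPLE_RATE) are assumed from the current header, so names may differ between revisions.

// streaming_sketch.c - illustration only; not the actual stream.cpp code
#include <stdio.h>
#include <string.h>
#include "whisper.h"

#define STEP_MS   4000                                        // --step
#define LENGTH_MS 8000                                        // --length
#define N_STEP    ((STEP_MS   * WHISPER_SAMPLE_RATE) / 1000)  //  64000 samples
#define N_LEN     ((LENGTH_MS * WHISPER_SAMPLE_RATE) / 1000)  // 128000 samples

// hypothetical audio source: blocks until n samples of 16 kHz mono float PCM are available
extern void capture_read(float * dst, int n);

void stream_loop(struct whisper_context * ctx) {
    static float window [N_LEN];   // last LENGTH_MS of audio fed to the model
    static float stepbuf[N_STEP];  // freshly captured STEP_MS of audio

    struct whisper_full_params params = whisper_full_default_params(WHISPER_SAMPLING_GREEDY);

    for (;;) {
        capture_read(stepbuf, N_STEP);

        // slide the window left by one step and append the new samples, so
        // consecutive inference calls see overlapping audio - that overlap is
        // one plausible source of the occasional word repetition
        memmove(window, window + N_STEP, (N_LEN - N_STEP) * sizeof(float));
        memcpy (window + N_LEN - N_STEP, stepbuf, N_STEP * sizeof(float));

        if (whisper_full(ctx, params, window, N_LEN) == 0) {
            const int n_seg = whisper_full_n_segments(ctx);
            for (int i = 0; i < n_seg; ++i) {
                printf("%s", whisper_full_get_segment_text(ctx, i));
            }
            printf("\n");
        }
    }
}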

@ggerganov Actually, playing around with the streaming version I can replicate what it does. It's extremely hard to reproduce with streaming itself, but if, for example, you run ./main -m ./models/ggml-tiny.en.bin -f ./samples/jfk.wav -t 8 you get:

[00:00:00.000 --> 00:00:07.540]   And so my fellow Americans ask not what your country can do for you
[00:00:07.540 --> 00:00:10.160]   ask what you can do for your country.
[00:00:10.160 --> 00:00:30.000]   You can do for your country

"ask what you can do for your country" is correct, as the timings of approx [00:00:07.540 --> 00:00:10.160] show. It still tacks on what seems to be, from approx 8.8 secs, "You can do for your country" as the fictitious [00:00:10.160 --> 00:00:30.000] segment of the OpenAI 30-sec window. The same happens in streaming mode with short sentences; it's blazing fast in that mode, but somehow the input and output are not being cleared for that time step.

I haven't worked it out, because if I record my own 'turn on the light' sample:

rock@rock-5b:~/whisper.cpp$ ./main -m ./models/ggml-tiny.en.bin -f ./samples/turnonthelight.wav -t 8
whisper_model_load: loading model from './models/ggml-tiny.en.bin'
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 384
whisper_model_load: n_audio_head  = 6
whisper_model_load: n_audio_layer = 4
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 384
whisper_model_load: n_text_head   = 6
whisper_model_load: n_text_layer  = 4
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 1
whisper_model_load: mem_required  = 390.00 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: ggml ctx size =  73.58 MB
whisper_model_load: memory size =    11.41 MB
whisper_model_load: model size  =    73.54 MB

system_info: n_threads = 8 / 8 | AVX2 = 0 | AVX512 = 0 | NEON = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 |

main: processing './samples/turnonthelight.wav' (46068 samples, 2.9 sec), 8 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...

[00:00:00.000 --> 00:00:03.000]   turn on the light.

whisper_print_timings:     load time =   234.49 ms
whisper_print_timings:      mel time =    56.18 ms
whisper_print_timings:   sample time =     3.20 ms
whisper_print_timings:   encode time =   672.67 ms / 168.17 ms per layer
whisper_print_timings:   decode time =    65.74 ms / 16.44 ms per layer
whisper_print_timings:    total time =  1032.61 ms
rock@rock-5b:~/whisper.cpp$ ./main -m ./models/ggml-tiny.en.bin -f ./samples/turnonthelight.wav -t 8
whisper_model_load: loading model from './models/ggml-tiny.en.bin'
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 384
whisper_model_load: n_audio_head  = 6
whisper_model_load: n_audio_layer = 4
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 384
whisper_model_load: n_text_head   = 6
whisper_model_load: n_text_layer  = 4
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 1
whisper_model_load: mem_required  = 390.00 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: ggml ctx size =  73.58 MB
whisper_model_load: memory size =    11.41 MB
whisper_model_load: model size  =    73.54 MB

system_info: n_threads = 8 / 8 | AVX2 = 0 | AVX512 = 0 | NEON = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 |

main: processing './samples/turnonthelight.wav' (46068 samples, 2.9 sec), 8 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...

[00:00:00.000 --> 00:00:03.000]   turn on the light.

whisper_print_timings:     load time =   234.23 ms
whisper_print_timings:      mel time =    49.01 ms
whisper_print_timings:   sample time =     3.17 ms
whisper_print_timings:   encode time =   665.13 ms / 166.28 ms per layer
whisper_print_timings:   decode time =   145.97 ms / 36.49 ms per layer
whisper_print_timings:    total time =  1097.83 ms

Perfect and extremely fast
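The fictitious trailing segment above comes from Whisper always working on a 30-second window: the roughly 11-second jfk.wav is zero-padded to 30 seconds, and the decoder sometimes emits a segment that lives entirely in the padded tail. Below is a hedged C sketch of how a caller could drop such segments by comparing each segment's start time against the real audio length. It assumes the whisper.h C API in this repo (whisper_full, whisper_full_n_segments, whisper_full_get_segment_t0/t1, whisper_full_get_segment_text, WHISPER_SAMPLE_RATE) and assumes segment timestamps are in 10 ms units, so treat it as a workaround sketch rather than a fix for the underlying issue.

// clip_padding.c - hedged sketch; filters segments emitted past the end of the real audio
#include <stdint.h>
#include <stdio.h>
#include "whisper.h"

// Transcribe `pcm` (n_samples of 16 kHz mono float PCM; loading the WAV is left
// to the caller) and print only the segments that start before the audio ends.
void transcribe_and_clip(struct whisper_context * ctx, const float * pcm, int n_samples) {
    struct whisper_full_params params = whisper_full_default_params(WHISPER_SAMPLING_GREEDY);

    if (whisper_full(ctx, params, pcm, n_samples) != 0) {
        fprintf(stderr, "whisper_full failed\n");
        return;
    }

    // real audio length in the same (assumed 10 ms) units as the segment timestamps
    const int64_t t_audio = (int64_t) n_samples * 100 / WHISPER_SAMPLE_RATE;

    const int n_seg = whisper_full_n_segments(ctx);
    for (int i = 0; i < n_seg; ++i) {
        const int64_t t0 = whisper_full_get_segment_t0(ctx, i);
        const int64_t t1 = whisper_full_get_segment_t1(ctx, i);

        if (t0 >= t_audio) {
            continue; // segment lies entirely in the zero-padded tail - skip it
        }

        printf("[%6lld --> %6lld] %s\n", (long long) t0, (long long) t1,
               whisper_full_get_segment_text(ctx, i));
    }
}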

andres-ramirez-duque commented 1 year ago

Hi @ggerganov

I'm following the most recent developments, but I'm a bit confused (it is happening so quickly): are the stream version that works on the Raspberry and the initial_prompt option already fully integrated into the main branch?