ggerganov / whisper.cpp

Port of OpenAI's Whisper model in C/C++
MIT License

Is it possible to run the openai-whisper ggml model on Raspberry Pi hardware? #7

Closed nyadla-sys closed 2 years ago

nyadla-sys commented 2 years ago

Is it possible to run this ggml model on Raspberry Pi hardware?

ggerganov commented 1 year ago

@StuartIanNaylor The repetition that you observed when running main was a side-effect of development in the stream branch. You should now switch to the master branch and the repetition will not occur when using main.

When running stream, you now have to add -ac 512 to make the encoder run about 3x faster. With this option, you might get some occasional repetitions. You can try using -ac 768 on your rk3588 - it will help resolve the problem, but it will be a bit slower. Notice that the -ac option is available only for stream and not for main.

@andres-ramirez-duque Yes, everything is now on master branch. I have summarised the usage on Raspberry Pi 4 here for more visibility: https://github.com/ggerganov/whisper.cpp/discussions/166

There is now also the option -kc, --keep-context which will keep the context from the previous segment and will hopefully improve the results (thanks to @meakbiyik for adding it). I haven't tested it, so not sure if it helps yet.

StuartIanNaylor commented 1 year ago

rock@rock-5b:~/nvme/whisper.cpp$ ./stream -m ./models/ggml-tiny.en.bin -t 8 --step 4000 --length 8000 -ac 512 -kc

  (wind blowing)
 turn on the light turn
 turn off the light
 turn on the light
 turn off the light
 light
 turn on the light
 I'm going to go ahead and see what I'm going to do. I'm going to go ahead and see what I'm going to do. I'm going
 to turn off the light. Turn off the light.
 [wind blowing]
 [wind blowing] [wind blowing] [wind blowing] [wind blowing] [wind blowing]
 [wind blowing] [wind blowing] [wind blowing] [wind blowing]
 [wind blowing] [wind blowing]

That was my weirdest run ever, as all I said, one after another, was "turn off the light, turn on the light". Entertaining, but where exactly "I'm going to go ahead and see what I'm going to do. I'm going to go ahead and see what I'm going to do. I'm going" came from, I am unsure :)

./stream -m ./models/ggml-tiny.en.bin -t 8 --step 4000 --length 8000 -ac 768
rock@rock-5b:~$ uptime
 21:32:40 up 35 min,  4 users,  load average: 4.27, 2.65, 1.73

Works far, far better, so I will stick with -ac 768, as -ac 512 does seem to make things much less accurate.

Also, I have AGC running on the mic, but there never seems to be a good AGC that is a bit more clever and maybe uses output or VAD with better attack & delay for speech sentences. PS: just testing how things work with a bit of noise pollution from Spotify in the background; with the way the 30 sec context works, the sentences do have logic, and it feels slightly surreal.

 China
 and on the light.
 I'm gonna forget the future of the only sound

Like Whisper is so philosophical, as nobody said that apart from me and lights :)

rock@rock-5b:~/nvme/whisper.cpp$ ./stream -m ./models/ggml-base.en.bin -t 8 --step 4000 --length 8000 -ac 768
rock@rock-5b:~$ uptime
 21:59:40 up  1:02,  4 users,  load average: 6.22, 4.51, 3.14

I seem to be able to run ggml-base.en.bin now, as I think that was a fail before; small.en, which would be glorious, now says it is dropping audio.

 stop playing the music
 you
 play me some music
 you
 you you you you
 you you you you you
 you you you you you you you
 you you you you you you
 you you you you you you you you
 you you you you you you you you you you you you
 you you you you you you you you you you you you
 you you you you you you you you you you you you
 you you you you you you you you you you you you
 you you you you you you you you you you you you you
 you you you you you you you you you you you you you
 you you you you you you you you you you you you you you
 you you you you you you you you you you you you you you
 you you you you you you you you you you you you you
 you you you you you you you you you you you you you
 you you you you you you you you you you you you
 you you you you you you you you you you you you you
 you you you you you you you you you you you you you
 you you you you you you you you you you you you you

I don't know where 'you' came from, as it wasn't said, and as I sat quiet it just kept going. A bit of a strange one with -kc enabled.

@ggerganov PS: the clarity and simplicity with which you present the compile options, and the overall layout, is as awesome as what the code manages. Seriously, it really is good.

On the stream build, and probably because there is no Mesa driver for the G610 and things are a bit rough and ready with the Rockchip Mali blob, make stream gives:

rock@rock-5b:~/nvme/whisper.cpp$ make stream
g++ -I. -I./examples -O3 -std=c++11  -pthread examples/stream/stream.cpp ggml.o whisper.o -o stream `sdl2-config --cflags --libs` 
/usr/bin/ld: /lib/aarch64-linux-gnu/libmali.so.1: .dynsym local symbol at index 3 (>= sh_info of 3)
/usr/bin/ld: /lib/aarch64-linux-gnu/libmali.so.1: .dynsym local symbol at index 4 (>= sh_info of 3)
/usr/bin/ld: /lib/aarch64-linux-gnu/libmali.so.1: .dynsym local symbol at index 5 (>= sh_info of 3)
/usr/bin/ld: /lib/aarch64-linux-gnu/libmali.so.1: .dynsym local symbol at index 6 (>= sh_info of 3)
/usr/bin/ld: /lib/aarch64-linux-gnu/libmali.so.1: .dynsym local symbol at index 7 (>= sh_info of 3)
/usr/bin/ld: /lib/aarch64-linux-gnu/libmali.so.1: .dynsym local symbol at index 8 (>= sh_info of 3)
/usr/bin/ld: /lib/aarch64-linux-gnu/libmali.so.1: .dynsym local symbol at index 9 (>= sh_info of 3)
andres-ramirez-duque commented 1 year ago

@ggerganov I want to share with you what I plan to use your development for. In the following repository (preliminary version) you will find a simple ROS wrapper for whisper.cpp: https://github.com/andres-ramirez-duque/ros_stream

The idea is to use it as a standard package within ROS that can be easily integrated into any robot to, for example, use voice commands to control the robot.

I integrated it as a service, so any ROS node can make a simple call and get a translation of an audio sample (step seconds long).

ggerganov commented 1 year ago

@StuartIanNaylor @andres-ramirez-duque Thanks for the feedback. I think you might be interested in the new command tool: examples/command It's a different approach for accepting voice commands which probably could be useful for embedded devices. Explanation of how it works is available in the issue: https://github.com/ggerganov/whisper.cpp/issues/171

fquirin commented 1 year ago

Hi everybody,

I've been doing some research on different implementations of Whisper on Raspberry Pi4 (actually Pi400) to evaluate a possible integration into SEPIA STT-Server. So far I've tested the original Whisper, a TFlite version and this Cpp version. Here are some results (en_lights_4s and en_test_far_close are recorded by myself):

Whisper original (4 threads - tiny.en):

jfk.wav:            8.67s for 11.0s audio
en_lights_4s.wav:       5.49s for 3.58s audio
en_test_far_close.wav:      18.2s for 29.9s audio

Whisper TFlite (1 thread??? - tiny.en):

jfk.wav:            6.27s for 11.0s audio
en_lights_4s.wav:       5.48s for 3.58s audio
en_test_far_close.wav:      8.54s for 29.9s audio

Whisper Cpp (4 threads - ggml-tiny.en.bin):

jfk.wav:            10.23s for 11.0s audio
en_lights_4s.wav:        9.77s for 3.58s audio
en_test_far_close.wav:      29.43s for 29.9s audio

It seems I'm getting even worse results with this Cpp version than the original 🤔. Am I missing something? My command looks like this:

./main -m models/ggml-tiny.en.bin -f samples/jfk.wav -t 4

Hi @StuartIanNaylor 👋 nice to see you here too 🙂

[EDIT] Just found out how to enable 4 threads for TFlite, it's even faster now ^^:

Whisper TFlite (4 threads - tiny.en):

jfk.wav:            4.32s for 11.0s audio
en_lights_4s.wav:       3.54s for 3.58s audio
en_test_far_close.wav:      6.60s for 29.9s audio
StuartIanNaylor commented 1 year ago

You probably need to build bench and run bench-all.sh from the extra folder, as https://github.com/ggerganov/whisper.cpp/issues/89# is a list of benchmark results so everybody has a like-for-like comparison.

ggerganov commented 1 year ago

@fquirin The result that you get for whisper.cpp is comparable to the one I get on my RPi4.

I think the other implementations simply have a faster matrix multiplication on armv7l compared to the one that we have in whisper.cpp and therefore have better performance. Using F32 BLAS in whisper.cpp does not help, so I am not yet sure how to achieve this performance. AFAIK the TFlite model does an efficient 8-bit inference somehow which further speeds-up the computation.

Do you mind sharing the output of cat /proc/cpuinfo for the Pi400?

StuartIanNaylor commented 1 year ago

The Pi400 is just a Pi4 @ 1.8GHz on a different board, as the keyboard acts as a heatsink, but it is still just a Pi4.

Does compiling with optimisations make much difference, as it does for the rk3588? https://github.com/ggerganov/whisper.cpp/blob/8738427dd60bda894df1ff3c12317cca2e960016/Makefile#L33

CFLAGS   = -I.              -O3 -std=c11   -march=native -ffast-math -fPIC
CXXFLAGS = -I. -I./examples -O3 -std=c++11 -march=native -ffast-math -fPIC
fquirin commented 1 year ago

The result that you get for whisper.cpp is comparable to the one I get on my RPi4.

Thanks for the info 👍

AFAIK the TFlite model does an efficient 8-bit inference somehow which further speeds-up the computation.

Actually I'm running the "normal" model since the int8 one is giving me an error: Cannot set tensor: Got value of type FLOAT32 but expected type INT64 for input 0

Do you mind sharing the output of cat /proc/cpuinfo for the Pi400?

Here you go:

processor       : 0
BogoMIPS        : 108.00
Features        : fp asimd evtstrm crc32 cpuid
CPU implementer : 0x41
CPU architecture: 8
CPU variant     : 0x0
CPU part        : 0xd08
CPU revision    : 3

processor       : 1
BogoMIPS        : 108.00
Features        : fp asimd evtstrm crc32 cpuid
CPU implementer : 0x41
CPU architecture: 8
CPU variant     : 0x0
CPU part        : 0xd08
CPU revision    : 3

processor       : 2
BogoMIPS        : 108.00
Features        : fp asimd evtstrm crc32 cpuid
CPU implementer : 0x41
CPU architecture: 8
CPU variant     : 0x0
CPU part        : 0xd08
CPU revision    : 3

processor       : 3
BogoMIPS        : 108.00
Features        : fp asimd evtstrm crc32 cpuid
CPU implementer : 0x41
CPU architecture: 8
CPU variant     : 0x0
CPU part        : 0xd08
CPU revision    : 3

Hardware        : BCM2835
Revision        : c03131
Serial          : 1000000097f4fd87
Model           : Raspberry Pi 400 Rev 1.1

The Pi400 is just a Pi4 @ 1.8GHz on a different board, as the keyboard acts as a heatsink, so it is just a Pi4.

That's what I understood as well. I hardly ever see any difference to my "normal" Pi4 boards.

Does compiling with optimisations make much difference, as it does for the rk3588?

What do I have to do, simply replace the lines in the Makefile? I'll give it a shot ...

fquirin commented 1 year ago

Does compiling with optimisations make much difference, as it does for the rk3588?

Only random fluctuations compared to the "non-optimized":

Whisper Cpp (4 threads - ggml-tiny.en.bin - built with: -march=native -ffast-math):

jfk.wav:            10.44s for 11.0s audio
en_lights_4s.wav:        9.76s for 3.58s audio
en_test_far_close.wav:      29.68s for 29.9s audio
StuartIanNaylor commented 1 year ago

Oh well, if you look at the benchmarks, Jinx from Mycroft/OVOS has been posting benches and he would have optimised as much as you can. On a rk3588 I get quite a big boost with the above. I am trying to clone tensorflow and it's coming in at about 300kb for some reason, as I forgot about the usefulsensors repo; I was asking if they were going to do an instance where the decoder/encoder is split so that Arm boards with a Mali might run one on the GPU via ArmNN.

fquirin commented 1 year ago

Oh well, if you look at the benchmarks, Jinx from Mycroft/OVOS has been posting benches and he would have optimised as much as you can.

That would be this list? Here are my benchmark results [EDIT]:

CPU               OS                                Config     Model    Th  Load  Enc.  Commit
Raspberry Pi 400  Raspberry Pi OS Bullseye Aarch64  NEON       tiny.en  4   696   8671  8738427

decoder/encoder is split so that Arm boards with a Mali might run one on the GPU via ArmNN

I always wondered if the Mali GPU would be good for anything STT/TTS related, but never saw anyone trying ^^.

StuartIanNaylor commented 1 year ago

Yeah if you look at https://github.com/StuartIanNaylor/rock5b-wav2letter-bench which is just a tidy up and rk3588 specific version of the ArmNN example

If I run on GPU

Inference End: Avg CPU%=3.5333333333333337
Runtime=0:00:02.483509
Realtime=x51.282278421378784

vs CPU

Inference End: Avg CPU%=45.88015873015866
Runtime=0:00:02.127279
Realtime=x59.86990893061042

So yeah, the Mali G610-MP4 almost matches the big.LITTLE A55/A76 8-core...

fquirin commented 1 year ago

So yeah, the Mali G610-MP4 almost matches the big.LITTLE A55/A76 8-core...

wohaaa 😲. I need to read about ArmNN I think ^^

Looking at my benchmark results I wonder why I get comparable results with NEON and not NEON BLAS. How would I activate "BLAS"?

StuartIanNaylor commented 1 year ago

make clean and then just prefix make with WHISPER_OPENBLAS=1, i.e. WHISPER_OPENBLAS=1 make ... I think, from memory, but you need the OpenBLAS libs installed

Finally downloaded it, and I guess these are single-threaded. Did you use time to get your results, as the reported inference time seems to just be an integer?

orangepi@orangepi5:~/openai-whisper/minimal_build$ time ./minimal ../models/whisper.tflite ../samples/jfk.wav

n_vocab:50257

mel.n_len3000

mel.n_mel:80
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
Inference time 2 seconds

[_SOT_][_NOT_] And so my fellow Americans ask not what your country can do for you, ask what you can do for your country.

real    0m4.395s
user    0m4.285s
sys     0m0.110s

orangepi@orangepi5:~/openai-whisper/minimal_build$ time ./minimal ../models/whisper.tflite ../samples/test.wav

n_vocab:50257
]
mel.n_len3000

mel.n_mel:80
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
Inference time 3 seconds

[_SOT_][_NOT_] Bili always listens to his mother. He always does what she says. If his mother says, brush your teeth, Bili brushes his teeth. If his mother says, go to bed, Bili goes to bed. Bili is a very good boy, a good boy listens to his mother. His mother does not have to ask him again. She asks him to do something one time and she does not ask again. Bili is a good boy. He does what his mother asks the first time. She does not have to ask again.

real    0m5.634s
user    0m5.477s
sys     0m0.157s

orangepi@orangepi5:~/openai-whisper/minimal_build$ time ./minimal ../models/whisper.tflite ../samples/test_1.wav

n_vocab:50257

mel.n_len3000

mel.n_mel:80
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
Inference time 3 seconds

[_SOT_][_NOT_] David lost his yellow pencil. He could not find it. Where is my yellow pencil? Yes his sister. His sister did not know. I don't know where your pencil is. She said David thought about it. He thought and thought. He used his yellow pencil before lunch. He used it to write a note to his teacher. The notes said, dear teacher, thank you for helping me, David. He put the note in the envelope where was the envelope?

real    0m5.402s
user    0m5.278s
sys     0m0.124s

Whisper.cpp

main: processing './samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...

[00:00:00.000 --> 00:00:07.740]   And so my fellow Americans ask not what your country can do for you
[00:00:07.740 --> 00:00:10.740]   ask what you can do for your country

whisper_print_timings:     load time =   249.38 ms
whisper_print_timings:      mel time =   167.99 ms
whisper_print_timings:   sample time =    25.07 ms
whisper_print_timings:   encode time =  2514.62 ms / 628.65 ms per layer
whisper_print_timings:   decode time =   291.98 ms / 73.00 ms per layer
whisper_print_timings:    total time =  3257.35 ms

With -march=native -ffast-math

main: processing './samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...

[00:00:00.000 --> 00:00:07.740]   And so my fellow Americans ask not what your country can do for you
[00:00:07.740 --> 00:00:10.740]   ask what you can do for your country

whisper_print_timings:     load time =   245.95 ms
whisper_print_timings:      mel time =   115.94 ms
whisper_print_timings:   sample time =    21.79 ms
whisper_print_timings:   encode time =  1145.98 ms / 286.50 ms per layer
whisper_print_timings:   decode time =   225.26 ms / 56.32 ms per layer
whisper_print_timings:    total time =  1763.36 ms
fquirin commented 1 year ago

make clean and then just prefix make with WHISPER_OPENBLAS=1, i.e. WHISPER_OPENBLAS=1 make ... I think, from memory, but you need the OpenBLAS libs installed

That actually did some magic, ... not much but a bit:

CPU               OS                                Config     Model    Th  Load  Enc.  Commit
Raspberry Pi 400  Raspberry Pi OS Bullseye Aarch64  NEON BLAS  tiny.en  4   711   7539  8738427

Finally downloaded it, and I guess these are single-threaded. Did you use time to get your results, as the reported inference time seems to just be an integer?

I used a Python script to run it: test2.zip python3 test2.py -h

I suggest using the same jfk.wav for testing to compare results, since file length and inference time seem to have a weird relation.

[EDIT] I just realized it was the same 😅. So your TFlite is actually slower than cpp?

StuartIanNaylor commented 1 year ago
Loading audio file: ./samples/test.wav
Samplerate: 16000, length: 30.0s
Calculating mel spectrogram...
Invoking interpreter ...
Preparing output data ...
Converting tokens ...
 Bili always listens to his mother. He always does what she says. If his mother says,

Inference took 1.29s for 30.0s audio file.

Loading audio file: ./samples/test_1.wav
Samplerate: 16000, length: 30.0s
Calculating mel spectrogram...
Invoking interpreter ...
Preparing output data ...
Converting tokens ...
 David lost his yellow pencil. He could not find it. Where is my yellow pencil? He asked his sister. His sister did not know. I don't know where your pencil is. She said David thought about it. He thought and thought. He used his yellow pencil for before lunch. He used it to write a note to his teacher. The notes said, dear teacher, thank you for helping me, David. He put the note in the envelope where was the envelope?

Inference took 2.21s for 30.0s audio file.

Loading audio file: ./samples/jfk.wav
Samplerate: 16000, length: 11.0s
Calculating mel spectrogram...
Invoking interpreter ...
Preparing output data ...
Converting tokens ...
 And so my fellow Americans ask not what your country can do for you, ask what you can do for your country.

Inference took 1.09s for 11.0s audio file.
fquirin commented 1 year ago

Inference took 1.29s for 30.0s audio file. ... Inference took 1.09s for 11.0s audio file.

Wait, what?? 😲 😱 🤯 Is this the pure power of the OrangePi5 or do you have some package optimizations compared to Pi4 as well ^^?

StuartIanNaylor commented 1 year ago

rk3588 A76/A55, probably because of Armv8.2: the A76 added SDOT and UDOT instructions, which provide access to many multiply-and-accumulate operations every cycle and so greatly speed up ML. That also helps Apple, but they added even further instructions.

It's just 4 gens up & 2.4 GHz.

fquirin commented 1 year ago

rk3588 A76/A55, probably because of Armv8.2: the A76 added SDOT and UDOT instructions, which provide access to many multiply-and-accumulate operations every cycle and so greatly speed up ML

Very impressive 😎 , ... I should get one too 😆. Hopefully a possible RPi5 will see the same improvements ^^

StuartIanNaylor commented 1 year ago

I used to call OrangePi one of the strange fruits, as the image support and availability were fairly stinky. All RK3588 boards are actually running a modified Android kernel tree (5.10), as it is still very fresh and a lot more kernel submissions are needed. I am finding it very solid and having to re-evaluate OrangePi, as I am waiting for delivery of an OrangePi02: that is a $20 4x A53 with a Mali G31 MP2 ($30 with taxes and shipping), and again it is interesting what the Mali can provide with ArmNN.

The RK3588 also has a 3-core 2 TOPS NPU, so yeah, they are great as a relatively budget ML powerhouse that overall could be something approx 20x the ML perf of a Pi4, and I would suggest 8GB as a minimum. There is a choice, as the rk3588s is basically cheaper because it casts off the Gen3.0 x4 PCIe M.2 that the full rk3588 has. So OrangePi is the cheaper rk3588s, and the Radxa Rock5b has 2x M.2 connectors, with the 2nd for a full-speed NVMe or even another 26 TOPS AI accelerator such as the Hailo-8 M.2. The Mac M1 mini 16GB is probably the ultimate 2001 Home Hal kit, as its idle wattage is insane considering it can ramp up to RTX2080ti+ ML perf, but CoreML & Metal probably mean Georgi's whisper.cpp has a limited lifespan on that platform without GPU support (the 8-core GPU sits idle), though I think tflite may be getting a boost as the weights file is 8-bit.

PS: ArmNN is for both CPU & GPU and is supposedly the fastest ML framework on Arm, or so Arm say. Dunno about the Pi5: with Broadcom, Raspberry are in a bad place at the moment and don't have a lot of choice, and Arm is very partisan because Broadcom was one of the biggest backers of the failed Nvidia takeover.
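
As a rough, untested sketch of what that could look like from Python: ArmNN ships a TfLite delegate that can be loaded through the external-delegate mechanism, something along these lines (the library name and option keys here are assumptions taken from the ArmNN delegate docs, not something I have verified on this board):

import tflite_runtime.interpreter as tflite

# assumed name/path of the ArmNN TfLite delegate library (depends on how ArmNN was built/installed)
armnn_delegate = tflite.load_delegate(
    "libarmnnDelegate.so",
    options={"backends": "GpuAcc,CpuAcc", "logging-severity": "info"})

interpreter = tflite.Interpreter(
    model_path="models/whisper.tflite",
    experimental_delegates=[armnn_delegate])
interpreter.allocate_tensors()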

PS: one thing that has got me head-scratching is that tf.lite via full TF is faster than the tflite runtime, which is weird?

fquirin commented 1 year ago

I checked some online shops today but it seems the Orange Pi 5 is a bit hard to get in EU right now. Best offer I could find was Amazon US for 120$ (83$ + shipping + tax) which doesn't yet include the postal service fee for handling customs :-|. I'll probably just wait a few months to see what happens ^^.

PS: ArmNN is for both CPU & GPU and is supposedly the fastest ML framework on Arm, or so Arm say

It looks very promising, but the question is: will anyone support it? It looks like there is some Tensorflow integration, does that mean it could accelerate TF code just by switching some packages?

Dunno about the Pi5 as with Broadcom, Raspberry are in a bad place at the moment and don't have a lot of choice

You think they will be stuck with Broadcom? Well, we'll see what they come up with this year. I don't expect to see any signs of a Pi5 before Q3 anyways.

PS: one thing that has got me head-scratching is that tf.lite via full TF is faster than the tflite runtime, which is weird?

Did you test that with my Whisper Python TF test code? What is the difference in speed?

StuartIanNaylor commented 1 year ago

On average about the same: 1 sec with full TF at 4 threads, then 1.6 sec for the same run but with the runtime.

import os
from timeit import default_timer as timer
import wave
import argparse
#import json

print(f'Importing tflite_runtime and numpy')
import tflite_runtime.interpreter as tflite
import numpy as np
#import torch

print(f'Importing whisper')
import whisper

parser = argparse.ArgumentParser(description="Running Whisper TFlite test inference.")
parser.add_argument("-f", "--folder", default="./test_wavs/", help="Folder with WAV input files")
parser.add_argument("-m", "--model", default="models/whisper.tflite", help="Path to model")
parser.add_argument("-t", "--threads", type=int, default=2, help="Threads used")
args = parser.parse_args()

model_path = args.model
print(f'Loading tflite model {model_path} ...')
interpreter = tflite.Interpreter(model_path, num_threads=args.threads)
interpreter.allocate_tensors()

def transcribe(audio_file):
    print(f'\nLoading audio file: {audio_file}')
    wf = wave.open(audio_file, "rb")
    sample_rate_orig = wf.getframerate()
    audio_length = wf.getnframes() * (1 / sample_rate_orig)
    if (wf.getnchannels() != 1 or wf.getsampwidth() != 2
        or wf.getcomptype() != "NONE" or sample_rate_orig != 16000):
        print("Audio file must be WAV format mono PCM.")
        exit (1)
    wf.close()
    print(f'Samplerate: {sample_rate_orig}, length: {audio_length}s')

    inference_start = timer()

    print(f'Calculating mel spectrogram...')
    mel_from_file = whisper.audio.log_mel_spectrogram(audio_file)
    input_data = whisper.audio.pad_or_trim(mel_from_file, whisper.audio.N_FRAMES)
    input_data = np.expand_dims(input_data, 0)
    #print("Input data shape:", input_data.shape)

    #input_data = np.frombuffer(wf.readframes(wf.getnframes()), np.int16)
    #input_data = np.random.randn(1, 256, 256, 3)

    input_details = interpreter.get_input_details()
    interpreter.resize_tensor_input(input_details[0]['index'], input_data.shape)

    interpreter.set_tensor(input_details[0]['index'], input_data)

    print("Invoking interpreter ...")
    interpreter.invoke()

    print("Preparing output data ...")
    output_details = interpreter.get_output_details()
    output_data = interpreter.get_tensor(output_details[0]['index'])
    #output_data = output_data.squeeze()
    #print(output_data)
    #np.savetxt("output.txt", output_data)
    #print(interpreter.get_output_details()[0])

    # convert tokens to text
    print("Converting tokens ...")
    wtokenizer = whisper.tokenizer.get_tokenizer(False, language="en")
    for token in output_data:
        #print(token)
        token[token == -100] = wtokenizer.eot
        text = wtokenizer.decode(token, skip_special_tokens=True)
        print(text)

    print("\nInference took {:.3}s for {:.3}s audio file.".format(
        timer() - inference_start, audio_length))

test_files = os.listdir(args.folder)
for file in test_files:
    if file.endswith(".wav"):
        transcribe(args.folder + file)

Dunno?

OrangePi: I think the 1st batch has gone and suppliers that had a few have bumped the price until a new official batch is released. Maybe I was just lucky and $80 delivered was just a 1st-batch sweetener.

fquirin commented 1 year ago

In my case tflite_runtime seems to be 1-1.5s slower on average 🙈. Looks like every piece of hardware has its own sweet spot ... how annoying :-|.

Maybe I was just lucky and $80 delivered was just a 1st-batch sweetener

I guess we'll see how the price evolves, currently SBC prices are totally crazy anyway =)

StuartIanNaylor commented 1 year ago

Nope, I am the same: the runtime is slower. It may be lighter on the system and deliberate, and that is your choice.

RK3588s might stay as that, as a huge step up, but it still struggles with SotA models: really, with accuracy falling off a cliff on the smaller models, 'small' should be the baseline with Whisper, and I would be interested in how that runs if converted. There is a whole grey area between microcontroller and central AI processor where much currently is a bad fit. I have never been a fan of Apple, but boy did they get it right with the M1, even with the asymmetric cores of a race-to-idle type design.

If the decoder/encoder were split into 2 models running CPU/GPU or CPU/NPU then things would likely be a better fit, but I would likely have to go back to the 90s, when hardware was lagging so far behind software design, to those Win95 days. So all very interesting for a geek.

fquirin commented 1 year ago

Nope, I am the same: the runtime is slower. It may be lighter on the system and deliberate, and that is your choice.

Oh I see, I must have mixed up your results. I think the tflite_runtime is missing some optimizations, as it is considerably slower on my desktop PC as well. But as you said, it is incredibly lightweight in comparison (like a GB smaller installation 😆).

never been a fan of Apple but boy did they get it right with the M1

Yeah, it's truly a crazy beast 🤩 My reference is always the Google offline ASR black-box on Android. I don't know what they do, but the model is incredibly small, precise and fast, even on my old 2017 mid-class Samsung A3, so I'm sure there is a lot of room for performance tweaking in the current open-source models.

j1nx commented 1 year ago

Running on OpenVoiceOS, RaspberryPi 4 - 2GB model.

With the tiny model;

mycroft@OpenVoiceOS-e3830c:~/whisper $ python3 test.py -f samples/ -m models/whisper.tflite -t 4
Importing tensorflow, numpy and torch
Importing whisper
Loading tflite model models/whisper.tflite ...
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.

Loading audio file: samples/test.wav
Samplerate: 16000, length: 30.0s
Calculating mel spectrogram...
Invoking interpreter ...
Preparing output data ...
Converting tokens ...
 Bili always listens to his mother. He always does what she says. If his mother says,

Inference took 4.74s for 30.0s audio file.

Loading audio file: samples/test_1.wav
Samplerate: 16000, length: 30.0s
Calculating mel spectrogram...
Invoking interpreter ...
Preparing output data ...
Converting tokens ...
 David lost his yellow pencil. He could not find it. Where is my yellow pencil? He asked his sister. His sister did not know. I don't know where your pencil is. She said David thought about it. He thought and thought. He used his yellow pencil for before lunch. He used it to write a note to his teacher. The notes said, dear teacher, thank you for helping me, David. He put the note in the envelope where was the envelope?

Inference took 8.57s for 30.0s audio file.

Loading audio file: samples/jfk.wav
Samplerate: 16000, length: 11.0s
Calculating mel spectrogram...
Invoking interpreter ...
Preparing output data ...
Converting tokens ...
 And so my fellow Americans ask not what your country can do for you, ask what you can do for your country.

Inference took 4.28s for 11.0s audio file.
nyadla-sys commented 1 year ago


Uploaded the whisper-small model, which is multilingual; you may need to use filters_vocab_multilingual.bin https://github.com/usefulsensors/openai-whisper/blob/main/models/whisper-small.tflite

nyadla-sys commented 1 year ago

@j1nx Please provide benchmark for whisper-small model

fquirin commented 1 year ago

@nyadla-sys do you have an example of how to use filters_vocab_multilingual.bin and what it does?

Btw, I've organized the Python example better in this repository now and made some smaller updates.

@ggerganov Just some useful info maybe: the TFLite version is much faster on the Raspberry Pi compared to the Cpp version, but on my Intel Core i3-12300T it is exactly the other way around (I put some benchmarks in the thread).

nyadla-sys commented 1 year ago

@fquirin To use a multilingual model in Python, you can simply change the line "wtokenizer = whisper.tokenizer.get_tokenizer(False, language="en")" to "wtokenizer = whisper.tokenizer.get_tokenizer(True, language="en")".

Note that filters_vocab_multilingual.bin is mainly for C++ applications.

Refer to the README for more details on how to use filters_vocab_gen.bin.
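
A minimal sketch of that change in the context of the Python test script above (the model path is just an example; the only real difference is the first argument to get_tokenizer):

import whisper

# True selects the multilingual tokenizer, False the English-only one;
# the rest of the test script (decode with skip_special_tokens=True) stays the same
wtokenizer = whisper.tokenizer.get_tokenizer(True, language="en")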

nyadla-sys commented 1 year ago

@fquirin Thanks for your work putting together a neat Python example to use the whisper tflite model.

nyadla-sys commented 1 year ago

@fquirin The TFLite framework kernels are very well optimized on Arm cores and on MacBooks.

j1nx commented 1 year ago

@nyadla-sys

With the minimal binary;

mycroft@OpenVoiceOS-e3830c:~/whisper $ minimal models/whisper-small.tflite samples/jfk.wav

n_vocab:50257

mel.n_len3000

mel.n_mel:80
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
ERROR: gather index out of bounds
ERROR: Node number 35 (GATHER) failed to invoke.
ERROR: Node number 3435 (WHILE) failed to invoke.
Error at ../minimal.cc:211

And with the test.py;

mycroft@OpenVoiceOS-e3830c:~/whisper $ python3 test.py -f samples/ -m models/whisper-small.tflite -t 4
Importing tensorflow, numpy and torch
Importing whisper
Loading tflite model models/whisper-small.tflite ...
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.

Loading audio file: samples/test.wav
Samplerate: 16000, length: 30.0s
Calculating mel spectrogram...
Invoking interpreter ...
Traceback (most recent call last):
  File "/home/mycroft/whisper/test.py", line 80, in <module>
    transcribe(args.folder + file)
  File "/home/mycroft/whisper/test.py", line 55, in transcribe
    interpreter.invoke()
  File "/usr/lib/python3.10/site-packages/tflite_runtime/interpreter.py", line 917, in invoke
    self._interpreter.Invoke()
RuntimeError: gather index out of boundsNode number 35 (GATHER) failed to invoke.Node number 3435 (WHILE) failed to invoke.
StuartIanNaylor commented 1 year ago

@nyadla-sys

Is it possible to add

  tflite::ops::builtin::BuiltinOpResolver resolver;
  tflite::InterpreterBuilder builder(*model, resolver);
  std::unique_ptr<tflite::Interpreter> interpreter;
  builder(&interpreter);
  interpreter->SetNumThreads(4);
  TFLITE_MINIMAL_CHECK(interpreter != nullptr);

i.e. interpreter->SetNumThreads(4) like the above, but pass the thread count as an argument?

j1nx commented 1 year ago

Indeed, I believe the C++ minimal binary is not using 4 threads.

BTW: Perhaps move this Python TFlite stuff over to; https://github.com/usefulsensors/openai-whisper/issues/15

nyadla-sys commented 1 year ago

Indeed, I believe the C++ minimal binary is not using 4 threads.

BTW: Perhaps move this Python TFlite stuff over to; usefulsensors/openai-whisper#15

Actually minimal.cc is not using threads; in order to use threads, please use the code below:

tflite::ops::builtin::BuiltinOpResolver resolver;
tflite::InterpreterBuilder builder(*model, resolver);
std::unique_ptr<tflite::Interpreter> interpreter;
builder(&interpreter);
const auto processor_count = std::thread::hardware_concurrency();
interpreter->SetNumThreads(processor_count);
TFLITE_MINIMAL_CHECK(interpreter != nullptr);

fquirin commented 1 year ago

To use a multilingual model in Python, you can simply change the line "wtokenizer = whisper.tokenizer.get_tokenizer(False, language="en")" to "wtokenizer = whisper.tokenizer.get_tokenizer(True, language="en")"

Interesting, ty! Does that mean output_data is actually language independent?

nyadla-sys commented 1 year ago

To use a multilingual model in Python, you can simply change the line "wtokenizer = whisper.tokenizer.get_tokenizer(False, language="en")" to "wtokenizer = whisper.tokenizer.get_tokenizer(True, language="en")"

Interesting, ty! Does that mean output_data is actually language independent?

Yes, I guess.

ggerganov commented 1 year ago

@fquirin @nyadla-sys I guess TFLite might be using the Raspberry Pi 4's GPU/QPUs to gain this performance. Is there a way to tell it to run only on the CPU?

StuartIanNaylor commented 1 year ago

@ggerganov I am not sure there is that much difference in performance to be honest, but no, TensorFlow doesn't use the Raspberry Pi's GPU, and if it did, the GPU is actually pretty poor: it's not really a GPU but a general-purpose DSP running code; in fact it is what loads the OS, as a Pi is really a video DSP with an Arm chip on top. The measurements need to be like for like, which maybe they are not, but overall things look very similar, and maybe TensorFlow is getting a boost because the weights file is 8-bit, though that doesn't seem to provide much; it needs a far better benchmark and outputs than it is getting currently. Using NEON is faster than the GPU, especially on the Pi4, and that is probably what is used for dot products, with OpenBLAS for MACs. The A76 added SDOT and UDOT instructions that provide access to many multiply-and-accumulate operations every cycle and so greatly speed up ML; that is Armv8.2, and I guess TensorFlow and the libs that target older Arm architectures have optimised (NEON) routines for what is missing, which are not implemented in GGML as they are now in silicon on more current SoCs and chosen by the compiler.

j1nx commented 1 year ago

Just compared all three different whisper inference methods on the Raspberry Pi4 - 2GB model running OpenVoiceOS (which is basically a minimal embedded OS created with Buildroot 2022.02).

With whisper.cpp 1.1.0;

mycroft@OpenVoiceOS-e3830c:~/whisper $ /usr/bin/whispercpp/main -t 4 -m models/ggml-tiny.en.bin -f samples/mm0.wav
whisper_init_from_file: loading model from 'models/ggml-tiny.en.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 384
whisper_model_load: n_audio_head  = 6
whisper_model_load: n_audio_layer = 4
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 384
whisper_model_load: n_text_head   = 6
whisper_model_load: n_text_layer  = 4
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 1
whisper_model_load: mem required  =  387.00 MB (+    3.00 MB per decoder)
whisper_model_load: kv self size  =    2.62 MB
whisper_model_load: kv cross size =    8.79 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: model ctx     =   73.58 MB
whisper_model_load: model size    =   73.54 MB

system_info: n_threads = 4 / 4 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |

main: processing 'samples/mm0.wav' (478214 samples, 29.9 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...

[00:00:00.000 --> 00:00:01.360]   This is the micro machine man, presenting the most
[00:00:01.360 --> 00:00:03.200]   midget miniature motorcade of micro machine.
[00:00:03.200 --> 00:00:04.760]   Each one has dramatic details for a picture of a decision
[00:00:04.760 --> 00:00:06.280]   page on a plus incredible micro machine pocket
[00:00:06.280 --> 00:00:07.600]   place that's physical police station fire station
[00:00:07.600 --> 00:00:08.680]   restaurant service station and more.
[00:00:08.680 --> 00:00:10.200]   Perfect pocket portable to take any place.
[00:00:10.200 --> 00:00:11.440]   And there are many miniature places to play with.
[00:00:11.440 --> 00:00:13.080]   Each one comes with its own special edition micro machine
[00:00:13.080 --> 00:00:15.120]   vehicle and fantastic features that miraculously move.
[00:00:15.120 --> 00:00:16.640]   Raise the boat lift at the airport, Marina Mann,
[00:00:16.640 --> 00:00:17.960]   the gun turret at the Army, based clean your car
[00:00:17.960 --> 00:00:19.120]   at the car, watch, raise the toll bridge.
[00:00:19.120 --> 00:00:20.320]   And these place that's fitted together to form
[00:00:20.320 --> 00:00:21.960]   a micro machine world micro machine pocket place
[00:00:21.960 --> 00:00:23.360]   that's such a menacing tiny so perfectly precise.
[00:00:23.360 --> 00:00:25.160]   So, Dazen, we detailed Joe on a pocket them all.
[00:00:25.160 --> 00:00:26.360]   Micro machines and micro machine pocket
[00:00:26.360 --> 00:00:27.640]   place that's sold separately from Glu.
[00:00:27.640 --> 00:00:29.640]   The smaller they are, the better they are.

whisper_print_timings:     load time =   729.03 ms
whisper_print_timings:      mel time =   527.18 ms
whisper_print_timings:   sample time =   381.33 ms
whisper_print_timings:   encode time = 19210.24 ms / 4802.56 ms per layer
whisper_print_timings:   decode time =  6497.03 ms / 1624.26 ms per layer
whisper_print_timings:    total time = 27374.97 ms

With whisper-tflite c++ version;

mycroft@OpenVoiceOS-e3830c:~/whisper $ minimal models/whisper.tflite samples/mm0.wav

n_vocab:50257

mel.n_len3000

mel.n_mel:80
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
Inference time 14 seconds

[_SOT_][_NOT_] This is the micro machine man presenting the most miniature motivator of micro machine. Each one has dramatic details for if it's in a position page on a plus incredible micro machine pocket place that's physical police station fire station restaurant service station and more. Perfect pocket portable to take any place. And there are many miniature places to play with each one comes with its own special edition micro machine vehicle and fantastic features that miraculously move. Raise the bolt, lift it, the airport, marina man, the gun turret at the army, there's clean your car at the car, watch, raise the toll bridge. And these place that's fitted together to form a micro machine world micro machine pocket place that's so tremendously tiny so perfectly precise. So, doesn't we detail Joe on a pocket them all micro machines and micro machine pocket place that's sold separately from Glu. The smaller they are, the better they are.

And with the whisper-tflite model using Python;

mycroft@OpenVoiceOS-e3830c:~/whisper $ python3 test.py -f samples/ -m models/whisper.tflite -t 4 -r 2
Importing tflite_runtime
Importing numpy
Importing whisper
Loading tflite model models/whisper.tflite ...
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.

Loading audio file: samples/mm0.wav
Samplerate: 16000, length: 29.888375s
Calculating mel spectrogram...
Invoking interpreter ...
Preparing output data ...
Converting tokens ...
 This is the micro machine man presenting the most miniature motorcade of micro machine. Each one has dramatic details for if it's in the position page on. Plus, incredible micro machine pocket place. That's physical police station fire station restaurant service station and more. Perfect pocket portable to take any place. And there are many miniature places to play with. Each one comes with its own special edition micro machine vehicle and fantastic features that miraculously move. Raise the bolt. Lift it. The airport, Marina, man. The gun turret at the army base. Clean your car at the car. Raise it over. And these place sets fit together to form a micro machine world. Micro machine pocket place that's so tremendously tiny, so perfectly precise. So, doesn't we detailed Joe on a pocket them all? Micro machines and micro machine pocket place that's sold separately from Glube. The smaller they are, the better they are.

Inference took 14.03s for 29.89s audio file.
j1nx commented 1 year ago

And the same with a smaller audio sample.

whisper.cpp

mycroft@OpenVoiceOS-e3830c:~/whisper $ /usr/bin/whispercpp/main -t 4 -m models/ggml-tiny.en.bin -f samples/jfk.wav
whisper_init_from_file: loading model from 'models/ggml-tiny.en.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 384
whisper_model_load: n_audio_head  = 6
whisper_model_load: n_audio_layer = 4
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 384
whisper_model_load: n_text_head   = 6
whisper_model_load: n_text_layer  = 4
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 1
whisper_model_load: mem required  =  387.00 MB (+    3.00 MB per decoder)
whisper_model_load: kv self size  =    2.62 MB
whisper_model_load: kv cross size =    8.79 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: model ctx     =   73.58 MB
whisper_model_load: model size    =   73.54 MB

system_info: n_threads = 4 / 4 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |

main: processing 'samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...

[00:00:00.000 --> 00:00:07.740]   And so my fellow Americans ask not what your country can do for you
[00:00:07.740 --> 00:00:10.740]   ask what you can do for your country

whisper_print_timings:     load time =   735.46 ms
whisper_print_timings:      mel time =   196.64 ms
whisper_print_timings:   sample time =    45.40 ms
whisper_print_timings:   encode time =  9397.06 ms / 2349.27 ms per layer
whisper_print_timings:   decode time =   730.02 ms / 182.51 ms per layer
whisper_print_timings:    total time = 11128.93 ms

Whisper-tflite c++;

mycroft@OpenVoiceOS-e3830c:~/whisper $ minimal models/whisper.tflite samples/jfk.wav

n_vocab:50257

mel.n_len3000

mel.n_mel:80
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
Inference time 5 seconds

[_SOT_][_NOT_] And so my fellow Americans ask not what your country can do for you, ask what you can do for your country.

Whisper-tflite python interpreter;

mycroft@OpenVoiceOS-e3830c:~/whisper $ python3 test.py -f samples/ -m models/whisper.tflite -t 4 -r 2
Importing tflite_runtime
Importing numpy
Importing whisper
Loading tflite model models/whisper.tflite ...
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.

Loading audio file: samples/jfk.wav
Samplerate: 16000, length: 11.0s
Calculating mel spectrogram...
Invoking interpreter ...
Preparing output data ...
Converting tokens ...
 And so my fellow Americans ask not what your country can do for you, ask what you can do for your country.

Inference took 4.44s for 11.00s audio file.
ggerganov commented 1 year ago

@StuartIanNaylor @j1nx Thanks for the summary. But my hypothesis still remains to be confirmed or rejected.

Based on this link, TFLite will use available GPU or any hardware accelerator available on the device: https://www.tensorflow.org/lite/performance/delegates

I don't agree that the RPi4 GPU is poor. This link claims 13.5 - 32.0 GFLOPS for the Broadcom VideoCore VI GPU:

https://www.howtogeek.com/devops/raspberry-pi-4-good-enough-for-gaming/#:~:text=Raspberry%20Pi%204:%20GPU%20speed&text=The%20standard%20GPU%20in%20a,(estimates%20and%20calculations%20vary).

This presentation confirms the numbers using RPi4 QPUs (Quad Processing Units) and demonstrates significant performance improvement compared to CPU-only computation:

https://www.cs.ucr.edu/~mchow009/teaching/cs193/spring2021/slides/Raspberry_Pi_QPU.pdf

So the question remains: Are we comparing pure CPU NEON implementation (whisper.cpp) vs hardware accelerated one (TFLite)?

It is important, because if TFLite is really CPU-only, then there must be some optimization that I am missing and maybe I can add it to whisper.cpp

nyadla-sys commented 1 year ago

@ggerganov At this time tflite on the rpi4 uses only CPU and NEON optimizations; in order to get better performance we can offload tflite kernels to the GPU using the GPU delegate API, which is not done yet. If we do that, the performance is much better than what we see today on the rpi4.
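
To illustrate, from the Python side that would amount to passing a GPU delegate to the interpreter, roughly like this (assuming a TFLite build that actually ships the GPU delegate for this platform; the library name below is an assumption, and on the rpi4 it would also need working OpenCL/GL support):

import tflite_runtime.interpreter as tflite

# assumed name of the TFLite GPU delegate library (depends on the build)
gpu_delegate = tflite.load_delegate("libtensorflowlite_gpu_delegate.so")

interpreter = tflite.Interpreter(
    model_path="models/whisper.tflite",
    experimental_delegates=[gpu_delegate])
interpreter.allocate_tensors()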

fquirin commented 1 year ago

mycroft@OpenVoiceOS-e3830c:~/whisper $ python3 test.py -f samples/ -m models/whisper.tflite -t 4 -r 2 Importing tflite_runtime ... And so my fellow Americans ask not what your country can do for you, ask what you can do for your country.

Inference took 4.44s for 11.00s audio file.

@j1nx Did you change something for tflite_runtime? Because I'm still getting:

#tflite_runtime (v2.11.0):
 And so my fellow Americans ask not what your country can do for you, ask what you can do for your country.

Inference took 5.74s for 11.00s audio file.

And with tensorflow.lite:

#tensorflow.lite (v2.11.0 + tflite v2.10.0):
 And so my fellow Americans ask not what your country can do for you, ask what you can do for your country.

Inference took 4.17s for 11.00s audio file.
fquirin commented 1 year ago

If we do that the performance is much better than what we see today on rpi4

I guess that remains to be seen, since I tend to agree with Stuart's impression of the GPU; at least I haven't seen any acceleration of ML libs for the Pi yet. But 13.5 - 32.0 GFLOPS for the Broadcom VideoCore VI GPU doesn't sound too bad 🤔.

What is required to do that?

[EDIT]

This presentation confirms the numbers using RPi4 QPUs (Quad Processing Units) and demonstrate significant performance improvement compared to CPU-only computation:

https://www.cs.ucr.edu/~mchow009/teaching/cs193/spring2021/slides/Raspberry_Pi_QPU.pdf

Very interesting! 👍 🚀

j1nx commented 1 year ago

@ggerganov @fquirin

This is the buildroot script I use to build the tflite runtime (both python and c-api) within OpenVoiceOS; https://github.com/OpenVoiceOS/ovos-buildroot/blob/develop/buildroot-external/package/tensorflow-lite/tensorflow-lite.mk

It is important to say that I build tflite with CMake for a system-wide installation, so stuff like FlatBuffers and such is not pulled in by the CMake of tensorflow-lite but separately built and set as a dependency for that package. It should not be important, as I pull in the same version as tflite.

Here are all the different configuration options; https://github.com/OpenVoiceOS/ovos-buildroot/blob/develop/buildroot-external/package/tensorflow-lite/tensorflow-lite.mk#L29

Couple of things;

As you can see within the outputs, tflite uses the XNNpack delegate.

INFO: Created TensorFlow Lite XNNPACK delegate for CPU.

Which, according to the TensorFlow guys, is the best-performing one for CPU on embedded ARM devices and x86 under Linux: https://towardsdatascience.com/accelerating-tensorflow-lite-with-xnnpack-ece7dc8726d0

The TensorFlow Lite runtime will automatically select the best delegate for the system. That is why cpuinfo is a requirement. And as you see in all the above, it selects XNNPACK.

More info on the different delegates: https://www.tensorflow.org/lite/performance/delegates
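
If someone wants to check @ggerganov's CPU-only question directly, I believe the full tf.lite interpreter can be told to skip the default delegates, roughly like this (the option name is from memory, so treat it as an assumption; it should give a plain CPU reference run without XNNPACK for comparison):

import tensorflow as tf

# build the interpreter without the default (XNNPACK) delegate for a pure-CPU reference run
interpreter = tf.lite.Interpreter(
    model_path="models/whisper.tflite",
    num_threads=4,
    experimental_op_resolver_type=tf.lite.experimental.OpResolverType.BUILTIN_WITHOUT_DEFAULT_DELEGATES)
interpreter.allocate_tensors()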

j1nx commented 1 year ago

Running the model benchmark tool on the whisper.tflite (tiny.en) model;

mycroft@OpenVoiceOS-e3830c:~/whisper $ ./linux_aarch64_benchmark_model --graph=models/whisper.tflite --num_threads=4
STARTING!
Log parameter values verbosely: [0]
Num threads: [4]
Graph: [models/whisper.tflite]
#threads used for CPU inference: [4]
Loaded model models/whisper.tflite
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
The input model file size (MB): 40.9627
Initialized session in 14.651ms.
Running benchmark for at least 1 iterations and at least 0.5 seconds but terminate if exceeding 150 seconds.
count=1 curr=2912644

Running benchmark for at least 50 iterations and at least 1 seconds but terminate if exceeding 150 seconds.
count=50 first=2756885 curr=2770642 min=2729604 max=2783775 avg=2.75473e+06 std=11882

Inference timings in us: Init: 14651, First inference: 2912644, Warmup (avg): 2.91264e+06, Inference (avg): 2.75473e+06
Note: as the benchmark tool itself affects memory footprint, the following is only APPROXIMATE to the actual memory footprint of the model at runtime. Take the information at your discretion.
Memory footprint delta from the start of the tool (MB): init=10.3906 overall=252.566

And on the whisper-small.tflite (small-en) model;

mycroft@OpenVoiceOS-e3830c:~/whisper $ ./linux_aarch64_benchmark_model --graph=models/whisper-small.tflite --num_threads=4
STARTING!
Log parameter values verbosely: [0]
Num threads: [4]
Graph: [models/whisper-small.tflite]
#threads used for CPU inference: [4]
Loaded model models/whisper-small.tflite
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
The input model file size (MB): 248.401
Initialized session in 627.504ms.
Running benchmark for at least 1 iterations and at least 0.5 seconds but terminate if exceeding 150 seconds.
ERROR: gather index out of bounds
ERROR: Node number 35 (GATHER) failed to invoke.
ERROR: Node number 3435 (WHILE) failed to invoke.
count=1 curr=23827035

Benchmarking failed.

However we already discovered something is not right for/with the small model; https://github.com/usefulsensors/openai-whisper/issues/15

j1nx commented 1 year ago

Normal CPU and GPU (non opencl, normal instructions)

mycroft@OpenVoiceOS-e3830c:~/whisper $ ./linux_aarch64_benchmark_model_performance_options --graph=models/whisper.tflite --num_threads=4 --warmup_runs=1 --num_runs=50
STARTING!
The list of TFLite runtime options to be benchmarked: [all]
Log parameter values verbosely: [0]
Min num runs: [50]
Num threads: [1]
Min warmup runs: [1]
Graph: [models/whisper.tflite]
#threads used for CPU inference: [1]
Use gpu: [0]
Use Hexagon: [0]
Use xnnpack: [0]
Loaded model models/whisper.tflite
The input model file size (MB): 40.9627
Initialized session in 12.397ms.
Running benchmark for at least 1 iterations and at least 0.5 seconds but terminate if exceeding 150 seconds.
count=1 curr=5334923

Running benchmark for at least 50 iterations and at least 1 seconds but terminate if exceeding 150 seconds.
count=32 first=4779314 curr=4689512 min=4679189 max=4779314 avg=4.69509e+06 std=16547

Inference timings in us: Init: 12397, First inference: 5334923, Warmup (avg): 5.33492e+06, Inference (avg): 4.69509e+06
Note: as the benchmark tool itself affects memory footprint, the following is only APPROXIMATE to the actual memory footprint of the model at runtime. Take the information at your discretion.
Memory footprint delta from the start of the tool (MB): init=10.125 overall=252.926
Log parameter values verbosely: [0]
Min num runs: [50]
Num threads: [1]
Min warmup runs: [1]
Graph: [models/whisper.tflite]
Max initial profiling buffer entries: [1801]
#threads used for CPU inference: [1]
Use gpu: [1]
Use Hexagon: [0]
Use xnnpack: [0]
Loaded model models/whisper.tflite
The GPU delegate compile options are only supported on Android or iOS platforms or when the tool was built with -DCL_DELEGATE_NO_GL.
GPU acceleration is unsupported on this platform.
The input model file size (MB): 40.9627
Initialized session in 29.175ms.
Running benchmark for at least 1 iterations and at least 0.5 seconds but terminate if exceeding 150 seconds.
count=1 curr=4843822

Running benchmark for at least 50 iterations and at least 1 seconds but terminate if exceeding 150 seconds.
count=33 first=4708547 curr=4679901 min=4672405 max=4709786 avg=4.68214e+06 std=8533

Inference timings in us: Init: 29175, First inference: 4843822, Warmup (avg): 4.84382e+06, Inference (avg): 4.68214e+06
Note: as the benchmark tool itself affects memory footprint, the following is only APPROXIMATE to the actual memory footprint of the model at runtime. Take the information at your discretion.
Memory footprint delta from the start of the tool (MB): init=0 overall=29.5586

Twice as slow