ggerganov/whisper.cpp

Port of OpenAI's Whisper model in C/C++

whisper : use flash attention #2152

Closed: ggerganov closed this 5 months ago

ggerganov commented 5 months ago

Flash attention can now be enabled by setting `whisper_context_params.flash_attn = true`. The examples use the command-line argument `-fa` to enable the kernels (similar to llama.cpp).
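For reference, a minimal sketch of enabling the flag through the C API (the model path is a placeholder and error handling is trimmed):

```c
#include <stdio.h>

#include "whisper.h"

int main(void) {
    // start from the default context parameters, then opt in to the flash attention kernels
    struct whisper_context_params cparams = whisper_context_default_params();
    cparams.flash_attn = true;

    // "models/ggml-base.en.bin" is a placeholder path to a downloaded ggml model
    struct whisper_context * ctx =
        whisper_init_from_file_with_params("models/ggml-base.en.bin", cparams);
    if (ctx == NULL) {
        fprintf(stderr, "failed to initialize whisper context\n");
        return 1;
    }

    // ... run whisper_full() as usual ...

    whisper_free(ctx);
    return 0;
}
```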

Performance gains are expected on Metal and CUDA. On the CPU, enabling FA will likely degrade performance.

M1 Pro

| CPU | Config | Model | Th | FA | Enc. | Dec. | Bch5 | PP | Commit |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| M1 Pro | METAL | tiny | 1 | 0 | 39.21 | 1.74 | 0.61 | 0.04 | 22c96b4 |
| M1 Pro | METAL | base | 1 | 0 | 70.76 | 2.60 | 0.93 | 0.06 | 22c96b4 |
| M1 Pro | METAL | small | 1 | 0 | 217.28 | 6.42 | 2.14 | 0.17 | 22c96b4 |
| M1 Pro | METAL | medium | 1 | 0 | 596.74 | 14.43 | 4.75 | 0.45 | 22c96b4 |

| CPU | Config | Model | Th | FA | Enc. | Dec. | Bch5 | PP | Commit |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| M1 Pro | METAL | tiny | 1 | 1 | 30.77 | 1.59 | 0.54 | 0.03 | 22c96b4 |
| M1 Pro | METAL | base | 1 | 1 | 60.42 | 2.29 | 0.81 | 0.05 | 22c96b4 |
| M1 Pro | METAL | small | 1 | 1 | 183.82 | 5.12 | 1.81 | 0.14 | 22c96b4 |
| M1 Pro | METAL | medium | 1 | 1 | 517.92 | 11.60 | 4.01 | 0.38 | 22c96b4 |

M2 Ultra

| CPU | Config | Model | Th | FA | Enc. | Dec. | Bch5 | PP | Commit |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| M2 ULTRA | METAL | tiny | 1 | 0 | 12.32 | 1.35 | 0.49 | 0.01 | 22c96b4 |
| M2 ULTRA | METAL | tiny-q5_0 | 1 | 0 | 11.65 | 1.30 | 0.51 | 0.01 | 22c96b4 |
| M2 ULTRA | METAL | tiny-q5_1 | 1 | 0 | 12.08 | 1.30 | 0.51 | 0.01 | 22c96b4 |
| M2 ULTRA | METAL | base | 1 | 0 | 17.58 | 1.90 | 0.76 | 0.02 | 22c96b4 |
| M2 ULTRA | METAL | base-q5_0 | 1 | 0 | 18.89 | 1.86 | 0.79 | 0.02 | 22c96b4 |
| M2 ULTRA | METAL | base-q5_1 | 1 | 0 | 20.69 | 1.88 | 0.79 | 0.02 | 22c96b4 |
| M2 ULTRA | METAL | small | 1 | 0 | 49.32 | 3.85 | 1.71 | 0.05 | 22c96b4 |
| M2 ULTRA | METAL | small-q5_0 | 1 | 0 | 54.91 | 3.81 | 1.82 | 0.06 | 22c96b4 |
| M2 ULTRA | METAL | small-q5_1 | 1 | 0 | 54.92 | 3.81 | 1.79 | 0.06 | 22c96b4 |
| M2 ULTRA | METAL | medium | 1 | 0 | 134.34 | 8.04 | 3.82 | 0.13 | 22c96b4 |
| M2 ULTRA | METAL | medium-q5_0 | 1 | 0 | 151.68 | 7.59 | 4.07 | 0.14 | 22c96b4 |
| M2 ULTRA | METAL | medium-q5_1 | 1 | 0 | 151.58 | 7.67 | 4.07 | 0.14 | 22c96b4 |
| M2 ULTRA | METAL | medium-dis | 1 | 0 | 120.82 | 1.07 | 0.41 | 0.02 | 22c96b4 |
| M2 ULTRA | METAL | large-v2 | 1 | 0 | 235.63 | 12.27 | 5.85 | 0.22 | 22c96b4 |
| M2 ULTRA | METAL | large-v2-q5_0 | 1 | 0 | 273.38 | 11.17 | 6.40 | 0.26 | 22c96b4 |
| M2 ULTRA | METAL | large-v2-q5_1 | 1 | 0 | 272.44 | 11.32 | 6.29 | 0.26 | 22c96b4 |
| M2 ULTRA | METAL | large-v2-dis | 1 | 0 | 212.51 | 1.20 | 0.47 | 0.02 | 22c96b4 |

| CPU | Config | Model | Th | FA | Enc. | Dec. | Bch5 | PP | Commit |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| M2 ULTRA | METAL | tiny | 1 | 1 | 9.07 | 1.33 | 0.45 | 0.01 | 22c96b4 |
| M2 ULTRA | METAL | tiny-q5_0 | 1 | 1 | 9.74 | 1.33 | 0.47 | 0.01 | 22c96b4 |
| M2 ULTRA | METAL | tiny-q5_1 | 1 | 1 | 8.93 | 1.31 | 0.46 | 0.01 | 22c96b4 |
| M2 ULTRA | METAL | base | 1 | 1 | 15.75 | 1.87 | 0.71 | 0.02 | 22c96b4 |
| M2 ULTRA | METAL | base-q5_0 | 1 | 1 | 17.04 | 1.83 | 0.74 | 0.02 | 22c96b4 |
| M2 ULTRA | METAL | base-q5_1 | 1 | 1 | 17.17 | 1.83 | 0.74 | 0.02 | 22c96b4 |
| M2 ULTRA | METAL | small | 1 | 1 | 42.33 | 3.64 | 1.60 | 0.05 | 22c96b4 |
| M2 ULTRA | METAL | small-q5_0 | 1 | 1 | 47.61 | 3.63 | 1.70 | 0.05 | 22c96b4 |
| M2 ULTRA | METAL | small-q5_1 | 1 | 1 | 47.70 | 3.66 | 1.68 | 0.05 | 22c96b4 |
| M2 ULTRA | METAL | medium | 1 | 1 | 114.42 | 7.53 | 3.55 | 0.11 | 22c96b4 |
| M2 ULTRA | METAL | medium-q5_0 | 1 | 1 | 132.63 | 7.02 | 3.77 | 0.13 | 22c96b4 |
| M2 ULTRA | METAL | medium-q5_1 | 1 | 1 | 132.28 | 7.10 | 3.76 | 0.13 | 22c96b4 |
| M2 ULTRA | METAL | medium-dis | 1 | 1 | 102.34 | 1.01 | 0.42 | 0.01 | 22c96b4 |
| M2 ULTRA | METAL | large-v2 | 1 | 1 | 203.01 | 11.03 | 5.45 | 0.20 | 22c96b4 |
| M2 ULTRA | METAL | large-v2-q5_0 | 1 | 1 | 240.05 | 10.18 | 5.98 | 0.23 | 22c96b4 |
| M2 ULTRA | METAL | large-v2-q5_1 | 1 | 1 | 239.22 | 10.23 | 5.87 | 0.23 | 22c96b4 |
| M2 ULTRA | METAL | large-v2-dis | 1 | 1 | 181.14 | 1.14 | 0.48 | 0.02 | 22c96b4 |

Ryzen 9 5950X + RTX 2060

| GPU | Config | Model | Th | FA | Enc. | Dec. | Bch5 | PP | Commit |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| RTX 2060 | AVX2 CUDA | tiny | 8 | 0 | 12.54 | 0.93 | 0.29 | 0.02 | 22c96b4 |
| RTX 2060 | AVX2 CUDA | tiny-q5_0 | 8 | 0 | 12.73 | 0.98 | 0.24 | 0.02 | 22c96b4 |
| RTX 2060 | AVX2 CUDA | tiny-q5_1 | 8 | 0 | 12.72 | 0.99 | 0.24 | 0.02 | 22c96b4 |
| RTX 2060 | AVX2 CUDA | base | 8 | 0 | 24.14 | 1.28 | 0.41 | 0.03 | 22c96b4 |
| RTX 2060 | AVX2 CUDA | base-q5_0 | 8 | 0 | 24.58 | 1.38 | 0.35 | 0.03 | 22c96b4 |
| RTX 2060 | AVX2 CUDA | base-q5_1 | 8 | 0 | 24.58 | 1.37 | 0.35 | 0.03 | 22c96b4 |
| RTX 2060 | AVX2 CUDA | small | 8 | 0 | 74.70 | 2.91 | 0.84 | 0.07 | 22c96b4 |
| RTX 2060 | AVX2 CUDA | small-q5_0 | 8 | 0 | 76.12 | 2.84 | 0.77 | 0.08 | 22c96b4 |
| RTX 2060 | AVX2 CUDA | small-q5_1 | 8 | 0 | 76.14 | 2.84 | 0.76 | 0.08 | 22c96b4 |
| RTX 2060 | AVX2 CUDA | medium | 8 | 0 | 200.69 | 6.46 | 1.83 | 0.17 | 22c96b4 |
| RTX 2060 | AVX2 CUDA | medium-q5_0 | 8 | 0 | 204.80 | 5.90 | 1.65 | 0.19 | 22c96b4 |
| RTX 2060 | AVX2 CUDA | medium-q5_1 | 8 | 0 | 205.61 | 5.85 | 1.61 | 0.19 | 22c96b4 |
| RTX 2060 | AVX2 CUDA | medium-dis | 8 | 0 | 186.17 | 0.86 | 0.24 | 0.02 | 22c96b4 |
| RTX 2060 | AVX2 CUDA | large-v2 | 8 | 0 | 347.22 | 10.36 | 2.82 | 0.29 | 22c96b4 |
| RTX 2060 | AVX2 CUDA | large-v2-q5_0 | 8 | 0 | 357.06 | 8.81 | 2.58 | 0.34 | 22c96b4 |
| RTX 2060 | AVX2 CUDA | large-v2-q5_1 | 8 | 0 | 356.97 | 8.62 | 2.49 | 0.33 | 22c96b4 |
| RTX 2060 | AVX2 CUDA | large-v2-dis | 8 | 0 | 318.05 | 1.03 | 0.34 | 0.04 | 22c96b4 |

| GPU | Config | Model | Th | FA | Enc. | Dec. | Bch5 | PP | Commit |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| RTX 2060 | AVX2 CUDA | tiny | 8 | 1 | 7.21 | 0.76 | 0.29 | 0.02 | 22c96b4 |
| RTX 2060 | AVX2 CUDA | tiny-q5_0 | 8 | 1 | 7.42 | 0.82 | 0.18 | 0.02 | 22c96b4 |
| RTX 2060 | AVX2 CUDA | tiny-q5_1 | 8 | 1 | 7.38 | 0.82 | 0.18 | 0.02 | 22c96b4 |
| RTX 2060 | AVX2 CUDA | base | 8 | 1 | 13.49 | 1.04 | 0.36 | 0.02 | 22c96b4 |
| RTX 2060 | AVX2 CUDA | base-q5_0 | 8 | 1 | 13.94 | 1.13 | 0.26 | 0.03 | 22c96b4 |
| RTX 2060 | AVX2 CUDA | base-q5_1 | 8 | 1 | 13.94 | 1.14 | 0.26 | 0.03 | 22c96b4 |
| RTX 2060 | AVX2 CUDA | small | 8 | 1 | 42.81 | 2.33 | 0.69 | 0.05 | 22c96b4 |
| RTX 2060 | AVX2 CUDA | small-q5_0 | 8 | 1 | 44.43 | 2.25 | 0.59 | 0.06 | 22c96b4 |
| RTX 2060 | AVX2 CUDA | small-q5_1 | 8 | 1 | 44.11 | 2.24 | 0.58 | 0.06 | 22c96b4 |
| RTX 2060 | AVX2 CUDA | medium | 8 | 1 | 115.47 | 5.17 | 1.45 | 0.11 | 22c96b4 |
| RTX 2060 | AVX2 CUDA | medium-q5_0 | 8 | 1 | 120.37 | 4.63 | 1.25 | 0.13 | 22c96b4 |
| RTX 2060 | AVX2 CUDA | medium-q5_1 | 8 | 1 | 120.28 | 4.55 | 1.21 | 0.13 | 22c96b4 |
| RTX 2060 | AVX2 CUDA | medium-dis | 8 | 1 | 101.69 | 0.75 | 0.20 | 0.02 | 22c96b4 |
| RTX 2060 | AVX2 CUDA | large-v2 | 8 | 1 | 205.67 | 8.49 | 2.19 | 0.18 | 22c96b4 |
| RTX 2060 | AVX2 CUDA | large-v2-q5_0 | 8 | 1 | 214.07 | 6.88 | 1.94 | 0.22 | 22c96b4 |
| RTX 2060 | AVX2 CUDA | large-v2-q5_1 | 8 | 1 | 213.98 | 6.70 | 1.86 | 0.22 | 22c96b4 |
| RTX 2060 | AVX2 CUDA | large-v2-dis | 8 | 1 | 176.71 | 0.91 | 0.31 | 0.03 | 22c96b4 |

V100

| GPU | Config | Model | Th | FA | Enc. | Dec. | Bch5 | PP | Commit |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| V100 | AVX2 CUDA | tiny | 1 | 0 | 6.21 | 1.11 | 0.30 | 0.02 | 22c96b4 |
| V100 | AVX2 CUDA | tiny-q5_1 | 1 | 0 | 5.97 | 1.10 | 0.26 | 0.02 | 22c96b4 |
| V100 | AVX2 CUDA | base | 1 | 0 | 10.95 | 1.47 | 0.42 | 0.03 | 22c96b4 |
| V100 | AVX2 CUDA | base-q5_1 | 1 | 0 | 11.13 | 1.53 | 0.36 | 0.03 | 22c96b4 |
| V100 | AVX2 CUDA | small | 1 | 0 | 31.57 | 2.96 | 0.84 | 0.05 | 22c96b4 |
| V100 | AVX2 CUDA | small-q5_1 | 1 | 0 | 32.19 | 3.14 | 0.75 | 0.05 | 22c96b4 |
| V100 | AVX2 CUDA | medium | 1 | 0 | 85.88 | 6.49 | 1.80 | 0.10 | 22c96b4 |
| V100 | AVX2 CUDA | medium-q5_0 | 1 | 0 | 87.53 | 5.82 | 1.37 | 0.10 | 22c96b4 |
| V100 | AVX2 CUDA | large-v2 | 1 | 0 | 142.23 | 8.92 | 2.62 | 0.15 | 22c96b4 |

| GPU | Config | Model | Th | FA | Enc. | Dec. | Bch5 | PP | Commit |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| V100 | AVX2 CUDA | tiny | 1 | 1 | 3.96 | 0.82 | 0.24 | 0.02 | 22c96b4 |
| V100 | AVX2 CUDA | tiny-q5_1 | 1 | 1 | 4.05 | 0.85 | 0.18 | 0.02 | 22c96b4 |
| V100 | AVX2 CUDA | base | 1 | 1 | 7.21 | 1.16 | 0.36 | 0.02 | 22c96b4 |
| V100 | AVX2 CUDA | base-q5_1 | 1 | 1 | 7.39 | 1.21 | 0.26 | 0.02 | 22c96b4 |
| V100 | AVX2 CUDA | small | 1 | 1 | 19.81 | 2.41 | 0.71 | 0.04 | 22c96b4 |
| V100 | AVX2 CUDA | small-q5_1 | 1 | 1 | 20.50 | 2.31 | 0.51 | 0.04 | 22c96b4 |
| V100 | AVX2 CUDA | medium | 1 | 1 | 56.02 | 4.89 | 1.44 | 0.07 | 22c96b4 |
| V100 | AVX2 CUDA | medium-q5_0 | 1 | 1 | 57.85 | 4.73 | 1.09 | 0.08 | 22c96b4 |
| V100 | AVX2 CUDA | large-v2 | 1 | 1 | 92.73 | 7.18 | 2.14 | 0.10 | 22c96b4 |

ggerganov commented 5 months ago

Looking for feedback on the performance and accuracy. The plan is to merge this PR and then release v1.6.0.

Run the tools as usual and add `-fa` to the command line to enable flash attention.
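For example, with the `main` example binary (the model and audio paths below are illustrative):

```
./main -m models/ggml-base.en.bin -f samples/jfk.wav -fa
```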