Flash attention can now be enabled via `whisper_context.flash_attn = true`.
The examples use the command-line argument `-fa` to enable the kernels (similar to `llama.cpp`).
Performance gains are expected with the Metal and CUDA backends; on the CPU, enabling FA will likely degrade performance.
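For illustration, a minimal sketch of enabling the flag through the C API. The exact placement of the field (here assumed on the context parameters returned by `whisper_context_default_params()`) and the model path are assumptions, not verbatim from this note:

```cpp
// Minimal sketch: enable Flash Attention when creating a whisper context.
// Assumes the whisper.cpp C API (whisper_context_default_params,
// whisper_init_from_file_with_params, whisper_free) and a flash_attn
// field on the context params.
#include "whisper.h"

int main() {
    struct whisper_context_params cparams = whisper_context_default_params();
    cparams.flash_attn = true; // enable the FA kernels (beneficial on Metal/CUDA)

    struct whisper_context * ctx =
        whisper_init_from_file_with_params("models/ggml-base.en.bin", cparams);
    if (ctx == nullptr) {
        return 1;
    }

    // ... run transcription as usual (e.g. whisper_full) ...

    whisper_free(ctx);
    return 0;
}
```

From the command line, the examples take `-fa`, e.g. something like `./main -m models/ggml-base.en.bin -f samples/jfk.wav -fa` (model and audio paths shown here are placeholders).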
Benchmark results are reported for the following hardware: M1 Pro, M2 Ultra, Ryzen 9 5950X + RTX 2060, and V100.