argmaxinc / WhisperKit

On-device Speech Recognition for Apple Silicon
https://takeargmax.com/blog/whisperkit
MIT License

Use IOSurface-backed MLMultiArrays for float16 #130

Closed: smpanaro closed this 5 months ago

smpanaro commented 5 months ago

Copying large inputs, such as the KV cache, can add prediction latency on some device/OS combinations. On an M1 Max running macOS Ventura, this copy accounts for roughly 25% of the prediction latency for whisper-large-v3. IOSurface-backed MLMultiArrays avoid this copy.
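For readers unfamiliar with the technique, here is a minimal Swift sketch of allocating a float16 MLMultiArray on top of an IOSurface-backed `CVPixelBuffer`. It assumes macOS 12 / iOS 15 or later, where `MLMultiArray(pixelBuffer:shape:)` is available. The helper name `makeIOSurfaceFloat16Array` is hypothetical and is not necessarily how this PR implements the change. The pixel buffer uses the `kCVPixelFormatType_OneComponent16Half` format. Its width is the array's last dimension, and its height is the product of the remaining dimensions.

```swift
import CoreML
import CoreVideo

/// Hypothetical helper (not WhisperKit's API): allocates a float16 MLMultiArray
/// whose storage is an IOSurface-backed CVPixelBuffer, so Core ML can consume
/// it without copying the data into its own buffer first.
@available(macOS 12.0, iOS 15.0, *)
func makeIOSurfaceFloat16Array(shape: [Int]) -> MLMultiArray? {
    // The last dimension maps to the pixel buffer's width.
    guard let width = shape.last, width > 0 else { return nil }
    // Every other dimension is folded into the pixel buffer's height.
    let height = shape.dropLast().reduce(1, *)
    guard height > 0 else { return nil }

    // An empty dictionary under this key asks CoreVideo for IOSurface backing.
    let attributes: [String: Any] = [
        kCVPixelBufferIOSurfacePropertiesKey as String: [String: Any]()
    ]

    var pixelBuffer: CVPixelBuffer?
    let status = CVPixelBufferCreate(
        kCFAllocatorDefault,
        width,
        height,
        kCVPixelFormatType_OneComponent16Half,  // one float16 channel per "pixel"
        attributes as CFDictionary,
        &pixelBuffer
    )
    guard status == kCVReturnSuccess, let buffer = pixelBuffer else { return nil }

    // The multiarray shares the pixel buffer's memory; no copy is made here.
    return MLMultiArray(pixelBuffer: buffer, shape: shape.map { NSNumber(value: $0) })
}
```

The returned array can be filled and passed to a model (for example through an `MLDictionaryFeatureProvider`) wherever a regular `MLMultiArray(shape:dataType:)` with `.float16` would otherwise be used.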

It's worth noting that, in my testing, macOS Sonoma seems to have dramatically reduced this overhead.

Comparison from a WhisperAX debug build (whisper-large-v3, M1 Max, macOS Ventura):

[Before: screenshot]
[After: screenshot]