Copying large inputs, such as the KV cache, can add prediction latency for some device:OS combinations. On M1 Max and macOS Ventura this copying is ~25% of the prediction latency for whisper-large-v3. IOSurface-backed MLMultiArrays do not incur this copy.
Worth noting: in my testing, macOS Sonoma dramatically reduces this overhead.
Comparison from WhisperAX debug build using whisper-large-v3, M1 Max, macOS Ventura.
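A minimal sketch of creating an IOSurface-backed `MLMultiArray` via `MLMultiArray(pixelBuffer:shape:)`. That initializer requires a `kCVPixelFormatType_OneComponent16Half` pixel buffer (so the array is float16), with the last dimension of the shape as the buffer width; the helper name and shape-flattening convention here are illustrative, not from the source:

```swift
import CoreML
import CoreVideo

// Hypothetical helper: allocate an IOSurface-backed pixel buffer and wrap it
// in an MLMultiArray so CoreML can use it as input without copying.
func makeIOSurfaceBackedArray(shape: [NSNumber]) -> MLMultiArray? {
    // MLMultiArray(pixelBuffer:shape:) treats the last dimension as the
    // pixel-buffer width and the product of the rest as the height.
    guard let width = shape.last?.intValue else { return nil }
    let height = shape.dropLast().reduce(1) { $0 * $1.intValue }

    // Passing an (empty) IOSurface properties dictionary makes CoreVideo
    // allocate the buffer's backing store as an IOSurface.
    let attrs = [kCVPixelBufferIOSurfacePropertiesKey: [:] as CFDictionary] as CFDictionary
    var pixelBuffer: CVPixelBuffer?
    let status = CVPixelBufferCreate(kCFAllocatorDefault,
                                     width,
                                     height,
                                     kCVPixelFormatType_OneComponent16Half,
                                     attrs,
                                     &pixelBuffer)
    guard status == kCVReturnSuccess, let buffer = pixelBuffer else { return nil }

    // The resulting MLMultiArray shares storage with the IOSurface, so
    // feeding it to MLModel.prediction avoids the input copy described above.
    return MLMultiArray(pixelBuffer: buffer, shape: shape)
}
```

Arrays built this way can be reused across decoder steps (e.g. for a KV cache), so the cost of allocation is paid once rather than per prediction.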