Copying large inputs, such as the KV cache, can add prediction latency for some device:OS combinations. On M1 Max and macOS Ventura this copying is ~25% of the prediction latency for whisper-large-v3. IOSurface-backed MLMultiArrays do not incur this copy.
Worth noting: in my testing, macOS Sonoma dramatically reduces this overhead.
Comparison from WhisperAX debug build using whisper-large-v3, M1 Max, macOS Ventura.
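A minimal sketch of creating an IOSurface-backed `MLMultiArray` via `MLMultiArray(pixelBuffer:shape:)`. That initializer requires a `kCVPixelFormatType_OneComponent16Half` pixel buffer (so the array is float16), with the last dimension of the shape as the buffer width; the helper name and shape-flattening convention here are illustrative, not from the source:

```swift
import CoreML
import CoreVideo

// Hypothetical helper: allocate an IOSurface-backed pixel buffer and wrap it
// in an MLMultiArray so CoreML can use it as input without copying.
func makeIOSurfaceBackedArray(shape: [NSNumber]) -> MLMultiArray? {
    // MLMultiArray(pixelBuffer:shape:) treats the last dimension as the
    // pixel-buffer width and the product of the rest as the height.
    guard let width = shape.last?.intValue else { return nil }
    let height = shape.dropLast().reduce(1) { $0 * $1.intValue }

    // Passing an (empty) IOSurface properties dictionary makes CoreVideo
    // allocate the buffer's backing store as an IOSurface.
    let attrs = [kCVPixelBufferIOSurfacePropertiesKey: [:] as CFDictionary] as CFDictionary
    var pixelBuffer: CVPixelBuffer?
    let status = CVPixelBufferCreate(kCFAllocatorDefault,
                                     width,
                                     height,
                                     kCVPixelFormatType_OneComponent16Half,
                                     attrs,
                                     &pixelBuffer)
    guard status == kCVReturnSuccess, let buffer = pixelBuffer else { return nil }

    // The resulting MLMultiArray shares storage with the IOSurface, so
    // feeding it to MLModel.prediction avoids the input copy described above.
    return MLMultiArray(pixelBuffer: buffer, shape: shape)
}
```

Arrays built this way can be reused across decoder steps (e.g. for a KV cache), so the cost of allocation is paid once rather than per prediction.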