argmaxinc / WhisperKit

On-device Speech Recognition for Apple Silicon
https://takeargmax.com/blog/whisperkit
MIT License

Allow MLX and CoreML to coexist #156

Closed jkrukowski closed 3 months ago

jkrukowski commented 4 months ago

PR

This PR allows MLX and CoreML to coexist. This is done by:

Problem

There is one problem which I've been debugging for a while now. The current architecture requires converting back and forth between MLMultiArray and MLXArray. Additionally, it requires reshaping the MLXArray so that it fits the CoreML model input. There are some helper methods to do so.
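For context, one direction of the conversion can be sketched as follows. This is a minimal sketch assuming Int32 data; `toMLMultiArray` is a hypothetical stand-in for the PR's MLX-to-CoreML helper, not the actual code:

```swift
import CoreML
import MLX

// Hypothetical stand-in for an MLX -> CoreML helper: copies an MLXArray's
// contiguous, row-major contents into a freshly allocated MLMultiArray of
// the same shape. A new MLMultiArray is dense, so a straight copy works here.
func toMLMultiArray(_ array: MLXArray) throws -> MLMultiArray {
    let shape = array.shape.map { NSNumber(value: $0) }
    let multiArray = try MLMultiArray(shape: shape, dataType: .int32)
    let values = array.asArray(Int32.self)  // dense row-major element copy
    multiArray.withUnsafeMutableBufferPointer(ofType: Int32.self) { buffer, _ in
        for (i, value) in values.enumerated() {
            buffer[i] = value
        }
    }
    return multiArray
}
```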

To test the correctness of these methods, I've added a testArrayConversion test. It fails in one case: when asMLXOutput().asMLMultiArray().asMLXArray(Int32.self).asMLXInput() are chained in that particular order. It's hard to explain, because it should work correctly: first we expand and change the shape of the array, then we convert it to an MLMultiArray, then we convert it back to an MLXArray, and finally we change the shape back to the original one. The result should be the same as the original array, but it's not.
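Condensed, the failing round trip looks roughly like this (the helper names come from this PR; the shape and values are hypothetical):

```swift
import MLX
import XCTest

// Sketch of the failing case: the chained round trip should be the
// identity on `original`, but for Int32 data it was not.
func testArrayConversionRoundTrip() throws {
    let original = MLXArray(Array(Int32(0)..<6), [2, 3])
    let roundTripped = original
        .asMLXOutput()              // expand/reshape to the CoreML layout
        .asMLMultiArray()           // MLX -> CoreML
        .asMLXArray(Int32.self)     // CoreML -> MLX
        .asMLXInput()               // reshape back to the original layout
    XCTAssertEqual(original.shape, roundTripped.shape)
    XCTAssertEqual(original.asArray(Int32.self), roundTripped.asArray(Int32.self))
}
```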

This manifests itself in WhisperKit when we try to use MLXFeatureExtractor and MLXAudioEncoder together: the output is usually an empty transcription (when I use just the MLXFeatureExtractor, the transcription is correct).

I suspect there might be something wrong with the conversion from MLXArray to MLMultiArray, but I haven't found it yet.

Edit: Solution

The issue was in asMLXArray, as pointed out by @ZachNagengast: when converting to an MLXArray, we need to use the strides of the MLMultiArray.
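A minimal sketch of the fix, assuming Int32 data (the body is illustrative, not the exact PR code): instead of treating the MLMultiArray's buffer as dense and row-major, each element is read at the offset its strides dictate.

```swift
import CoreML
import MLX

extension MLMultiArray {
    // Strides-aware CoreML -> MLX conversion. An MLMultiArray's backing
    // buffer is not guaranteed to be dense, so each logical index is
    // mapped to its strided offset before reading.
    func asMLXArray(_ type: Int32.Type) -> MLXArray {
        let shape = self.shape.map { $0.intValue }
        let strides = self.strides.map { $0.intValue }
        let count = shape.reduce(1, *)

        var values = [Int32](repeating: 0, count: count)
        withUnsafeBufferPointer(ofType: Int32.self) { buffer in
            // Enumerate logical indices in row-major order; compute each
            // element's buffer offset from the MLMultiArray's strides.
            for i in 0..<count {
                var remainder = i
                var offset = 0
                for d in (0..<shape.count).reversed() {
                    offset += (remainder % shape[d]) * strides[d]
                    remainder /= shape[d]
                }
                values[i] = buffer[offset]
            }
        }
        return MLXArray(values, shape)
    }
}
```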