argmaxinc / WhisperKit

On-device Speech Recognition for Apple Silicon
http://argmaxinc.com/blog/whisperkit
MIT License

Stalling/freezing when using --model-path in CLI #208

Closed · maxlund closed this issue 2 months ago

maxlund commented 2 months ago

Hi!

When using --model-path with a pre-downloaded model, something seems to make CLI calls stall:

./whisperkit-cli transcribe --verbose --skip-special-tokens --report --report-path '/Users/maxlund/whisper-reports' --model-path '/Users/maxlund/whisperkit-models/argmaxinc/whisperkit-coreml/distil-whisper_distil-large-v3_turbo' --audio-path '/Users/maxlund/audio/arthur-fx2.wav' --language 'en'
Task: Transcribe audio at ["/Users/maxlund/audio/arthur-fx2.wav"]
Initializing models...
Models initialized in 521.27 seconds
  - Encoder load time: 519.13 seconds
  - Decoder load time: 1.69 seconds
  - Tokenizer load time: 0.34 seconds

As you can see, it took almost 9 minutes to initialize the model.

The second time we run the CLI command, however, it's fast:

./whisperkit-cli transcribe --verbose --skip-special-tokens --report --report-path '/Users/maxlund/whisper-reports' --model-path '/Users/maxlund/whisperkit-models/argmaxinc/whisperkit-coreml/distil-whisper_distil-large-v3_turbo' --audio-path '/Users/maxlund/audio/arthur-fx2.wav' --language 'en'
Task: Transcribe audio at ["/Users/maxlund/audio/arthur-fx2.wav"]
Initializing models...
Models initialized in 1.25 seconds
  - Encoder load time: 0.83 seconds
  - Decoder load time: 0.06 seconds
  - Tokenizer load time: 0.27 seconds

Files are still being created in ~/Library/Caches/whisperkit-cli, and that directory is about 1.5 GB even though I specified --model-path. I suspect this has something to do with it?

Is there anything we can pre-download, or files we can pre-populate, to make startup faster (and possible at all) in a completely offline environment?

atiorh commented 2 months ago

Hi @maxlund, this is unlikely to be related to the model being pre-downloaded. The additional file you observed is the system cache that Core ML has to generate so that the generic model is specialized to your device's chip generation and the selected compute unit. When you first load the model, the cache is generated, which takes a long time. We do have a solution coming up for this, so stay tuned. This cache cannot be pre-populated with publicly available APIs.
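For illustration, here is a minimal plain Core ML sketch (not WhisperKit's internal code; the model path is hypothetical) of the kind of load that triggers this specialization step:

```swift
import CoreML
import Foundation

// Minimal plain Core ML sketch (not WhisperKit's internal code): load a compiled
// .mlmodelc with the Neural Engine allowed. On the first load for a given chip
// generation + compute unit, Core ML specializes the model and writes the system
// cache described above, which is why it can take minutes; later loads reuse the
// cache and finish in seconds. The model path below is hypothetical.
let encoderURL = URL(fileURLWithPath: "/Users/me/models/AudioEncoder.mlmodelc")

let config = MLModelConfiguration()
config.computeUnits = .all  // CPU, GPU, and Apple Neural Engine

do {
    let start = Date()
    _ = try MLModel(contentsOf: encoderURL, configuration: config)
    print("Encoder loaded in \(Date().timeIntervalSince(start)) seconds")
} catch {
    print("Model load failed: \(error)")
}
```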

maxlund commented 2 months ago

Thank you for the clarification! Any idea how different those files are from machine to machine? Could we pre-create the files for, e.g., the distil-large-v3 model and bundle them in an application to run on macOS?

latenitefilms commented 2 months ago

Hi @atiorh ! Following on from Max's questions...

> The additional file you observed is the system cache that Core ML has to generate so that the generic model is specialized to your device's chip generation and the selected compute unit. When you first load the model, the cache is generated, which takes a long time.

Are you able to give some more information/insight into this? Does this process happen faster on some machines than others? Does it only happen on macOS 15 and above?

> We do have a solution coming up for this, so stay tuned.

Are there any WIP branches I could play with? Is there a rough ETA for when this could be improved (weeks or months)?

> This cache cannot be pre-populated with publicly available APIs.

Is this because it's proprietary Core ML magic happening under the macOS hood?

Thanks heaps for your time!

latenitefilms commented 2 months ago

...and a dumb question: whilst the Core ML cache is being generated, would it be possible to do WhisperKit processing on the CPU & GPU until the cache is ready, and then start using the Apple Neural Engine/NPU?
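Something like this, in plain Core ML terms (a sketch only; the compute-unit switch and background warm-up are my assumptions, not an existing WhisperKit API, and the path is hypothetical):

```swift
import CoreML
import Foundation

// Sketch of the fallback idea in plain Core ML terms (an assumption, not an
// existing WhisperKit API; the model path is hypothetical): start with a
// CPU+GPU load, which skips the slow ANE specialization, then warm the
// ANE-enabled variant in the background so later loads hit the cache.
let modelURL = URL(fileURLWithPath: "/path/to/AudioEncoder.mlmodelc")

let cpuGpuConfig = MLModelConfiguration()
cpuGpuConfig.computeUnits = .cpuAndGPU   // no Neural Engine, no specialization wait

let aneConfig = MLModelConfiguration()
aneConfig.computeUnits = .all            // includes the Neural Engine

do {
    // Usable right away for transcription.
    let interimModel = try MLModel(contentsOf: modelURL, configuration: cpuGpuConfig)
    print("CPU+GPU model ready: \(interimModel.modelDescription)")

    // Generate the ANE specialization cache off the critical path; once this
    // completes, reloading with .all should be fast.
    DispatchQueue.global(qos: .utility).async {
        let start = Date()
        _ = try? MLModel(contentsOf: modelURL, configuration: aneConfig)
        print("ANE-specialized load finished in \(Date().timeIntervalSince(start)) seconds")
    }
} catch {
    print("CPU+GPU load failed: \(error)")
}
```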

latenitefilms commented 2 months ago

Playing around with @ZachNagengast's branches, it looks like MLX might solve this?

maxlund commented 1 month ago

@atiorh I am unfortunately experiencing issues where this delay happens more than once for the same model, even when it is located at the same path as before. After a reboot, we see ANECompilerService taking up a lot of resources and the model taking ~9 minutes to produce a transcription result. Any guidance on possible ways to mitigate this would be much appreciated!
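For reference, here is a small diagnostic sketch (not WhisperKit code) I can run to check whether the cache in ~/Library/Caches/whisperkit-cli mentioned above survives a reboot:

```swift
import Foundation

// Diagnostic sketch (not WhisperKit code): measure the size of the
// ~/Library/Caches/whisperkit-cli directory mentioned earlier. If it shrinks
// back to near zero after a reboot, the Core ML specialization cache was
// evicted, which would explain ANECompilerService re-running and the ~9 minute
// load coming back.
let cacheDir = FileManager.default.homeDirectoryForCurrentUser
    .appendingPathComponent("Library/Caches/whisperkit-cli")

var totalBytes: Int64 = 0
if let enumerator = FileManager.default.enumerator(at: cacheDir,
                                                   includingPropertiesForKeys: [.fileSizeKey]) {
    for case let fileURL as URL in enumerator {
        let size = (try? fileURL.resourceValues(forKeys: [.fileSizeKey]))?.fileSize ?? 0
        totalBytes += Int64(size)
    }
}
print("whisperkit-cli cache size: \(totalBytes / 1_048_576) MB")
```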

ZachNagengast commented 4 weeks ago

MLX definitely helps with this, as long as you can spare the memory to load both models at the same time. We also rolled out some experimental changes in the latest TestFlight build; try it out and see how the speeds compare. It's enabled in the experimental section of the settings.