argmaxinc / WhisperKit

On-device Speech Recognition for Apple Silicon
http://argmaxinc.com/blog/whisperkit
MIT License

Properly init WhisperKit for per-device model management #254

Open Alonnasi opened 2 weeks ago

Alonnasi commented 2 weeks ago

Hello everybody 😇

Hoping I will get some guidance here 🙏

I'm trying to handle the models in terms of downloading, storage, and device/model management, and I have some issues/questions that I hope will help me & others understand this awesome Kit better and make the most of it.

  1. To begin the model download as early as possible and cut down the waiting time, I'm performing the init in the AppDelegate and storing the Kit in my manager class's scope (screenshot and a rough sketch of the pattern below):

[Screenshot 2024-11-05 at 20 13 54]
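Roughly, the pattern is the following (a simplified sketch, not the exact code; class and property names are illustrative, and the initializer parameters are the ones I understand from the public API):

```swift
import WhisperKit

// Simplified sketch: a shared manager that owns the WhisperKit instance.
final class WhisperManager {
    static let shared = WhisperManager()
    private(set) var whisperKit: WhisperKit?

    // Kicked off from the AppDelegate so the model download/compile
    // starts as early as possible.
    func prepare(model: String = "base") {
        Task {
            do {
                // `model:` and `download:` are assumed from the public initializer;
                // adjust for the WhisperKit version in use.
                self.whisperKit = try await WhisperKit(model: model, download: true)
            } catch {
                print("WhisperKit init failed: \(error)")
            }
        }
    }
}
```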

But I'm hitting an error from the Kit's logger right after the model finishes downloading. The error comes and goes and isn't consistent; it can appear when initializing the same model that previously worked, as well as different ones:

[Screenshot 2024-11-05 at 20 31 52]

What am I doing wrong?

  2. In cases where the Kit has initialized with no errors, when trying to transcribe a simple audio file I'm getting dozens of ">>" segment results. Am I abusing the system? Am I using the wrong model? I'm adding a screenshot of the transcribe func (with a sketch of it below):

[Screenshot 2024-11-05 at 20 39 16]
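The transcribe func is roughly along these lines (a simplified sketch; `transcribe(audioPath:)` returning an array of results with `segments` is assumed from the public API):

```swift
import Foundation
import WhisperKit

// Simplified sketch: transcribe a local audio file and join the segment texts.
func transcribe(fileURL: URL, with whisperKit: WhisperKit) async throws -> String {
    // Adjust the return type handling for your WhisperKit version.
    let results = try await whisperKit.transcribe(audioPath: fileURL.path)
    return results.flatMap { $0.segments }.map { $0.text }.joined(separator: " ")
}
```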

  3. Final question, if I may: is there a way for me to bundle the model files straight into the app project and init them locally, with no need to download any model? And is there a model that can calculate word timestamps AND run on all devices?

Thanks so much for any help 🙌

atiorh commented 2 weeks ago

Hi @Alonnasi!

> In cases where the Kit has initialized with no errors, when trying to transcribe a simple audio file I'm getting dozens of ">>" segment results. Am I abusing the system? Am I using the wrong model?

It depends on the file (feel free to share a link), but dozens of transcription segments is not out of the ordinary. You can always cross-reference the results from our TestFlight app with what you are observing in your project as a sanity check.
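One quick way to sanity-check is to print each segment with its time range rather than only the joined text, for example (a rough sketch; field names are taken from the public API and may differ slightly across versions):

```swift
import WhisperKit

// Rough sketch: dump each segment with its time range to compare against
// the TestFlight app's output.
func dumpSegments(_ results: [TranscriptionResult]) {
    for result in results {
        for segment in result.segments {
            print("[\(segment.start) -> \(segment.end)] \(segment.text)")
        }
    }
}
```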

> Final question, if I may: is there a way for me to bundle the model files straight into the app project and init them locally, with no need to download any model? And is there a model that can calculate word timestamps AND run on all devices?

You can always pre-download and bundle the models, but your app's download size will bloat, so the trade-off is yours to make. Word timestamps are supported on all models. The tiny and base variants are supported on all Apple Silicon Macs plus iPhone XS and newer. Please use the GPU for the AudioEncoder model on iPhone XS, XR, and 11. We will make these presets available soon so you shouldn't have to set device-specific defaults on your side.
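As a rough sketch of the bundling + word-timestamps approach (the bundled folder name and the exact initializer/option parameters here are illustrative, so adjust for your setup and WhisperKit version):

```swift
import Foundation
import WhisperKit

// Rough sketch: load models that ship inside the app bundle instead of downloading.
func loadBundledWhisperKit() async throws -> WhisperKit {
    // "openai_whisper-base" is an illustrative folder name added to the app target.
    guard let modelFolder = Bundle.main.path(forResource: "openai_whisper-base", ofType: nil) else {
        throw NSError(domain: "WhisperKitSetup", code: 1,
                      userInfo: [NSLocalizedDescriptionKey: "Bundled model folder not found"])
    }
    // `modelFolder:` and `download:` are assumed from the public initializer.
    return try await WhisperKit(modelFolder: modelFolder, download: false)
}

// Rough sketch: request word-level timestamps at transcription time.
func transcribeWithWordTimestamps(_ whisperKit: WhisperKit, audioPath: String) async throws {
    let options = DecodingOptions(wordTimestamps: true)
    let results = try await whisperKit.transcribe(audioPath: audioPath, decodeOptions: options)
    for word in results.flatMap({ $0.segments }).flatMap({ $0.words ?? [] }) {
        print("\(word.word): \(word.start) -> \(word.end)")
    }
}
```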

Alonnasi commented 2 weeks ago

Thank you so much for the quick response! 😇

I've managed to run tests on various devices, trying almost every model on each device (to compare results):

I didn't find ANY model that runs and works on iPhone 11 / XS / XR. I've tried base, tiny, small, large-v2, large-v3 & turbo.

Can you please elaborate on the AudioEncoder usage? Will it help with running transcribe tasks on iPhone 11 / XS / XR?

Thank you again 🙌

atiorh commented 2 weeks ago

> I didn't find ANY model that runs and works on iPhone 11 / XS / XR. I've tried base, tiny, small, large-v2, large-v3 & turbo.

> Can you please elaborate on the AudioEncoder usage? Will it help with running transcribe tasks on iPhone 11 / XS / XR?

For WhisperKit's computeOptions, you will need to set audioEncoderCompute to cpuAndGPU.
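Roughly (a minimal sketch; parameter names are from the public API and may differ slightly by version):

```swift
import CoreML
import WhisperKit

// Minimal sketch: run the AudioEncoder on CPU+GPU for iPhone XS / XR / 11,
// leaving the other stages on their defaults.
func makeWhisperKitForOlderiPhones() async throws -> WhisperKit {
    let computeOptions = ModelComputeOptions(audioEncoderCompute: .cpuAndGPU)
    return try await WhisperKit(model: "base", computeOptions: computeOptions)
}
```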