argmaxinc / WhisperKit

On-device Speech Recognition for Apple Silicon
https://takeargmax.com/blog/whisperkit
MIT License
3.1k stars 256 forks

Not working on visionOS 2 #185

Open florentmorin opened 1 month ago

florentmorin commented 1 month ago

Hello,

I’ve encountered an issue where WhisperKit, which works perfectly on visionOS 1, no longer functions correctly on visionOS 2 beta 3 when running as native code.

Here’s a snippet of the code I’m using:

let whisperKit = try await WhisperKit(load: true)
// ...
let results = try await whisperKit.transcribe(
    audioPath: audioPath, 
    decodeOptions: .init(language: language), 
    callback: callback
)

I have attempted the following troubleshooting steps:

Despite these efforts, the issue persists.

🙏 Thanks for your help.

atiorh commented 1 month ago

Hello! What is the particular issue?

florentmorin commented 1 month ago

@atiorh

Transcribing ted_60.m4a and inspecting the result in the debug console gives "I am a very kind of a a a a a ..."

Console output ``` ▿ 1 element ▿ 0 : TranscriptionResult - text : "I am a very kind of a a a a a a a a a a a b a a a b e a e b e a b e a a b a a a f e d b d a a d f a b a a d b d f a a b a b a a d b a a a a a b b f a a a b b b a b b a a b a d b b b b b b b b b b f b b b b b b b b b b c b b b b b b b b b b b b b b b b b b b b b b c f b b b b b b b b b b b b b b b b b b b b b b b b b b b b b b b b b b b b b b b b b b b b g e g g g g g g g g g g g g g g g g g g g g g g g g g g g g g g g g g g g g g g g g g g g >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >>" ▿ segments : 2 elements ▿ 0 : TranscriptionSegment - id : 0 - seek : 0 - start : 0.0 - end : 30.0 - text : "<|startoftranscript|><|en|><|transcribe|><|0.00|> I am a very kind of a a a a a a a a a a a b a a a b e a e b e a b e a a b a a a f e d b d a a d f a b a a d b d f a a b a b a a d b a a a a a b b f a a a b b b a b b a a b a d b b b b b b b b b b f b b b b b b b b b b c b b b b b b b b b b b b b b b b b b b b b b c f b b b b b b b b b b b b b b b b b b b b b b b b b b b b b b b b b b b b b b b b b b b b g e g g g g g g g g g g g g g g g g g g g g g g g g g g g g g g g g g g g g g g g g g g g<|endoftext|>" ▿ tokens : 224 elements - 0 : 50258 - 1 : 50259 - 2 : 50359 - 3 : 50364 - 4 : 286 - 5 : 669 - 6 : 257 - 7 : 588 [...] 
- 221 : 290 - 222 : 290 - 223 : 50257 ▿ tokenLogProbs : 224 elements ▿ 0 : 1 element ▿ 0 : 2 elements - key : 50258 - value : 0.0 ▿ 1 : 1 element ▿ 0 : 2 elements - key : 50259 - value : 0.0 ▿ 2 : 1 element ▿ 0 : 2 elements - key : 50359 - value : 0.0 ▿ 3 : 1 element ▿ 0 : 2 elements - key : 50364 - value : 0.0 ▿ 4 : 1 element ▿ 0 : 2 elements - key : 286 - value : -3.8560698 ▿ 5 : 1 element ▿ 0 : 2 elements - key : 669 - value : -1.9917988 ▿ 6 : 1 element ▿ 0 : 2 elements - key : 257 - value : -2.342525 ▿ 7 : 1 element ▿ 0 : 2 elements - key : 588 - value : -3.8727763 ▿ 8 : 1 element ▿ 0 : 2 elements - key : 733 - value : -4.0190206 ▿ 9 : 1 element ▿ 0 : 2 elements - key : 295 - value : -0.53651595 [...] ▿ 222 : 1 element ▿ 0 : 2 elements - key : 290 - value : -0.005539653 ▿ 223 : 1 element ▿ 0 : 2 elements - key : 50257 - value : 0.0 - temperature : 1.0 - avgLogprob : -0.7730474 - compressionRatio : 8.504854 - noSpeechProb : 0.0 - words : nil ▿ 1 : TranscriptionSegment - id : 1 - seek : 480000 - start : 30.0 - end : 60.0 - text : "<|startoftranscript|><|en|><|transcribe|><|0.00|> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >><|endoftext|>" ▿ tokens : 224 elements - 0 : 50258 - 1 : 50259 - 2 : 50359 - 3 : 50364 - 4 : 902 - 5 : 902 - 6 : 902 - 7 : 902 [...] 
- 222 : 902 - 223 : 50257 ▿ tokenLogProbs : 224 elements ▿ 0 : 1 element ▿ 0 : 2 elements - key : 50258 - value : 0.0 ▿ 1 : 1 element ▿ 0 : 2 elements - key : 50259 - value : 0.0 ▿ 2 : 1 element ▿ 0 : 2 elements - key : 50359 - value : 0.0 ▿ 3 : 1 element ▿ 0 : 2 elements - key : 50364 - value : 0.0 ▿ 4 : 1 element ▿ 0 : 2 elements - key : 902 - value : -3.266967 ▿ 5 : 1 element ▿ 0 : 2 elements - key : 902 - value : -1.445334 ▿ 6 : 1 element ▿ 0 : 2 elements - key : 902 - value : -0.5025386 [...] ▿ 221 : 1 element ▿ 0 : 2 elements - key : 902 - value : -0.010405066 ▿ 222 : 1 element ▿ 0 : 2 elements - key : 902 - value : -0.010017514 ▿ 223 : 1 element ▿ 0 : 2 elements - key : 50257 - value : 0.0 - temperature : 1.0 - avgLogprob : -0.037366472 - compressionRatio : 62.57143 - noSpeechProb : 0.0 - words : nil - language : "en" ▿ timings : TranscriptionTimings - pipelineStart : 742328429.696212 - firstTokenTime : 742328429.96273 - inputAudioSeconds : 60.0 - modelLoading : 1.0579739809036255 - audioLoading : 0.028718948364257812 - audioProcessing : 0.0010149478912353516 - logmels : 0.07123291492462158 - encoding : 0.35921692848205566 - prefill : 2.5987625122070312e-05 - decodingInit : 0.005879998207092285 - decodingLoop : 32.84668803215027 - decodingPredictions : 24.473331809043884 - decodingFiltering : 0.022301673889160156 - decodingSampling : 1.4053410291671753 - decodingFallback : 32.41148316860199 - decodingWindowing : 0.0008039474487304688 - decodingKvCaching : 0.990674614906311 - decodingWordTimestamps : 0.0 - decodingNonPrediction : 7.911233186721802 - totalAudioProcessingRuns : 2.0 - totalLogmelRuns : 2.0 - totalEncodingRuns : 2.0 - totalDecodingLoops : 2423.0 - totalKVUpdateRuns : 2411.0 - totalTimestampAlignmentRuns : 0.0 - totalDecodingFallbacks : 5.0 - totalDecodingWindows : 2.0 - fullPipeline : 32.85519599914551 - seekTime : nil ```
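
For anyone who wants to catch this kind of output programmatically rather than by eye, here is a minimal sketch of a repetition check on a segment's token IDs. It is a plain-Swift heuristic, not part of WhisperKit's API: the names `dominantTokenShare` and `looksDegenerate` and the 0.5 threshold are made up for illustration. The sample array is modeled on the second segment in the log, assuming the elided entries are also token 902.

```swift
/// Fraction of `tokens` taken up by the single most frequent token ID.
/// Returns 0 for an empty array.
func dominantTokenShare(_ tokens: [Int]) -> Double {
    guard !tokens.isEmpty else { return 0 }
    var counts: [Int: Int] = [:]
    for token in tokens {
        counts[token, default: 0] += 1
    }
    let maxCount = counts.values.max() ?? 0
    return Double(maxCount) / Double(tokens.count)
}

/// Heuristic (hypothetical threshold): flag a segment when one token
/// makes up more than `threshold` of all its tokens.
func looksDegenerate(_ tokens: [Int], threshold: Double = 0.5) -> Bool {
    return dominantTokenShare(tokens) > threshold
}

// Modeled on segment 1 above: four prompt tokens, a long run of 902,
// then the end-of-text token. The run length of 219 is an assumption
// based on the 224-element count; the log elides most entries.
let segment = [50258, 50259, 50359, 50364]
    + Array(repeating: 902, count: 219)
    + [50257]

print(looksDegenerate(segment)) // prints "true"
```

A check like this only flags the symptom. It says nothing about the cause, which is what the rest of the thread is trying to pin down.
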
ZachNagengast commented 3 weeks ago

@florentmorin This looks similar to CPU-only output. I believe that since visionOS doesn't let us use the ANE (Apple Neural Engine), the model is running on the CPU only in this case. Could you try specifying the GPU like this:

let computeOptions = ModelComputeOptions(audioEncoderCompute: .cpuAndGPU, textDecoderCompute: .cpuAndGPU)
let whisperKit = try await WhisperKit(computeOptions: computeOptions, load: true)
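
If the same code also ships on other platforms, one way to apply this override only on visionOS is a compile-time check. This is a minimal sketch, assuming the two initializers already used in this thread (`WhisperKit(load:)` and `WhisperKit(computeOptions:load:)`) exist in your WhisperKit version. `makeWhisperKit` is a hypothetical helper name, and whether the GPU setting actually resolves the issue is not confirmed here.

```swift
import WhisperKit

/// Hypothetical helper: builds a WhisperKit instance, requesting CPU+GPU
/// compute on visionOS and leaving the library defaults elsewhere.
func makeWhisperKit() async throws -> WhisperKit {
    #if os(visionOS)
    // Suggested workaround from this thread: ask for the GPU explicitly.
    let computeOptions = ModelComputeOptions(
        audioEncoderCompute: .cpuAndGPU,
        textDecoderCompute: .cpuAndGPU
    )
    return try await WhisperKit(computeOptions: computeOptions, load: true)
    #else
    // Other platforms keep whatever WhisperKit picks by default.
    return try await WhisperKit(load: true)
    #endif
}
```

The call site then stays the same on every platform: `let whisperKit = try await makeWhisperKit()`.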