Closed by VProv 6 months ago
Hi @VProv, I tried reproducing your issue using our Python wrapper as follows:
from whisperkit.pipelines import WhisperKit
pipe = WhisperKit(code_commit_hash="c770b54", whisper_version="openai/whisper-base", out_dir="external")
pipe.transcribe("jfk.wav")
which produced the correct result. This makes me think it is the simulator that is causing issues.
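One way to make "correct result" checkable rather than eyeballed is to score the transcript against a reference with word error rate. This is a minimal sketch, not part of the wrapper: `word_error_rate` is a hypothetical helper, the reference string is the "small" output reported elsewhere in this issue, and the hypothesis is a short excerpt of the reported "base" output.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # prev[j] is the edit distance between the processed reference prefix and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / max(len(ref), 1)


# Reference: the "small" transcript reported in this issue.
REFERENCE = ("And so my fellow Americans, ask not what your country can do "
             "for you. Ask what you can do for your country.")
# Hypothesis: the first two sentences of the reported "base" output.
GARBAGE = "We have to say that that is. We have had a lot of things with you."

print("identical:", word_error_rate(REFERENCE, REFERENCE))
print("base excerpt:", word_error_rate(REFERENCE, GARBAGE))
```

A score near zero means the run reproduced the expected transcript; a score near or above one means almost every word is wrong.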
Re: "I use an iOS simulator for iPhone 15 on Mac with M2"
A Mac with M2 is fine.
We use the iOS simulator in our unit tests as well. What is the iOS version you are using?
@VProv Please let us know if you are still hitting this issue and what the simulator iOS version is. Otherwise, I will close this issue in a few days. Thanks!
Hey! I use iOS Deployment Target 17.2. I switched to the "small" model for now.
Thanks for letting us know. If the issue surfaces again, please reopen it :)
@atiorh This issue occurs when the compute units for both the Audio Encoder and the Text Decoder are set to CPU, which is always the case in the iOS simulator. whisper-base and whisper-base.en keep generating useless, repeated output in both Transcribe and Stream modes. whisper-tiny and whisper-small also fail in Stream mode.
@VProv hit the issue with whisper-base only because, in the iOS simulator, every compute unit is forced to CPU:
if WhisperKit.isRunningOnSimulator {
    self.melCompute = .cpuOnly
    self.audioEncoderCompute = .cpuOnly
    self.textDecoderCompute = .cpuOnly
    self.prefillCompute = .cpuOnly
    return
}
This behavior also appears in the TestFlight app on my iPhone when both compute units are selected as CPU.
Hey!
I ran into a problem with the "base" model: it outputs garbage. If I switch the model to "small", transcription works fine. I am using the jfk.wav file from your tests.
Here is my code:
print("ENTER transcribeFile")
let whisper = try await WhisperKit(model: "small", verbose: true, logLevel: .debug)
guard let url = Bundle.main.url(forResource: "jfk", withExtension: "wav") else {
    print("Failed to locate file in bundle.")
    return
}
// Transcribe the audio file
print("Path to audio", url.path)
let transcriptionResult: [TranscriptionResult] = try await whisper.transcribe(audioPath: url.path)
print("Result: \(transcriptionResult)")
I use the iPhone 15 iOS simulator on a Mac with M2, with version c770b54 of the library.
Output with "small": [WhisperKit.TranscriptionResult(text: "And so my fellow Americans, ask not what your country can do for you. Ask what you can do for your country.", segments: [WhisperKit.TranscriptionSegment(id: 0, seek: 0, start: 0.0, end: 11.0, text: "<|startoftranscript|><|en|><|transcribe|><|0.00|> And so my fellow Americans, ask not what your country can do for you. Ask what you can do for your country.<|11.00|><|endoftext|>", tokens: [50258, 50259, 50359, 50364, 400, 370, 452, 7177, 6280, 11, 1029, 406, 437, 428, 1941, 393, 360, 337, 291, 13, 12320, 437, 291, 393, 360, 337, 428, 1941, 13, 50914, 50257], tokenLogProbs: [[50258: 0.0], [50259: 0.0], [50359: 0.0], [50364: 0.0], [400: -0.23800875], [370: -6.903542], [452: -0.80536205], [7177: -1.7742634], [6280: -1.7742634], [11: -0.047607817], [1029: -0.047607817], [406: -0.32398117], [437: -0.32398117], [428: -0.010956124], [1941: -0.010956124], [393: -0.11716792], [360: -0.11716792], [337: -0.5108057], [291: -0.5108057], [13: -0.44561076], [12320: -0.44561076], [437: -0.3057709], [291: -0.3057709], [393: -0.12305092], [360: -0.12305092], [337: -0.067418754], [428: -0.067418754], [1941: -0.031324066], [13: -0.031324066], [50914: -0.041958194], [50257: -0.041958194]], temperature: 0.0, avgLogprob: -0.5015079, compressionRatio: 1.5625, noSpeechProb: 0.0, words: nil)], language: "en", timings: WhisperKit.TranscriptionTimings(pipelineStart: 736158477.070447, firstTokenTime: 736158478.576726, inputAudioSeconds: 11.0, modelLoading: 5.294854044914246, audioLoading: 0.0036209821701049805, audioProcessing: 0.0002830028533935547, logmels: 0.009383916854858398, encoding: 1.3972361087799072, prefill: 1.0013580322265625e-05, decodingInit: 0.012336015701293945, decodingLoop: 4.044028043746948, decodingPredictions: 2.507355213165283, decodingFiltering: 0.0004220008850097656, decodingSampling: 0.022192716598510742, decodingFallback: 0.0, decodingWindowing: 0.0015540122985839844, decodingKvCaching: 0.03674197196960449, 
decodingWordTimestamps: 0.0, decodingNonPrediction: 0.11589288711547852, totalAudioProcessingRuns: 1.0, totalLogmelRuns: 1.0, totalEncodingRuns: 1.0, totalDecodingLoops: 29.0, totalKVUpdateRuns: 29.0, totalTimestampAlignmentRuns: 0.0, totalDecodingFallbacks: 0.0, totalDecodingWindows: 1.0, fullPipeline: 4.056915044784546))]
Output with "base": seemingly random text. I don't know what the source of the problem is.
Result: [WhisperKit.TranscriptionResult(text: "We have to say that that is. We have had a lot of things with you. I know what we have to say. What we are. We can you have had to say. We are. But what we are. But I think you have to say. We have had to say. You had. To say. We are. We have had to say. To say. We have had to say. It say. I am. You I think we are. It. To say I think you. To say. We have had to. To say. We are. We have had to. I. I have. You have had to say. You have had to. To say. To do you have had to. To. To. I think you I have a lot. You I think. To say. It. It say you. It. It. To do You have had. We had. You I am. I think. You have I. That I am. You I have a lot. I think. We I am. That You have had", segments: [WhisperKit.TranscriptionSegment(id: 0, seek: 0, start: 0.0, end: 11.0, text: "<|startoftranscript|><|en|><|transcribe|><|0.00|> We have to say that that is. We have had a lot of things with you. I know what we have to say. What we are. We can you have had to say. We are. But what we are. But I think you have to say. We have had to say. You had. To say. We are. We have had to say. To say. We have had to say. It say. I am. You I think we are. It. To say I think you. To say. We have had to. To say. We are. We have had to. I. I have. You have had to say. You have had to. To say. To do you have had to. To. To. I think you I have a lot. You I think. To say. It. It say you. It. It. To do You have had. We had. You I am. I think. You have I. That I am. You I have a lot. I think. We I am. 
That You have had<|endoftext|>", tokens: [50258, 50259, 50359, 50364, 492, 362, 281, 584, 300, 300, 307, 13, 492, 362, 632, 257, 688, 295, 721, 365, 291, 13, 286, 458, 437, 321, 362, 281, 584, 13, 708, 321, 366, 13, 492, 393, 291, 362, 632, 281, 584, 13, 492, 366, 13, 583, 437, 321, 366, 13, 583, 286, 519, 291, 362, 281, 584, 13, 492, 362, 632, 281, 584, 13, 509, 632, 13, 1407, 584, 13, 492, 366, 13, 492, 362, 632, 281, 584, 13, 1407, 584, 13, 492, 362, 632, 281, 584, 13, 467, 584, 13, 286, 669, 13, 509, 286, 519, 321, 366, 13, 467, 13, 1407, 584, 286, 519, 291, 13, 1407, 584, 13, 492, 362, 632, 281, 13, 1407, 584, 13, 492, 366, 13, 492, 362, 632, 281, 13, 286, 13, 286, 362, 13, 509, 362, 632, 281, 584, 13, 509, 362, 632, 281, 13, 1407, 584, 13, 1407, 360, 291, 362, 632, 281, 13, 1407, 13, 1407, 13, 286, 519, 291, 286, 362, 257, 688, 13, 509, 286, 519, 13, 1407, 584, 13, 467, 13, 467, 584, 291, 13, 467, 13, 467, 13, 1407, 360, 509, 362, 632, 13, 492, 632, 13, 509, 286, 669, 13, 286, 519, 13, 509, 362, 286, 13, 663, 286, 669, 13, 509, 286, 362, 257, 688, 13, 286, 519, 13, 492, 286, 669, 13, 663, 509, 362, 632, 50257], tokenLogProbs: [[50258: 0.0], [50259: 0.0], [50359: 0.0], [50364: 0.0], [492: -0.46854308], [362: -4.649173e-06], [281: -3.9211664], [584: -2.8411748], [300: -2.8411748], [300: -2.3382592], [307: -2.3382592], [13: -1.2593474], [492: -1.2593474], [362: -2.2327487], [632: -2.2327487], [257: -2.8759716], [688: -2.8759716], [295: -2.248466], [721: -2.248466], [365: -2.7832522], [291: -2.7832522], [13: -2.7364671], [286: -2.7364671], [458: -2.3066056], [437: -2.3066056], [321: -1.2334995], [362: -1.2334995], [281: -2.8644958], [584: -2.8644958], [13: -2.7988439], [708: -2.7988439], [321: -1.6016155], [366: -1.6016155], [13: -0.24205148], [492: -0.24205148], [393: -2.3324401], [291: -2.3324401], [362: -3.216308], [632: -3.216308], [281: -2.767682], [584: -2.767682], [13: -2.2825544], [492: -2.2825544], [366: -3.2878008], [13: -3.2878008], [583: -3.4753323], 
[437: -3.4753323], [321: -2.2468977], [366: -2.2468977], [13: -2.8896592], [583: -2.8896592], [286: -1.9014113], [519: -1.9014113], [291: -1.7707164], [362: -1.7707164], [281: -1.5944285], [584: -1.5944285], [13: -2.0938516], [492: -2.0938516], [362: -3.3726246], [632: -3.3726246], [281: -2.470844], [584: -2.470844], [13: -2.8277507], [509: -2.8277507], [632: -2.6681852], [13: -2.6681852], [1407: -1.5980114], [584: -1.5980114], [13: -3.6893868], [492: -3.6893868], [366: -2.8718534], [13: -2.8718534], [492: -2.9086547], [362: -2.9086547], [632: -2.1391528], [281: -2.1391528], [584: -1.335645], [13: -1.335645], [1407: -1.0163437], [584: -1.0163437], [13: -0.960848], [492: -0.960848], [362: -1.6773031], [632: -1.6773031], [281: -2.3030858], [584: -2.3030858], [13: -1.9672419], [467: -1.9672419], [584: -3.4776788], [13: -3.4776788], [286: -3.0334747], [669: -3.0334747], [13: -1.7947521], [509: -1.7947521], [286: -1.5359652], [519: -1.5359652], [321: -1.4024606], [366: -1.4024606], [13: -3.08361], [467: -3.08361], [13: -2.7582512], [1407: -2.7582512], [584: -1.2870092], [286: -1.2870092], [519: -2.5009592], [291: -2.5009592], [13: -2.0639248], [1407: -2.0639248], [584: -1.4796957], [13: -1.4796957], [492: -1.2754928], [362: -1.2754928], [632: -0.95180154], [281: -0.95180154], [13: -1.7178613], [1407: -1.7178613], [584: -1.5293169], [13: -1.5293169], [492: -1.0799311], [366: -1.0799311], [13: -0.4571549], [492: -0.4571549], [362: -0.81585073], [632: -0.81585073], [281: -0.61659545], [13: -0.61659545], [286: -3.0646272], [13: -3.0646272], [286: -3.2575703], [362: -3.2575703], [13: -3.6666787], [509: -3.6666787], [362: -3.2766993], [632: -3.2766993], [281: -1.0730554], [584: -1.0730554], [13: -0.9041693], [509: -0.9041693], [362: -1.7604982], [632: -1.7604982], [281: -2.442537], [13: -2.442537], [1407: -1.1728677], [584: -1.1728677], [13: -1.7017894], [1407: -1.7017894], [360: -1.3470659], [291: -1.3470659], [362: -0.6904316], [632: -0.6904316], [281: -0.2578569], [13: 
-0.2578569], [1407: -0.90630615], [13: -0.90630615], [1407: -0.588514], [13: -0.588514], [286: -3.0094783], [519: -3.0094783], [291: -0.87447596], [286: -0.87447596], [362: -0.76001346], [257: -0.76001346], [688: -1.6966609], [13: -1.6966609], [509: -1.3823335], [286: -1.3823335], [519: -0.49830183], [13: -0.49830183], [1407: -0.29699734], [584: -0.29699734], [13: -0.82159287], [467: -0.82159287], [13: -0.55060595], [467: -0.55060595], [584: -3.6784413], [291: -3.6784413], [13: -2.5708647], [467: -2.5708647], [13: -0.7152253], [467: -0.7152253], [13: -3.4809964], [1407: -3.4809964], [360: -3.357663], [509: -3.357663], [362: -1.7842449], [632: -1.7842449], [13: -3.1116853], [492: -3.1116853], [632: -3.2315176], [13: -3.2315176], [509: -1.6467949], [286: -1.6467949], [669: -2.516142], [13: -2.516142], [286: -2.322375], [519: -2.322375], [13: -1.2451215], [509: -1.2451215], [362: -3.4987159], [286: -3.4987159], [13: -2.2404675], [663: -2.2404675], [286: -2.9150012], [669: -2.9150012], [13: -1.307585], [509: -1.307585], [286: -2.7741168], [362: -2.7741168], [257: -2.1526089], [688: -2.1526089], [13: -2.386625], [286: -2.386625], [519: -2.757761], [13: -2.757761], [492: -3.302908], [286: -3.302908], [669: -1.1213849], [13: -1.1213849], [663: -1.0268766], [509: -1.0268766], [362: -2.0226188], [632: -2.0226188], [50257: -2.0808136]], temperature: 1.0, avgLogprob: -1.9937032, compressionRatio: 4.0744185, noSpeechProb: 0.0, words: nil)], language: "en", timings: WhisperKit.TranscriptionTimings(pipelineStart: 736158939.772971, firstTokenTime: 736158940.323336, inputAudioSeconds: 11.0, modelLoading: 1.680719017982483, audioLoading: 0.0016698837280273438, audioProcessing: 0.0003139972686767578, logmels: 0.01194608211517334, encoding: 0.49132096767425537, prefill: 1.3947486877441406e-05, decodingInit: 0.006914973258972168, decodingLoop: 22.78679096698761, decodingPredictions: 19.62394654750824, decodingFiltering: 0.006839156150817871, decodingSampling: 0.46537697315216064, 
decodingFallback: 22.280248880386353, decodingWindowing: 0.0013300180435180664, decodingKvCaching: 0.3391835689544678, decodingWordTimestamps: 0.0, decodingNonPrediction: 2.4722235202789307, totalAudioProcessingRuns: 1.0, totalLogmelRuns: 1.0, totalEncodingRuns: 1.0, totalDecodingLoops: 762.0, totalKVUpdateRuns: 762.0, totalTimestampAlignmentRuns: 0.0, totalDecodingFallbacks: 5.0, totalDecodingWindows: 1.0, fullPipeline: 22.793906927108765))]
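The two results above differ sharply in the fields that Whisper-style decoders use to detect a failed decode. The sketch below applies those checks to the reported numbers. It assumes the default thresholds of the reference OpenAI Whisper implementation (a compression ratio above 2.4 or an average log-probability below -1.0 triggers a temperature fallback); WhisperKit's configured values are not shown in this thread and may differ. The function names are hypothetical.

```python
import zlib

# Assumed thresholds: defaults of the reference OpenAI Whisper implementation.
# WhisperKit's configured values are not shown in this thread.
COMPRESSION_RATIO_THRESHOLD = 2.4
LOGPROB_THRESHOLD = -1.0


def compression_ratio(text: str) -> float:
    """Raw UTF-8 size divided by zlib-compressed size; repetitive text scores high."""
    raw = text.encode("utf-8")
    return len(raw) / len(zlib.compress(raw))


def looks_degenerate(ratio: float, avg_logprob: float) -> bool:
    """True if either heuristic flags the decode as a likely failure."""
    return ratio > COMPRESSION_RATIO_THRESHOLD or avg_logprob < LOGPROB_THRESHOLD


# Values copied from the two results reported above.
small_run = {"compressionRatio": 1.5625, "avgLogprob": -0.5015079}
base_run = {"compressionRatio": 4.0744185, "avgLogprob": -1.9937032}

for name, run in (("small", small_run), ("base", base_run)):
    flagged = looks_degenerate(run["compressionRatio"], run["avgLogprob"])
    print(name, "degenerate" if flagged else "ok")
```

Under these assumptions the "base" run is flagged on both counts. That is consistent with its reported totalDecodingFallbacks of 5.0 and with decodingFallback accounting for about 22.3 of the 22.8 seconds in fullPipeline.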