argmaxinc / WhisperKit

On-device Speech Recognition for Apple Silicon
http://argmaxinc.com/blog/whisperkit
MIT License
3.92k stars 330 forks

Problems with "base" model #131

Closed: VProv closed this issue 6 months ago

VProv commented 6 months ago

Hey!

I ran into a problem with the "base" model: it outputs garbage. If I switch to the "small" model, transcription works fine. I am using the jfk.wav file from your tests.

Here is my code:

print("ENTER transcribeFile")
let whisper = try await WhisperKit(model: "small", verbose: true, logLevel: .debug)
guard let url = Bundle.main.url(forResource: "jfk", withExtension: "wav") else {
    print("Failed to locate file in bundle.")
    return
}
// Transcribe the audio file
print("Path to audio", url.path)
let transcriptionResult: [TranscriptionResult] = try await whisper.transcribe(audioPath: url.path)
print("Result: \(transcriptionResult)")

I am using the iOS simulator for iPhone 15 on a Mac with M2, at commit c770b54 of the library.

Output with "small": [WhisperKit.TranscriptionResult(text: "And so my fellow Americans, ask not what your country can do for you. Ask what you can do for your country.", segments: [WhisperKit.TranscriptionSegment(id: 0, seek: 0, start: 0.0, end: 11.0, text: "<|startoftranscript|><|en|><|transcribe|><|0.00|> And so my fellow Americans, ask not what your country can do for you. Ask what you can do for your country.<|11.00|><|endoftext|>", tokens: [50258, 50259, 50359, 50364, 400, 370, 452, 7177, 6280, 11, 1029, 406, 437, 428, 1941, 393, 360, 337, 291, 13, 12320, 437, 291, 393, 360, 337, 428, 1941, 13, 50914, 50257], tokenLogProbs: [[50258: 0.0], [50259: 0.0], [50359: 0.0], [50364: 0.0], [400: -0.23800875], [370: -6.903542], [452: -0.80536205], [7177: -1.7742634], [6280: -1.7742634], [11: -0.047607817], [1029: -0.047607817], [406: -0.32398117], [437: -0.32398117], [428: -0.010956124], [1941: -0.010956124], [393: -0.11716792], [360: -0.11716792], [337: -0.5108057], [291: -0.5108057], [13: -0.44561076], [12320: -0.44561076], [437: -0.3057709], [291: -0.3057709], [393: -0.12305092], [360: -0.12305092], [337: -0.067418754], [428: -0.067418754], [1941: -0.031324066], [13: -0.031324066], [50914: -0.041958194], [50257: -0.041958194]], temperature: 0.0, avgLogprob: -0.5015079, compressionRatio: 1.5625, noSpeechProb: 0.0, words: nil)], language: "en", timings: WhisperKit.TranscriptionTimings(pipelineStart: 736158477.070447, firstTokenTime: 736158478.576726, inputAudioSeconds: 11.0, modelLoading: 5.294854044914246, audioLoading: 0.0036209821701049805, audioProcessing: 0.0002830028533935547, logmels: 0.009383916854858398, encoding: 1.3972361087799072, prefill: 1.0013580322265625e-05, decodingInit: 0.012336015701293945, decodingLoop: 4.044028043746948, decodingPredictions: 2.507355213165283, decodingFiltering: 0.0004220008850097656, decodingSampling: 0.022192716598510742, decodingFallback: 0.0, decodingWindowing: 0.0015540122985839844, decodingKvCaching: 0.03674197196960449, 
decodingWordTimestamps: 0.0, decodingNonPrediction: 0.11589288711547852, totalAudioProcessingRuns: 1.0, totalLogmelRuns: 1.0, totalEncodingRuns: 1.0, totalDecodingLoops: 29.0, totalKVUpdateRuns: 29.0, totalTimestampAlignmentRuns: 0.0, totalDecodingFallbacks: 0.0, totalDecodingWindows: 1.0, fullPipeline: 4.056915044784546))]

Output with "base" is random text; I don't know what the source of the problem is.

Result: [WhisperKit.TranscriptionResult(text: "We have to say that that is. We have had a lot of things with you. I know what we have to say. What we are. We can you have had to say. We are. But what we are. But I think you have to say. We have had to say. You had. To say. We are. We have had to say. To say. We have had to say. It say. I am. You I think we are. It. To say I think you. To say. We have had to. To say. We are. We have had to. I. I have. You have had to say. You have had to. To say. To do you have had to. To. To. I think you I have a lot. You I think. To say. It. It say you. It. It. To do You have had. We had. You I am. I think. You have I. That I am. You I have a lot. I think. We I am. That You have had", segments: [WhisperKit.TranscriptionSegment(id: 0, seek: 0, start: 0.0, end: 11.0, text: "<|startoftranscript|><|en|><|transcribe|><|0.00|> We have to say that that is. We have had a lot of things with you. I know what we have to say. What we are. We can you have had to say. We are. But what we are. But I think you have to say. We have had to say. You had. To say. We are. We have had to say. To say. We have had to say. It say. I am. You I think we are. It. To say I think you. To say. We have had to. To say. We are. We have had to. I. I have. You have had to say. You have had to. To say. To do you have had to. To. To. I think you I have a lot. You I think. To say. It. It say you. It. It. To do You have had. We had. You I am. I think. You have I. That I am. You I have a lot. I think. We I am. 
That You have had<|endoftext|>", tokens: [50258, 50259, 50359, 50364, 492, 362, 281, 584, 300, 300, 307, 13, 492, 362, 632, 257, 688, 295, 721, 365, 291, 13, 286, 458, 437, 321, 362, 281, 584, 13, 708, 321, 366, 13, 492, 393, 291, 362, 632, 281, 584, 13, 492, 366, 13, 583, 437, 321, 366, 13, 583, 286, 519, 291, 362, 281, 584, 13, 492, 362, 632, 281, 584, 13, 509, 632, 13, 1407, 584, 13, 492, 366, 13, 492, 362, 632, 281, 584, 13, 1407, 584, 13, 492, 362, 632, 281, 584, 13, 467, 584, 13, 286, 669, 13, 509, 286, 519, 321, 366, 13, 467, 13, 1407, 584, 286, 519, 291, 13, 1407, 584, 13, 492, 362, 632, 281, 13, 1407, 584, 13, 492, 366, 13, 492, 362, 632, 281, 13, 286, 13, 286, 362, 13, 509, 362, 632, 281, 584, 13, 509, 362, 632, 281, 13, 1407, 584, 13, 1407, 360, 291, 362, 632, 281, 13, 1407, 13, 1407, 13, 286, 519, 291, 286, 362, 257, 688, 13, 509, 286, 519, 13, 1407, 584, 13, 467, 13, 467, 584, 291, 13, 467, 13, 467, 13, 1407, 360, 509, 362, 632, 13, 492, 632, 13, 509, 286, 669, 13, 286, 519, 13, 509, 362, 286, 13, 663, 286, 669, 13, 509, 286, 362, 257, 688, 13, 286, 519, 13, 492, 286, 669, 13, 663, 509, 362, 632, 50257], tokenLogProbs: [[50258: 0.0], [50259: 0.0], [50359: 0.0], [50364: 0.0], [492: -0.46854308], [362: -4.649173e-06], [281: -3.9211664], [584: -2.8411748], [300: -2.8411748], [300: -2.3382592], [307: -2.3382592], [13: -1.2593474], [492: -1.2593474], [362: -2.2327487], [632: -2.2327487], [257: -2.8759716], [688: -2.8759716], [295: -2.248466], [721: -2.248466], [365: -2.7832522], [291: -2.7832522], [13: -2.7364671], [286: -2.7364671], [458: -2.3066056], [437: -2.3066056], [321: -1.2334995], [362: -1.2334995], [281: -2.8644958], [584: -2.8644958], [13: -2.7988439], [708: -2.7988439], [321: -1.6016155], [366: -1.6016155], [13: -0.24205148], [492: -0.24205148], [393: -2.3324401], [291: -2.3324401], [362: -3.216308], [632: -3.216308], [281: -2.767682], [584: -2.767682], [13: -2.2825544], [492: -2.2825544], [366: -3.2878008], [13: -3.2878008], [583: -3.4753323], 
[437: -3.4753323], [321: -2.2468977], [366: -2.2468977], [13: -2.8896592], [583: -2.8896592], [286: -1.9014113], [519: -1.9014113], [291: -1.7707164], [362: -1.7707164], [281: -1.5944285], [584: -1.5944285], [13: -2.0938516], [492: -2.0938516], [362: -3.3726246], [632: -3.3726246], [281: -2.470844], [584: -2.470844], [13: -2.8277507], [509: -2.8277507], [632: -2.6681852], [13: -2.6681852], [1407: -1.5980114], [584: -1.5980114], [13: -3.6893868], [492: -3.6893868], [366: -2.8718534], [13: -2.8718534], [492: -2.9086547], [362: -2.9086547], [632: -2.1391528], [281: -2.1391528], [584: -1.335645], [13: -1.335645], [1407: -1.0163437], [584: -1.0163437], [13: -0.960848], [492: -0.960848], [362: -1.6773031], [632: -1.6773031], [281: -2.3030858], [584: -2.3030858], [13: -1.9672419], [467: -1.9672419], [584: -3.4776788], [13: -3.4776788], [286: -3.0334747], [669: -3.0334747], [13: -1.7947521], [509: -1.7947521], [286: -1.5359652], [519: -1.5359652], [321: -1.4024606], [366: -1.4024606], [13: -3.08361], [467: -3.08361], [13: -2.7582512], [1407: -2.7582512], [584: -1.2870092], [286: -1.2870092], [519: -2.5009592], [291: -2.5009592], [13: -2.0639248], [1407: -2.0639248], [584: -1.4796957], [13: -1.4796957], [492: -1.2754928], [362: -1.2754928], [632: -0.95180154], [281: -0.95180154], [13: -1.7178613], [1407: -1.7178613], [584: -1.5293169], [13: -1.5293169], [492: -1.0799311], [366: -1.0799311], [13: -0.4571549], [492: -0.4571549], [362: -0.81585073], [632: -0.81585073], [281: -0.61659545], [13: -0.61659545], [286: -3.0646272], [13: -3.0646272], [286: -3.2575703], [362: -3.2575703], [13: -3.6666787], [509: -3.6666787], [362: -3.2766993], [632: -3.2766993], [281: -1.0730554], [584: -1.0730554], [13: -0.9041693], [509: -0.9041693], [362: -1.7604982], [632: -1.7604982], [281: -2.442537], [13: -2.442537], [1407: -1.1728677], [584: -1.1728677], [13: -1.7017894], [1407: -1.7017894], [360: -1.3470659], [291: -1.3470659], [362: -0.6904316], [632: -0.6904316], [281: -0.2578569], [13: 
-0.2578569], [1407: -0.90630615], [13: -0.90630615], [1407: -0.588514], [13: -0.588514], [286: -3.0094783], [519: -3.0094783], [291: -0.87447596], [286: -0.87447596], [362: -0.76001346], [257: -0.76001346], [688: -1.6966609], [13: -1.6966609], [509: -1.3823335], [286: -1.3823335], [519: -0.49830183], [13: -0.49830183], [1407: -0.29699734], [584: -0.29699734], [13: -0.82159287], [467: -0.82159287], [13: -0.55060595], [467: -0.55060595], [584: -3.6784413], [291: -3.6784413], [13: -2.5708647], [467: -2.5708647], [13: -0.7152253], [467: -0.7152253], [13: -3.4809964], [1407: -3.4809964], [360: -3.357663], [509: -3.357663], [362: -1.7842449], [632: -1.7842449], [13: -3.1116853], [492: -3.1116853], [632: -3.2315176], [13: -3.2315176], [509: -1.6467949], [286: -1.6467949], [669: -2.516142], [13: -2.516142], [286: -2.322375], [519: -2.322375], [13: -1.2451215], [509: -1.2451215], [362: -3.4987159], [286: -3.4987159], [13: -2.2404675], [663: -2.2404675], [286: -2.9150012], [669: -2.9150012], [13: -1.307585], [509: -1.307585], [286: -2.7741168], [362: -2.7741168], [257: -2.1526089], [688: -2.1526089], [13: -2.386625], [286: -2.386625], [519: -2.757761], [13: -2.757761], [492: -3.302908], [286: -3.302908], [669: -1.1213849], [13: -1.1213849], [663: -1.0268766], [509: -1.0268766], [362: -2.0226188], [632: -2.0226188], [50257: -2.0808136]], temperature: 1.0, avgLogprob: -1.9937032, compressionRatio: 4.0744185, noSpeechProb: 0.0, words: nil)], language: "en", timings: WhisperKit.TranscriptionTimings(pipelineStart: 736158939.772971, firstTokenTime: 736158940.323336, inputAudioSeconds: 11.0, modelLoading: 1.680719017982483, audioLoading: 0.0016698837280273438, audioProcessing: 0.0003139972686767578, logmels: 0.01194608211517334, encoding: 0.49132096767425537, prefill: 1.3947486877441406e-05, decodingInit: 0.006914973258972168, decodingLoop: 22.78679096698761, decodingPredictions: 19.62394654750824, decodingFiltering: 0.006839156150817871, decodingSampling: 0.46537697315216064, 
decodingFallback: 22.280248880386353, decodingWindowing: 0.0013300180435180664, decodingKvCaching: 0.3391835689544678, decodingWordTimestamps: 0.0, decodingNonPrediction: 2.4722235202789307, totalAudioProcessingRuns: 1.0, totalLogmelRuns: 1.0, totalEncodingRuns: 1.0, totalDecodingLoops: 762.0, totalKVUpdateRuns: 762.0, totalTimestampAlignmentRuns: 0.0, totalDecodingFallbacks: 5.0, totalDecodingWindows: 1.0, fullPipeline: 22.793906927108765))]

atiorh commented 6 months ago

Hi @VProv, I tried reproducing your issue using our Python wrapper as follows:

from whisperkit.pipelines import WhisperKit
pipe = WhisperKit(code_commit_hash="c770b54", whisper_version="openai/whisper-base", out_dir="external")
pipe.transcribe("jfk.wav")

which produced the correct result. This makes me think the simulator is what is causing the issue.

> I use an iOS simulator for iPhone 15 on Mac with M2

A Mac with M2 is fine.

We use the iOS simulator in our unit tests as well. What is the iOS version you are using?

atiorh commented 6 months ago

@VProv Please let us know if you are still hitting this issue and what the simulator iOS version is. Otherwise, I will close this issue in a few days. Thanks!

VProv commented 6 months ago

Hey! I am using iOS Deployment Target 17.2. I have switched to the "small" model for now.

atiorh commented 6 months ago

Thanks for letting us know. If the issue surfaces again, please reopen the issue :)

Willyoung2017 commented 4 months ago

@atiorh This issue happens when the compute units for both the Audio Encoder and the Text Decoder are set to CPU, which is always the case when running in the iOS simulator. The whisper-base and whisper-base.en models keep generating useless, repeated output in both Transcribe and Stream mode. The whisper-tiny and whisper-small models also fail in Stream mode.

@VProv hit the issue with whisper-base because, in the iOS simulator, the compute units default to CPU:

if WhisperKit.isRunningOnSimulator {
    self.melCompute = .cpuOnly
    self.audioEncoderCompute = .cpuOnly
    self.textDecoderCompute = .cpuOnly
    self.prefillCompute = .cpuOnly
    return
}
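
For comparison, on a physical device you can try explicitly requesting non-CPU compute units at initialization. This is only a sketch: it assumes WhisperKit's `ModelComputeOptions` type and the `computeOptions:` initializer parameter as they appear in the source around this version, so verify the names against the release you are using.

```swift
import WhisperKit

// Sketch (unverified against your exact version): request GPU/Neural Engine
// for the Audio Encoder and Text Decoder instead of CPU-only.
let computeOptions = ModelComputeOptions(
    audioEncoderCompute: .cpuAndNeuralEngine,
    textDecoderCompute: .cpuAndNeuralEngine
)
let whisper = try await WhisperKit(
    model: "base",
    computeOptions: computeOptions,
    verbose: true
)
```

Note that this cannot help inside the simulator: the `isRunningOnSimulator` check shown above overwrites the compute units with `.cpuOnly` and returns early.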

This behavior also appears in the TestFlight app on my iPhone when both compute units are set to CPU. (Screenshot: IMG_5246)