argmaxinc / WhisperKit

On-device Speech Recognition for Apple Silicon
http://argmaxinc.com/blog/whisperkit
MIT License
3.87k stars, 329 forks

Issue getting keyword timestamps + not getting results on simulator #189

Open day-dreaming-guy opened 3 months ago

day-dreaming-guy commented 3 months ago

Hey! Thank you for this great repo.

I'd love some help with the following issues:

Sample code:

    import WhisperKit

    // Use CPU because we are testing in the simulator; this should not affect the result [?]
    let modelComputeOptions = ModelComputeOptions(
        melCompute: .cpuOnly,
        audioEncoderCompute: .cpuOnly,
        textDecoderCompute: .cpuOnly,
        prefillCompute: .cpuOnly
    )

    Task {
        do {
            let pipe = try await WhisperKit(computeOptions: modelComputeOptions)
            // "vocals.wav" is bundled with the app, so the force-unwrap is expected to succeed.
            let path = Bundle.main.url(forResource: "vocals", withExtension: "wav")!.path
            let transcriptions = try await pipe.transcribe(audioPath: path)
            // Print each segment's text with its start and end time in seconds.
            for transcription in transcriptions {
                for segment in transcription.segments {
                    let startTime = segment.start
                    let endTime = segment.end
                    let text = segment.text
                    print("Text: \(text), Start Time: \(startTime), End Time: \(endTime)")
                }
            }
        } catch {
            print(error.localizedDescription)
        }
    }

Song used: song.wav.zip

Output:

Text: <|startoftranscript|><|en|><|transcribe|><|0.00|><|endoftext|>, Start Time: 0.0, End Time: 30.0

Text: <|startoftranscript|><|en|><|transcribe|><|0.00|> The I'm I'm I'm the I'm the I'm I'm I'm I'm I'm I'm I'm the I'm I'm I am I'm the I'm the I am I'm I'm I'm I'm my I'm I'm my I'm my I'm the I'm my I'm I'm I'm the the you know I'm I'm my I'm I don't know how I think I'm not I don't know I'm I'm I think I think I don I like I like the I'm I'm I'm I'm I like you I'm I'm I'm I'm I'm I'm I'm I'm I'm I'm I'm I'm I'm I'm I'm I'm I'm I love I love I love my I love my I'm I'm I'm I'm I'm the I love I love I love I'm love I love I like I like I love my I'm I'm you I'm you can you go go go go go go go go go go go go go go go go<|endoftext|>, Start Time: 30.0, End Time: 60.0

Text: <|startoftranscript|><|en|><|transcribe|><|0.00|> I'm in the way. The way. The way.<|endoftext|>, Start Time: 60.0, End Time: 90.0

Text: <|startoftranscript|><|en|><|transcribe|><|0.00|> The fact that the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the I think I like the I think the the the the the the the the the the the the the the the the the the the the the the the the the the the the you can do do do do do do do do do<|endoftext|>, Start Time: 90.0, End Time: 120.0

Text: <|startoftranscript|><|en|><|transcribe|><|0.00|> >> >><|endoftext|>, Start Time: 120.0, End Time: 150.0

Text: <|startoftranscript|><|en|><|transcribe|><|0.00|><|endoftext|>, Start Time: 150.0, End Time: 180.0

Text: <|startoftranscript|><|en|><|transcribe|><|0.00|> I'm a little bit. I'm a little bit. I'm a little bit. I'm a little bit I'm a little bit. I think we've had an amazing feeling that we were all in our time. We were all in my mom we were all in my mom we were all in our time we are all in our time. We were all in my mom we were all in my mom We were both in my mom We were both in my mom we were all in my mom We were all in her. She was all in my mom We were all in my mom We were all in my mom We were all in my mom we were all in my mom we were all in my mom We were all in my mom We were all in our mom we were all in my mom We were all in my mom We were all in my mom We were all in my mom We were all in my mom We were all in my mom We were all in my mom We were all we were all in my mom We were all in my mom We were all in my mom We were<|endoftext|>, Start Time: 180.0, End Time: 210.0

Text: <|startoftranscript|><|en|><|transcribe|><|0.00|> >> So you can. >> You have. You have been. We can do you have been.<|endoftext|>, Start Time: 210.0, End Time: 215.55081
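The segment text in the output above still carries Whisper's special tokens (`<|startoftranscript|>`, `<|en|>`, `<|0.00|>`, `<|endoftext|>`). If only the spoken text is wanted, one option is to strip them after transcription. The sketch below uses plain Foundation; `strippingSpecialTokens` is a hypothetical helper written for illustration and is not part of WhisperKit, and it assumes every special token has the `<|...|>` shape seen in this output.

```swift
import Foundation

/// Removes Whisper-style special tokens such as <|startoftranscript|>, <|en|>,
/// or <|0.00|> from a segment's text and trims surrounding whitespace.
/// Hypothetical helper for illustration; not a WhisperKit API.
func strippingSpecialTokens(_ text: String) -> String {
    // Matches "<|", then any run of characters other than "|", then "|>".
    let pattern = "<\\|[^|]*\\|>"
    return text
        .replacingOccurrences(of: pattern, with: "", options: .regularExpression)
        .trimmingCharacters(in: .whitespacesAndNewlines)
}

// One of the segments from the output above.
let raw = "<|startoftranscript|><|en|><|transcribe|><|0.00|> I'm in the way. The way. The way.<|endoftext|>"
print(strippingSpecialTokens(raw))  // prints "I'm in the way. The way. The way."
```

This only cleans up the printed text; it does nothing about the repeated or hallucinated words in the transcription itself.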

ZachNagengast commented 2 months ago

Are you getting similar issues on a real device? The models are optimized for the ANE and GPU compute units and do not perform accurately on CPU. CPU-only inference will still output tokens and can be somewhat useful during development, but we recommend doing most of your testing on a real device.
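One way to act on this advice is to fall back to CPU only when the build targets the simulator, and let Core ML choose the compute units on a real device. The sketch below is a minimal illustration, not WhisperKit's own logic: `preferredComputeUnits` is a hypothetical helper, and using `.all` on device is an assumption rather than a recommendation from the maintainers.

```swift
import CoreML

/// Picks Core ML compute units depending on where the app is running.
/// Hypothetical helper for illustration; not part of WhisperKit.
func preferredComputeUnits() -> MLComputeUnits {
    #if targetEnvironment(simulator)
    // Simulator build: use CPU only, as in the sample code at the top of this thread.
    return .cpuOnly
    #else
    // Real device: let Core ML pick among CPU, GPU, and Neural Engine (assumed choice).
    return .all
    #endif
}

// Usage with the initializer from the sample in this thread (requires WhisperKit):
// let units = preferredComputeUnits()
// let modelComputeOptions = ModelComputeOptions(
//     melCompute: units,
//     audioEncoderCompute: units,
//     textDecoderCompute: units,
//     prefillCompute: units
// )
```

With this in place the same code path runs in both environments, and the simulator results are treated as a smoke test while accuracy is judged on device.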