argmaxinc / WhisperKit

On-device Speech Recognition for Apple Silicon
http://argmaxinc.com/blog/whisperkit
MIT License

Making Sense of TranscriptionSegment IDs #258

Open bradleyandrew opened 2 weeks ago

bradleyandrew commented 2 weeks ago

Hello,

Thank you very much for all of your work on this project, it's fantastic!

I've been building something around it, exploring how it works, and I have a question that I'm hoping you can help with. In my context I am using Stream Only, essentially taking live audio from a microphone source and transcribing it in real time.

When I call 'transcribeAudioSamples', based on the example project, I get a TranscriptionResult? as the return value. My current logic looks at TranscriptionResult -> 'segments', which is an array: [TranscriptionSegment]. I assume that the last segment is the most recent and use it to update my UI.

In the example app, a ForEach enumerates the 'confirmedSegments' and draws them. The problem I find with this approach is that SwiftUI re-draws the 'Text' each time, so when the UI updates the text gets replaced, which doesn't feel fluid when dealing with longer lines of text.

The approach I have taken in my implementation is to store a reference to the TranscriptionSegment -> id, which is essentially an index. Then, when the TranscriptionResult is updated, it refers to the previous segment via its id and replaces the old text with the new text in the UI. This assumes that the newest text is always the most accurate, which I have found to be correct most of the time: the initial results are a bit rough and then they are refined shortly after.
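My update logic is roughly the sketch below (the view model type and property names are simplified for illustration):

    import SwiftUI
    import WhisperKit

    // Simplified illustration of the update-by-id approach described above.
    @MainActor
    final class LiveTranscriptViewModel: ObservableObject {
        // Keyed by TranscriptionSegment.id; newer text for a given id replaces the old text.
        @Published var segmentText: [Int: String] = [:]

        func apply(_ result: TranscriptionResult?) {
            guard let segments = result?.segments else { return }
            for segment in segments {
                // Assumes the latest text for a given id is the most refined version.
                segmentText[segment.id] = segment.text
            }
        }
    }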

I assumed that the id would increment sequentially; this is usually the case, but not always. Given that Whisper works in 30s chunks, this seems to be when the segment switches over: 0s - 30s will be id: 0, then 30s - 60s will be id: 1, etc.

Though sometimes, when feeding it a lot of words/talking in a shorter time period, it will jump up to id: 1 at, say, the 17s mark rather than 30s, then it will jump back to id: 0. Other times I've seen it jump from id: 0 -> id: 4 and then back down to id: 1. This is in terms of what is being returned from WhisperKit as a TranscriptionResult?. Any idea what is going on here?

When it jumps around like this the text is scattered all over the place. Sometimes the start of a spoken sentence is in id: 0, the middle of the sentence goes up to the segment with id: 1, and the tail end of the sentence is back in id: 0. When it hits the 30s mark it seems to revert to normal.

Any thoughts you can offer would be greatly appreciated.

Thanks!

ZachNagengast commented 2 weeks ago

Good questions. The id is actually only relevant for a single transcription run, so in streaming mode it will reset to 0 every time. However, the segments will all come in sequentially in that mode, so you'd be best off tracking them outside of the library. You could extend the result object to hold the latest seek time and sort by that, but you'll need a secondary sort for transcriptions that have multiple segments, where the id will be valid. Otherwise, in our example app we just store them all as they get confirmed using the end timestamp value and ignore the id value. You can see that logic here https://github.com/argmaxinc/WhisperKit/blob/e8eebbe0641af380e03119f6d9cfa7551ba6440f/Examples/WhisperAX/WhisperAX/Views/ContentView.swift#L1376 and here https://github.com/argmaxinc/WhisperKit/blob/e8eebbe0641af380e03119f6d9cfa7551ba6440f/Examples/WhisperAX/WhisperAX/Views/ContentView.swift#L1562. We have plans to bring this into the library, where it should be smoother, but that is a pending todo.
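To illustrate the general idea, something roughly like this (illustrative only, not the exact WhisperAX code) would let you accumulate and order segments by end timestamp instead of by id:

    import WhisperKit

    // Illustrative sketch: collect segments across streaming runs and order them by
    // their end timestamp, ignoring the per-run `id` value entirely.
    struct StreamedTranscript {
        private(set) var segments: [TranscriptionSegment] = []

        mutating func merge(_ result: TranscriptionResult?) {
            guard let newSegments = result?.segments else { return }
            for segment in newSegments {
                // Replace any previously stored segment covering the same time range
                // with the newer (usually more refined) text.
                segments.removeAll { $0.end == segment.end }
                segments.append(segment)
            }
            segments.sort { $0.end < $1.end }
        }

        var text: String {
            segments.map(\.text).joined()
        }
    }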

bradleyandrew commented 2 weeks ago

Thank you for the explanation Zach.

Having the index reset each time a new stream transcription run starts makes sense, and coming in sequentially is how I have seen them behave, except for some edge cases where it will jump from index 0 to index 4, back to index 0, and then continue to index 1, etc. I'm not sure why that is occurring and I haven't found any reliable steps to reproduce it.

I did see the 'lastConfirmedSegment' approach in the example app; looking at it closer now, using the timestamps makes sense. Though it was unclear why 'requiredSegmentsForConfirmation' is 4 by default. Assuming that each segment is 30s, that would mean you don't get a confirmed segment until 2 minutes in. This is not the case, however, as I saw the example app confirm segments (via the transition from light grey -> black) much earlier; in those cases the confirmed segments are much shorter than the 30s window.

I did have a few other observations and queries. I'm not sure if here is the best place to discuss them as they differ from the index issue I mentioned above, but I'll quickly cover them anyway.


When I initiate WhisperKit I do it as follows:

    let config = WhisperKitConfig(model: settings.selectedModel, prewarm: true, load: true)
    whisperKit = try? await WhisperKit(config)

This happens as soon as the app launches so that the model is ready to go as soon as possible. In this case, settings.selectedModel = "base.en"

When I am connected to the internet this works fine: it will load the model from disk if present, and if not it will download the model from Hugging Face. However, if I am not connected to Wi-Fi or cellular and the model has already been downloaded to disk, it will not initialize. If the device is offline it needs to be initialized as follows:

    let config = WhisperKitConfig(model: settings.selectedModel, modelFolder: localModelFolderURL(forModel: settings.selectedModel).relativePath, prewarm: true)
    whisperKit = try? await WhisperKit(config)

This is how I fetch the local model folder:

    private func localModelFolderURL(forModel model: String) -> URL {
        return documentDirectoryURL()
            .appendingPathComponent("huggingface")
            .appendingPathComponent("models")
            .appendingPathComponent("argmaxinc")
            .appendingPathComponent("whisperkit-coreml")
            .appendingPathComponent("openai_whisper-\(model)")
    }
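(documentDirectoryURL() here is just a small helper, roughly:)

    private func documentDirectoryURL() -> URL {
        // The app's Documents directory, which the folder layout above is relative to.
        return FileManager.default.urls(for: .documentDirectory, in: .userDomainMask).first!
    }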

Given that this is the case, the logic I have implemented in my app uses the Reachability library: https://github.com/ashleymills/Reachability.swift

Essentially it checks whether the device is connected to the internet via Reachability; if so, it inits WhisperKit via the online method, and if not, it inits via the offline method, as sketched below. Perhaps I am misusing the library or misunderstanding something, but is there a better way to do this?
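In code the selection is roughly this (simplified):

    import Reachability
    import WhisperKit

    // Simplified sketch of my current approach: pick the init path based on connectivity.
    func makeWhisperKit() async -> WhisperKit? {
        let connection = (try? Reachability())?.connection
        let isOnline = connection == .wifi || connection == .cellular
        let config: WhisperKitConfig
        if isOnline {
            // Online: let WhisperKit resolve the model by name (downloading if needed).
            config = WhisperKitConfig(model: settings.selectedModel, prewarm: true, load: true)
        } else {
            // Offline: point directly at the already-downloaded model folder.
            config = WhisperKitConfig(
                model: settings.selectedModel,
                modelFolder: localModelFolderURL(forModel: settings.selectedModel).relativePath,
                prewarm: true
            )
        }
        return try? await WhisperKit(config)
    }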


I appreciate the new Model State Callback; it's super helpful for issuing UI updates based on what the model is doing behind the scenes. One issue I have come across is that the callback is not active until WhisperKit is actually initialized, even though during init WhisperKit can move a model from 'downloading' -> 'downloaded' -> 'prewarming', etc.

My example would be as follows:

            whisperKit = try? await WhisperKit(config)

            // Model State Callback
            whisperKit?.modelStateCallback = { (oldState: ModelState?, newState: ModelState) in
                self.debug("Model State Changed: \(oldState?.description ?? "nil") -> \(newState.description)")
                DispatchQueue.main.async {
                    self.modelState = newState
                }
            }

I wonder if there is a way to pass the Model State Callback into the config or the initializer so that it can be used immediately?


One last thing: I'm curious as to why the 'medium' and 'medium.en' models are not included in the WhisperKit Hugging Face repo? I get that I can create my own, but the default options are fantastic and it'd be nice to have these as options if possible.

Thanks Zach!

ZachNagengast commented 2 weeks ago

it was unclear why 'requiredSegmentsForConfirmation' is 4 by default?

This is a heuristic we came up with that seemed reasonable, but it is adjustable for your use case. Normally a 30s chunk will have more than one segment, but in some cases (usually at the start) it will do the entire 30s in one segment. This is actually a todo to fix, since our current pipeline has some kind of bug where it doesn't find as many timestamp tokens in the first window, but finds them more often in subsequent windows. The goal is just to make sure that the input audio is not partially transcribed with hallucinations at the end (whisper tends to repeat tokens at the very end of a partial window), but the unconfirmed segments tend to have decent results anyway. Eager mode has a different confirmation scheme altogether, which relies on local agreement between runs. You can test transcribing in these two different modes in the settings of WhisperAX, in the experimental section at the bottom.
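For reference, the confirmation heuristic boils down to something like this (illustrative, not the exact WhisperAX code):

    import WhisperKit

    // Illustrative version of the heuristic: once a run returns more segments than the
    // threshold, everything except the last `requiredSegmentsForConfirmation` segments
    // is treated as confirmed; the tail stays unconfirmed until later runs push it out.
    func splitSegments(
        _ segments: [TranscriptionSegment],
        requiredSegmentsForConfirmation: Int = 4
    ) -> (confirmed: [TranscriptionSegment], unconfirmed: [TranscriptionSegment]) {
        guard segments.count > requiredSegmentsForConfirmation else {
            return ([], segments)
        }
        let confirmedCount = segments.count - requiredSegmentsForConfirmation
        return (Array(segments.prefix(confirmedCount)), Array(segments.suffix(requiredSegmentsForConfirmation)))
    }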

Though if I am not connected to WIFI or Cellular and the model has already been downloaded on disk, it will not initiate.

This may be a regression, but we do have an init config parameter, download: false, that you may be able to set so that the library will init and you can then call loadModels() separately. If it's not finding the models even then while offline, we should fix this.
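Something along these lines should work when offline (untested sketch, combined with the local modelFolder path you already have):

    // Untested sketch: skip any download attempt at init, then load the
    // already-downloaded models explicitly.
    let config = WhisperKitConfig(
        model: settings.selectedModel,
        modelFolder: localModelFolderURL(forModel: settings.selectedModel).relativePath,
        prewarm: true,
        load: false,
        download: false
    )
    whisperKit = try? await WhisperKit(config)
    try? await whisperKit?.loadModels()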

I wonder if there is a way to pass the Model State Callback into the config or the initializer so that it can be used immediately?

This is a good callout; we also want to do this for the recent callbacks added with #240. #help-wanted

I'm curious as to why the 'medium' and 'medium.en' models are not included in the WhisperKit Hugging Face Repo?

These models are unfortunately not good candidates for Core ML, but small is pretty close in terms of WER, and the recent turbo v20241018 is a decent alternative as well.

atiorh commented 2 weeks ago

These models are unfortunately not good candidates for Core ML, but small is pretty close in terms of WER, and the recent turbo v20241018 is a decent alternative as well.

To expand on this: ANECompiler generates an incorrect program for the medium model specifically, so we decided not to support it out of an abundance of caution (running with cpuAndGPU still works as expected, but it is not clean support). We definitely recommend either the small or large-v3-v20240930 variants instead of medium.