argmaxinc / WhisperKit

On-device Speech Recognition for Apple Silicon
https://takeargmax.com/blog/whisperkit
MIT License

Streaming Microphone for CLI #35

Closed · jkrukowski closed this 7 months ago

jkrukowski commented 7 months ago

This PR:

- ~~AudioWarper~~ AudioStreamTranscriber can replace the streaming logic in the example app as well

Resolves: https://github.com/argmaxinc/WhisperKit/issues/25

jkrukowski commented 7 months ago

@ZachNagengast thanks for your comments, added some changes, lmk what you think

> I'm curious about the Transcriber protocol, can you elaborate on your use case for that?

In order to do the transcription in AudioStreamTranscriber I'd need to pass the whole WhisperKit class there. I'd rather pass an object that contains just the methods I need, and the Transcriber protocol could be a first step towards that. I imagine a separate class that implements the Transcriber protocol and contains the transcribe methods that currently live in WhisperKit. This way I could depend only on the methods I actually need.
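For illustration, a rough sketch of the kind of protocol I have in mind (names and signatures are placeholders, not necessarily what would land in WhisperKit):

```swift
import WhisperKit

// Hypothetical sketch: a narrow protocol exposing only the transcription
// entry point that AudioStreamTranscriber needs, so it doesn't have to
// hold the whole WhisperKit instance. The signature is a placeholder and
// may not match WhisperKit's actual transcribe methods exactly.
protocol Transcriber {
    func transcribe(
        audioArray: [Float],
        decodeOptions: DecodingOptions?
    ) async throws -> TranscriptionResult?
}

// WhisperKit (or a thin wrapper around it) could then conform, and
// AudioStreamTranscriber would accept `any Transcriber` in its initializer:
//
//     extension WhisperKit: Transcriber {}
//     let streamTranscriber = AudioStreamTranscriber(transcriber: whisperKit, ...)
```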

> Since we have this new AudioWarper here now, do you think there's any recording code in AudioProcessor that would fit into here as well? Would be nice to have a few debug logs from Logging.debug in this section as well.

Added more logging and changed the name to AudioStreamTranscriber.
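Purely as an illustration of the kind of debug logging added (the actual call sites and messages in the PR differ):

```swift
import WhisperKit

// Illustrative only: a Logging.debug call around the streaming buffer so
// --verbose runs show what the transcriber is doing. The function name
// and message below are made up for this example.
func didAppendAudio(sampleCount: Int, bufferSeconds: Double) {
    Logging.debug("Appended \(sampleCount) samples, buffer is now \(bufferSeconds) seconds")
}
```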

> One other thing: I think the microphone streaming should be explicit in `swift run transcribe --model-path "Models/whisperkit-coreml/openai_whisper-large-v3"`, e.g. via a `--stream` boolean argument. The reason is to ideally give people a heads up if they forgot to include `--audio-path`, and to only request the microphone if we are sure they want to stream.

Added a `--stream` flag.
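For context, a minimal sketch of what that looks like, assuming the CLI declares its arguments with swift-argument-parser (the real `transcribe` command has many more options than shown here):

```swift
import ArgumentParser

// Simplified sketch of the CLI arguments; the real transcribe command
// has many more options than the ones shown here.
struct TranscribeCommand: AsyncParsableCommand {
    @Option(help: "Path to the folder of Core ML models")
    var modelPath: String

    @Option(help: "Path to the audio file to transcribe")
    var audioPath: String?

    @Flag(help: "Stream audio from the microphone instead of reading --audio-path")
    var stream: Bool = false

    func run() async throws {
        if stream {
            // Request microphone access and start streaming transcription.
            print("Starting microphone stream...")
        } else if let audioPath {
            // Transcribe the file at audioPath (transcription itself not shown).
            print("Transcribing \(audioPath)...")
        } else {
            throw ValidationError("Provide --audio-path, or pass --stream to use the microphone.")
        }
    }
}
```

Streaming would then be invoked explicitly, e.g. `swift run transcribe --model-path "Models/whisperkit-coreml/openai_whisper-large-v3" --stream`.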

> This should allow significant cleanup for the example apps with this shared interface, but that can happen separately, nicely done!

I could work on this cleanup ofc

ZachNagengast commented 7 months ago

Alright, I tried this out and just have some minor tweaks to the UI:

Everything else looks good, I'll try to help with this as well after word timestamps.

jkrukowski commented 7 months ago
> - When running initially, it should output some info about the model status to the CLI, such as "Loading models..." etc.

I'm using the model loading function from WhisperKit. Should I change the Logging calls there to print, or did you have something else in mind?
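Just to make the option concrete, a sketch of printing status around loading in the CLI (illustrative only; `loadModels()` stands in for whatever the actual loading entry point is, and that path currently logs via Logging rather than print):

```swift
import WhisperKit

// Illustrative sketch only: surface model status on stdout so the CLI
// isn't silent while models load. loadModels() is a stand-in for the
// actual loading call used by the CLI.
func loadModelsWithStatus(_ whisperKit: WhisperKit, modelPath: String) async throws {
    print("Loading models from \(modelPath)...")
    try await whisperKit.loadModels()
    print("Models loaded.")
}
```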

> - It would be ideal to find a way to not print to the CLI every loop when not in `--verbose` mode; that makes it hard to use as piped input to other CLI commands (like outputting stdout to a file). Instead, it could print only the new unconfirmed segments, or possibly even replace the current line with `currentText` for a more live output. Lmk what you think.

I changed the state change callback in AudioStreamTranscriber so that it now prints only if currentText or the unconfirmed/confirmed segments have changed. Would that be ok?
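Roughly the idea, as a self-contained sketch (the state type and callback shape below are simplified stand-ins, not WhisperKit's actual AudioStreamTranscriber API):

```swift
// Simplified stand-in for the stream transcriber's state; the real type
// has more fields (buffer sizes, timings, etc.).
struct StreamState: Equatable {
    var currentText: String = ""
    var confirmedSegmentTexts: [String] = []
    var unconfirmedSegmentTexts: [String] = []
}

// Print only when the text-bearing parts of the state actually changed,
// so piping stdout elsewhere isn't flooded with identical lines.
func stateChangeCallback(oldState: StreamState, newState: StreamState) {
    guard newState != oldState else { return }
    if newState.confirmedSegmentTexts != oldState.confirmedSegmentTexts {
        print("Confirmed: \(newState.confirmedSegmentTexts.joined(separator: " "))")
    }
    if newState.unconfirmedSegmentTexts != oldState.unconfirmedSegmentTexts {
        print("Unconfirmed: \(newState.unconfirmedSegmentTexts.joined(separator: " "))")
    }
    if newState.currentText != oldState.currentText {
        print("Current: \(newState.currentText)")
    }
}
```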

atiorh commented 7 months ago

Great work @jkrukowski and @ZachNagengast! We can work on improving the output formatting next week in a separate PR.