Closed jkrukowski closed 5 months ago
Prediction side of this looks great, nicely done. I left a few comments on how it could be cleaned up a bit.
There is one missing part, which you can either include in this PR or add later: the actual chunking strategy for a continuous audio array. We will want to use the VAD to select breakpoints as close to 30s as possible that do not cut off any speech in progress, perhaps with a threshold for how much silence to aim for when making our cuts (lots of experimentation to do here). Note that if you choose to do that in a later PR, some of the comments below can be skipped, and this can simply be multiple-file batching. Without a VAD-based chunking strategy we wouldn't want to chunk single files at all, because it would degrade accuracy.
I'd rather leave VAD for a separate PR if that's ok
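For reference, the chunking idea discussed above (cut as close to 30s as possible, but only inside silence) could be sketched roughly as follows. This is an illustration only, not WhisperKit code: `SpeechSegment`, `chunkBoundaries`, `maxChunk` and `minSilence` are hypothetical names, the speech segments are assumed to come from some VAD, and the 0.3s silence threshold is an arbitrary placeholder that would need experimentation.

```swift
/// A span of detected speech, in seconds. Hypothetical type; assumed to be produced by a VAD.
struct SpeechSegment {
    let start: Double
    let end: Double
}

/// Returns cut points (in seconds) so that each chunk is at most `maxChunk` long and,
/// where possible, every cut falls inside a silent gap of at least `minSilence`.
func chunkBoundaries(
    speech: [SpeechSegment],
    totalDuration: Double,
    maxChunk: Double = 30.0,
    minSilence: Double = 0.3
) -> [Double] {
    // Collect silent gaps between speech segments that are long enough to cut in.
    var gaps: [(start: Double, end: Double)] = []
    var cursor = 0.0
    for segment in speech.sorted(by: { $0.start < $1.start }) {
        if segment.start - cursor >= minSilence {
            gaps.append((start: cursor, end: segment.start))
        }
        cursor = max(cursor, segment.end)
    }
    if totalDuration - cursor >= minSilence {
        gaps.append((start: cursor, end: totalDuration))
    }

    var boundaries: [Double] = []
    var chunkStart = 0.0
    while totalDuration - chunkStart > maxChunk {
        let limit = chunkStart + maxChunk
        var best: Double? = nil
        for gap in gaps {
            // Prefer the middle of the gap, but never go past the chunk limit.
            let candidate = min((gap.start + gap.end) / 2, limit)
            if candidate > chunkStart && candidate >= gap.start {
                best = max(best ?? candidate, candidate)
            }
        }
        // No usable silence in range: fall back to a hard cut at the limit.
        let cut = best ?? limit
        boundaries.append(cut)
        chunkStart = cut
    }
    return boundaries
}
```

The hard-cut fallback is exactly the case that would degrade accuracy, so a real implementation would probably want to handle it differently (for example by lowering the silence threshold first).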
Should partially resolve https://github.com/argmaxinc/WhisperKit/issues/97
- concurrentWorkerCount: responsible for controlling the number of concurrent tasks
- Transcriber protocol; made WhisperKit an open class
- TranscribeTask class responsible for transcribing an audio chunk to text; moved all the logic there
This is what the signposts look like in Instruments (for processing 5 audio files): [screenshot]
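To illustrate what a setting like concurrentWorkerCount controls, here is a minimal sketch of capping the number of in-flight tasks with a task group. It is not the implementation from this PR: `mapConcurrently` and its parameters are hypothetical names, and the only assumption taken from the description is that the count bounds how many tasks run at once.

```swift
/// Runs `work` over `items` with at most `concurrentWorkerCount` tasks in flight
/// and returns the results in input order. Hypothetical helper for illustration.
func mapConcurrently<Input: Sendable, Output: Sendable>(
    _ items: [Input],
    concurrentWorkerCount: Int,
    work: @escaping @Sendable (Input) async throws -> Output
) async throws -> [Output] {
    let limit = max(1, concurrentWorkerCount)
    var results = [Output?](repeating: nil, count: items.count)

    try await withThrowingTaskGroup(of: (Int, Output).self) { group in
        var nextIndex = 0

        // Seed the group with up to `limit` tasks.
        while nextIndex < min(limit, items.count) {
            let i = nextIndex
            let item = items[i]
            group.addTask { (i, try await work(item)) }
            nextIndex += 1
        }

        // Each time a task finishes, store its result and start the next item, if any.
        while let (index, output) = try await group.next() {
            results[index] = output
            if nextIndex < items.count {
                let i = nextIndex
                let item = items[i]
                group.addTask { (i, try await work(item)) }
                nextIndex += 1
            }
        }
    }

    // Every slot has been filled by this point, so nothing is dropped here.
    return results.compactMap { $0 }
}
```

A call such as `try await mapConcurrently(audioPaths, concurrentWorkerCount: 4) { path in ... }` would then process at most four files at a time (`audioPaths` is again a placeholder name).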
Some benchmarks on my MacBook Air M1 (running in release mode with the tiny model, time swift run -c release whisperkit-cli transcribe [...]):
Using the Alice.mp3 file provided by @ZachNagengast:
API changes
Deprecations
WhisperKit
1.
Deprecated
use instead
2.
Deprecated
use instead
TextDecoding
1.
Deprecated
use instead
2.
Deprecated
use instead
Breaking changes
Transcriber protocol
AudioProcessing
becomes
AudioStreamTranscriber
becomes
TextDecoding
becomes