Speculative decoding support with Eager streaming mode

argmaxinc / WhisperKit

On-device Speech Recognition for Apple Silicon

https://takeargmax.com/blog/whisperkit

MIT License


Speculative decoding support with Eager streaming mode #102

Open atiorh opened 5 months ago

atiorh commented 5 months ago

The Eager streaming mode implies that we predict the same token at least twice. This is a great opportunity to design a speculative decoding technique that can leverage a fast draft model* and amortize the redundant predictions while accelerating the overall pipeline.

*Draft: distil-large-v3; Oracle: large-v3. They share the same AudioEncoder; only the TextDecoders differ.
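
To make the proposal concrete, here is a minimal sketch of greedy speculative decoding in Python. It is not WhisperKit code and does not use any WhisperKit API: `speculative_decode`, `greedy_decode`, `toy_draft`, and `toy_oracle` are hypothetical names, and each "model" is assumed to be a plain function that maps a token prefix to its greedy next token. In a real decoder the verification step would be a single batched forward pass of the oracle TextDecoder over all drafted positions; the loop below only emulates that pass so the example stays runnable.

```python
from typing import Callable, List, Sequence, Tuple

# Assumption: a "model" is any function from a token prefix to its greedy next token.
NextToken = Callable[[Sequence[int]], int]


def greedy_decode(model: NextToken, prefix: Sequence[int], max_new_tokens: int) -> List[int]:
    """Plain autoregressive greedy decoding, used as the reference result."""
    tokens = list(prefix)
    for _ in range(max_new_tokens):
        tokens.append(model(tokens))
    return tokens


def speculative_decode(
    draft: NextToken,
    oracle: NextToken,
    prefix: Sequence[int],
    max_new_tokens: int,
    k: int = 4,
) -> Tuple[List[int], int]:
    """Return (tokens, rounds), where rounds counts oracle verification passes."""
    tokens = list(prefix)
    target_len = len(tokens) + max_new_tokens
    rounds = 0
    while len(tokens) < target_len:
        # 1. The fast draft model proposes up to k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(min(k, target_len - len(tokens))):
            proposal.append(draft(tokens + proposal))

        # 2. The oracle checks every drafted position (one batched pass in practice).
        rounds += 1
        accepted: List[int] = []
        for i, drafted in enumerate(proposal):
            expected = oracle(tokens + proposal[:i])
            if expected != drafted:
                # First mismatch: keep the oracle's token and discard the rest.
                accepted.append(expected)
                break
            accepted.append(drafted)
        tokens.extend(accepted)
    return tokens, rounds


# Toy stand-ins for the two decoders (purely illustrative, not real models).
def toy_oracle(prefix: Sequence[int]) -> int:
    return (31 * sum(prefix) + len(prefix)) % 50


def toy_draft(prefix: Sequence[int]) -> int:
    guess = toy_oracle(prefix)
    # Disagree with the oracle at every fifth position to exercise rejection.
    return (guess + 1) % 50 if len(prefix) % 5 == 4 else guess


if __name__ == "__main__":
    out, rounds = speculative_decode(toy_draft, toy_oracle, [1, 2, 3], max_new_tokens=12)
    assert out == greedy_decode(toy_oracle, [1, 2, 3], 12)
    print(f"{len(out) - 3} tokens in {rounds} oracle passes")
```

Because every accepted token is either confirmed or replaced by the oracle, the output matches what the oracle would produce with plain greedy decoding; the saving comes from needing fewer oracle passes than tokens whenever the draft agrees. The sketch omits end-of-text handling, sampling, and KV-cache reuse, all of which a real integration would need.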