argmaxinc / WhisperKit

On-device Speech Recognition for Apple Silicon
http://argmaxinc.com/blog/whisperkit
MIT License
3.92k stars, 331 forks

Reduce redundant decoder forward passes by leveraging word-level timestamps #59

Closed: atiorh closed this issue 7 months ago

atiorh commented 8 months ago

The goal is to use the high-quality word-level timestamps added in #38 as anchors so that the audio buffer can be reliably seeked forward more often than it is with the current behavior.
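As a rough illustration of the idea (not WhisperKit's actual implementation, which is written in Swift), the sketch below shows one way a decoding loop could pick its next seek position from word-level timestamps. Everything in it is an assumption made for illustration: the `WordTiming` type, the `next_seek_sample` function, the `margin_seconds` parameter, and the rule that words ending too close to the window boundary are not trusted as anchors. The 16 kHz sample rate and 30-second window are the standard Whisper input format.

```python
from dataclasses import dataclass
from typing import List

SAMPLE_RATE = 16_000    # Whisper models consume 16 kHz audio
WINDOW_SECONDS = 30.0   # Whisper decodes fixed 30-second windows


@dataclass
class WordTiming:
    """Hypothetical container for one word and its timing in the current window."""
    word: str
    start: float  # seconds, relative to the start of the current window
    end: float    # seconds, relative to the start of the current window


def next_seek_sample(seek_sample: int,
                     words: List[WordTiming],
                     margin_seconds: float = 1.0) -> int:
    """Return the sample index at which the next window should start.

    Assumption: a word ending within `margin_seconds` of the window boundary
    may be truncated, so it is not used as an anchor and gets re-decoded in
    the next window instead.
    """
    window_samples = int(WINDOW_SECONDS * SAMPLE_RATE)
    cutoff = WINDOW_SECONDS - margin_seconds

    # Keep only words that end comfortably inside the window.
    trusted = [w for w in words if w.end <= cutoff]
    if not trusted:
        # No usable anchor: fall back to skipping the whole window.
        return seek_sample + window_samples

    # Anchor on the end of the last trusted word.
    advance = int(round(max(w.end for w in trusted) * SAMPLE_RATE))
    if advance <= 0:
        # Guard against a zero-length advance, which would loop forever.
        return seek_sample + window_samples
    return seek_sample + advance


if __name__ == "__main__":
    words = [
        WordTiming("hello", 0.5, 1.0),
        WordTiming("world", 1.1, 1.6),
        WordTiming("cut", 29.4, 30.0),  # too close to the boundary to trust
    ]
    print(next_seek_sample(0, words))
```

In this sketch the last word is ignored because it ends at the window boundary, so the next window starts right after "world". The margin value is arbitrary and would need tuning against real audio.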

atiorh commented 8 months ago

References: [1] https://arxiv.org/abs/2005.11185 [2] https://arxiv.org/abs/2307.14743