Reduce redundant decoder forward passes by leveraging word-level timestamps

The goal is to use the high-quality word-level timestamps added in #38 as anchors, so that the audio buffer can be seeked forward reliably and more often than the current behavior allows.

Currently, the audio seeks forward only when <|endoftext|> is generated or when max_tokens tokens have been generated.
This wastes compute because every text token is re-decoded on each pass until the audio seeks beyond it.
In the worst case, with a 1-second audio refresh rate and Whisper's 30-second audio window, a token is decoded up to 29 redundant times.
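
To make the arithmetic and the proposed seeking concrete, here is a minimal Python sketch. It is not WhisperKit code: the `Word` type, the `redundant_passes` and `next_seek` helpers, and the `holdback_s` margin are all hypothetical names introduced for illustration. The sketch assumes that a word can be treated as a safe anchor once it ends at least `holdback_s` seconds before the end of the buffered audio.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    """Hypothetical word-level timestamp record (times in seconds)."""
    text: str
    start: float
    end: float


def redundant_passes(window_s: float, refresh_s: float) -> int:
    """Worst-case number of redundant decodes for a single token.

    A token spoken at the very start of the window is decoded on every
    refresh until the window is full, so it is decoded
    window_s / refresh_s times in total; all but the first are redundant.
    """
    return int(window_s // refresh_s) - 1


def next_seek(words: List[Word], current_seek: float,
              buffer_end: float, holdback_s: float = 1.0) -> float:
    """Pick a new seek position from word end timestamps.

    Assumption (not from the issue): only words ending at least
    holdback_s before the end of the buffer count as stable anchors.
    The seek position never moves backward.
    """
    stable_ends = [w.end for w in words if w.end <= buffer_end - holdback_s]
    return max([current_seek] + stable_ends)


words = [
    Word("hello", 0.0, 0.4),
    Word("world", 0.5, 0.9),
    Word("again", 2.6, 3.0),  # too close to the buffer end to trust yet
]
print(redundant_passes(30.0, 1.0))
print(next_seek(words, current_seek=0.0, buffer_end=3.0))
```

With anchors like these, the tokens for "hello" and "world" would not need to be decoded again on the next pass, rather than waiting for <|endoftext|> or max_tokens before the audio moves forward.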

argmaxinc / WhisperKit, issue #59