The goal is to use the high-quality word-level timestamps added in #38 as anchors, so that the audio buffer can be reliably seeked forward more often than it is today.
Currently, the audio seeks forward only when <|endoftext|> is generated or when max_tokens tokens have been generated.
This wastes compute, because every text token is re-decoded on each refresh until the audio seeks past it.
For Whisper, with a 1-second audio refresh rate and a 30-second audio window, a token can be redundantly decoded up to 29 times in the worst case.
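
The worst-case figure can be checked with a small sketch. It assumes that no seek happens while a token's audio is still inside the window, so the token is decoded once per refresh for the full window length; the function name and parameters are hypothetical and not part of this codebase.

```python
def worst_case_redundant_decodes(window_s: float, refresh_s: float) -> int:
    """Count the extra decodes of one token beyond its first decode.

    Assumption: the audio never seeks forward while the token's audio is
    still inside the window, so the token is decoded on every refresh.
    """
    if window_s <= 0 or refresh_s <= 0:
        raise ValueError("window_s and refresh_s must be positive")
    # Number of refreshes during which the token's audio stays in the window.
    passes = int(window_s // refresh_s)
    # The first pass is useful work; every later pass repeats it.
    return max(passes - 1, 0)


# Whisper's 30-second window refreshed once per second.
redundant = worst_case_redundant_decodes(window_s=30.0, refresh_s=1.0)
```

Seeking at word-level anchors shortens the time a token stays in the window, which lowers the number of passes and therefore the redundant work.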