This can be done with logit filters on the first decoder loop, similar to language detection. However, it does not work when a prefill prompt (i.e. forced decoder tokens) is in use, so that case needs special handling. Ideally, there would be an option to ignore the prefill prompt for the first decoder loop in order to detect no speech. This costs one extra loop per window, but it may allow skipping the entire window, which pays off when developers expect long stretches of silence in their input audio.
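A minimal sketch of the proposed flow, in Python for brevity rather than WhisperKit's Swift. It assumes the no-speech probability is the softmax probability of a dedicated no-speech token on the first loop, and that the extra loop feeds only the start-of-transcript token so the prefill prompt is ignored. The `decode_step` callable, the function names, and the 0.6 threshold are illustrative assumptions, not WhisperKit API.

```python
import math
from typing import Callable, List, Optional, Sequence

# Hypothetical decoder interface (an assumption for this sketch): takes the
# decoder input tokens so far and returns one logit per vocabulary entry
# for the next position.
DecodeStep = Callable[[Sequence[int]], Sequence[float]]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a flat list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def no_speech_probability(
    decode_step: DecodeStep, sot_token: int, no_speech_token: int
) -> float:
    """Run the one extra loop with only the start-of-transcript token.

    The prefill prompt is deliberately left out here, so the logits are
    not constrained by forced decoder tokens.
    """
    logits = decode_step([sot_token])
    return softmax(logits)[no_speech_token]


def decoder_inputs_for_window(
    decode_step: DecodeStep,
    prefill_tokens: Sequence[int],
    sot_token: int,
    no_speech_token: int,
    threshold: float = 0.6,  # assumed default, tune per application
) -> Optional[List[int]]:
    """Return None to skip a silent window, else the prefill to decode from."""
    if no_speech_probability(decode_step, sot_token, no_speech_token) > threshold:
        return None  # likely silence: skip the whole window
    return list(prefill_tokens)  # speech detected: continue with the prefill
```

A caller would run `decoder_inputs_for_window` once per audio window and start the normal decoding loop only when it returns a token list. Gating the check behind an option keeps the extra loop from being paid on inputs that rarely contain silence.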
References
OpenAI implementation: https://github.com/openai/whisper/blob/ba3f3cd54b0e5b8ce1ab3de13e32122d0d5f98ab/whisper/decoding.py#L692-L693
WhisperKit inline todo: https://github.com/argmaxinc/WhisperKit/blob/228630c37e4ac1b1c95790d77f64058d317f8859/Sources/WhisperKit/Core/TextDecoder.swift#L497
https://github.com/argmaxinc/WhisperKit/blob/228630c37e4ac1b1c95790d77f64058d317f8859/Sources/WhisperKit/Core/WhisperKit.swift#L612-L616