argmaxinc / WhisperKit

On-device Inference of Whisper Speech Recognition Models for Apple Silicon
https://takeargmax.com/blog/whisperkit
MIT License
No Speech Detection #27

Open ZachNagengast opened 4 months ago

ZachNagengast commented 4 months ago

This can be done with a logit filter on the first decoder loop, similar to how language detection works. However, that approach does not work when a prefill prompt (i.e. forced decoder tokens) is in use, so that case will need special handling. Ideally, there would be an option to ignore the prefill prompt for the first decoder loop in order to detect no speech. This costs one extra loop, but may allow skipping the entire window when developers expect long stretches of silence in their input audio.
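As a rough illustration of the check described above, here is a minimal sketch in plain Swift. It assumes the caller already has the raw logits from the first decoder loop, run without the prefill prompt, and knows the vocabulary index of the no-speech token. The names `softmax`, `noSpeechProbability`, `shouldSkipWindow`, and `noSpeechTokenIndex` are hypothetical and not part of the WhisperKit API, and the 0.6 threshold is only an illustrative default.

```swift
import Foundation

/// Numerically stable softmax over raw logits.
func softmax(_ logits: [Float]) -> [Float] {
    guard let maxLogit = logits.max() else { return [] }
    // Subtract the max before exponentiating to avoid overflow.
    let exps = logits.map { exp($0 - maxLogit) }
    let sum = exps.reduce(0, +)
    return exps.map { $0 / sum }
}

/// Probability assigned to the no-speech token on the first decoder loop.
/// `noSpeechTokenIndex` is a hypothetical parameter: the caller is assumed
/// to look it up from the tokenizer. Returns nil if the index is out of range.
func noSpeechProbability(firstStepLogits: [Float], noSpeechTokenIndex: Int) -> Float? {
    guard firstStepLogits.indices.contains(noSpeechTokenIndex) else { return nil }
    return softmax(firstStepLogits)[noSpeechTokenIndex]
}

/// Decides whether the whole audio window could be skipped.
/// The default threshold is an assumption chosen for illustration only.
func shouldSkipWindow(
    firstStepLogits: [Float],
    noSpeechTokenIndex: Int,
    threshold: Float = 0.6
) -> Bool {
    guard let probability = noSpeechProbability(
        firstStepLogits: firstStepLogits,
        noSpeechTokenIndex: noSpeechTokenIndex
    ) else {
        // Without a valid probability, fall back to decoding the window.
        return false
    }
    return probability > threshold
}
```

The logits passed in would need to come from a decoder pass that is not constrained by forced tokens, which is why the prefill prompt case needs the extra loop mentioned above. The OpenAI reference implementation linked below additionally takes the average log probability of the decoded text into account before skipping a window, which this sketch leaves out.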

References

OpenAI implementation: https://github.com/openai/whisper/blob/ba3f3cd54b0e5b8ce1ab3de13e32122d0d5f98ab/whisper/decoding.py#L692-L693

WhisperKit inline TODOs:
https://github.com/argmaxinc/WhisperKit/blob/228630c37e4ac1b1c95790d77f64058d317f8859/Sources/WhisperKit/Core/TextDecoder.swift#L497
https://github.com/argmaxinc/WhisperKit/blob/228630c37e4ac1b1c95790d77f64058d317f8859/Sources/WhisperKit/Core/WhisperKit.swift#L612-L616

aigerimmmm commented 1 week ago

Hi, can I work on this issue?

ZachNagengast commented 1 week ago

Absolutely! @aigerimmmm all yours