argmaxinc / WhisperKit

On-device Speech Recognition for Apple Silicon
http://argmaxinc.com/blog/whisperkit
MIT License

No Speech Detection #27

Open ZachNagengast opened 9 months ago

ZachNagengast commented 9 months ago

No-speech detection can be done with logit filters on the first decoder loop, similar to how language detection works. However, this does not work when a prefill prompt (i.e. forced decoder tokens) is in use, so that case will need special handling. Ideally, there would be an option to ignore the prefill prompt for the first decoder loop in order to detect no speech. This costs one extra loop, but may allow skipping the entire window, which is useful when developers expect long stretches of silence in their input audio.
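As a rough illustration of the idea above (not WhisperKit's or OpenAI's actual code), the sketch below runs one decoder loop with only the start-of-transcript token, ignoring any prefill prompt, and reads the probability assigned to the no-speech token. The names `decoder_step`, `sot_token_id`, `no_speech_token_id`, and the threshold value of 0.6 are all assumptions made for the example; a real decoder would return logits over the full vocabulary.

```python
import math
from typing import Callable, List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a 1-D sequence of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def no_speech_probability(
    decoder_step: Callable[[List[int]], Sequence[float]],  # hypothetical decoder interface
    sot_token_id: int,
    no_speech_token_id: int,
) -> float:
    """Run one decoder loop on the start-of-transcript token alone.

    Any prefill prompt (forced decoder tokens) is deliberately left out here,
    so this is the one extra loop mentioned in the issue.
    """
    logits = decoder_step([sot_token_id])
    return softmax(logits)[no_speech_token_id]


def should_skip_window(
    decoder_step: Callable[[List[int]], Sequence[float]],
    sot_token_id: int,
    no_speech_token_id: int,
    threshold: float = 0.6,  # illustrative value, an assumption for this sketch
) -> bool:
    """Return True when the window looks silent enough to skip decoding."""
    prob = no_speech_probability(decoder_step, sot_token_id, no_speech_token_id)
    return prob > threshold


# Toy decoders over a 4-token vocabulary, where token 3 plays the no-speech role.
def silent_decoder(tokens: List[int]) -> List[float]:
    return [0.0, 0.0, 0.0, 5.0]


def speech_decoder(tokens: List[int]) -> List[float]:
    return [5.0, 0.0, 0.0, 0.0]
```

With the toy decoders, `should_skip_window(silent_decoder, 0, 3)` is true and `should_skip_window(speech_decoder, 0, 3)` is false. If the window is not skipped, decoding would then proceed as usual with the prefill prompt in place.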

References

OpenAI implementation: https://github.com/openai/whisper/blob/ba3f3cd54b0e5b8ce1ab3de13e32122d0d5f98ab/whisper/decoding.py#L692-L693

WhisperKit inline todo: https://github.com/argmaxinc/WhisperKit/blob/228630c37e4ac1b1c95790d77f64058d317f8859/Sources/WhisperKit/Core/TextDecoder.swift#L497

https://github.com/argmaxinc/WhisperKit/blob/228630c37e4ac1b1c95790d77f64058d317f8859/Sources/WhisperKit/Core/WhisperKit.swift#L612-L616

aigerimmmm commented 5 months ago

Hi, can I work on this issue?

ZachNagengast commented 5 months ago

Absolutely! @aigerimmmm all yours