SYSTRAN / faster-whisper

Faster Whisper transcription with CTranslate2

FYI: FUTO did an ACFT finetune of whisper that works with <30s of audio #1006


thiswillbeyourgithub commented 2 months ago

Hi,

I just wanted to point out a model here that I found interesting and that, IMO, deserves to be better known.

Here's the relevant part of the README.md:

> The Whisper model is composed of two parts: the encoder, which takes in 30 seconds of audio, and the decoder, which outputs text.
>
> The main source of latency between the model receiving audio and starting to output text is running the encoder. On resource-constrained devices such as phones, this latency can be large, and it is important to minimize it in applications such as voice input.
>
> One reason the encoder can be so slow is that its input must always be 30 seconds long. Even if the speech is only 5 seconds long, 25 seconds of silence must be added, and the encoder "wastes" processing time on those 25 seconds of nothing.
>
> It would be great if we could skip adding silence and just have the encoder process whatever length of audio we have. In fact, we can, and this is what the audio_ctx parameter in whisper.cpp does, which was discussed in https://github.com/ggerganov/whisper.cpp/issues/137.
>
> Unfortunately, the model gets surprised by this and freaks out if you push the parameter too far. If you set it too low, the decoder usually does not know when to stop, and it will repeat itself forever.
>
> However, this issue can be mitigated by finetuning the model to tolerate a dynamic audio context. The next section proposes a way to do this.

Link: https://github.com/futo-org/whisper-acft
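
For a sense of the numbers involved, here is a rough sketch of the frame arithmetic behind the fixed 30-second window. The constants match the reference Whisper implementation (16 kHz audio, 10 ms mel hop, 2x convolutional downsampling), but the mapping to audio_ctx is my approximation, not whisper.cpp's exact code:

```python
# Sketch of the encoder-window arithmetic; constants match the
# reference Whisper implementation, but approx_audio_ctx() is an
# illustration, not whisper.cpp's exact trimming formula.
SAMPLE_RATE = 16_000   # Hz
HOP_LENGTH = 160       # samples per mel frame (10 ms)
CHUNK_SECONDS = 30     # fixed encoder window

N_FRAMES = CHUNK_SECONDS * SAMPLE_RATE // HOP_LENGTH  # 3000 mel frames
ENCODER_POSITIONS = N_FRAMES // 2                     # 1500 after 2x downsampling

def approx_audio_ctx(duration_s: float) -> int:
    """Roughly how many encoder positions a clip of this length needs."""
    frames = int(duration_s * SAMPLE_RATE / HOP_LENGTH)
    return min(ENCODER_POSITIONS, frames // 2)

needed = approx_audio_ctx(5.0)           # ~250 positions for a 5 s clip
wasted = 1 - needed / ENCODER_POSITIONS  # ~83% of the window is padding
print(needed, f"{wasted:.0%}")
```

So for a 5-second clip, roughly five sixths of the encoder pass is spent on silence, which is exactly the cost the finetune tries to avoid.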

This is primarily meant to be used on mobile phones, via their keyboard and voice apps. If I understood faster-whisper correctly, the two approaches could perhaps be combined in the future for even faster inference.
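
For context, this is what the faster-whisper side looks like today (standard usage; the model size and file name are placeholders). As far as I can tell, the library pads every chunk to the full 30-second window internally, so a dynamic-context encoder would have to hook in below this API:

```python
from faster_whisper import WhisperModel

# Standard faster-whisper usage; "small" and the file name are
# placeholders. Today each chunk is padded to the full 30 s window
# before the CTranslate2 encoder runs.
model = WhisperModel("small", device="cpu", compute_type="int8")

segments, info = model.transcribe("short_clip.wav")
for segment in segments:
    print(f"[{segment.start:.2f} -> {segment.end:.2f}] {segment.text}")
```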

Feel free to close this of course!