Const-me / Whisper

High-performance GPGPU inference of OpenAI's Whisper automatic speech recognition (ASR) model
Mozilla Public License 2.0

distil-large support request. #187

Open · AeneasZhu opened this issue 8 months ago

AeneasZhu commented 8 months ago

An improved variant of Whisper, Distil-Whisper, has been released on Hugging Face, and it appears to offer substantially faster inference. Could Whisper Desktop add support for it? They have also released the weights in GGML format. Here is the link: https://huggingface.co/distil-whisper/distil-large-v2

RickArcher108 commented 8 months ago

I don't understand the interface of that page; it's not clear to me how to download anything, whereas a page like https://huggingface.co/ggerganov/whisper.cpp/tree/main makes it easy. I use Whisper Desktop and Whisperer. What is the actual link for downloading the new GGML file?

emcodem commented 7 months ago

I am under the impression that it is just a matter of downloading and using the file

ggml-large-32-2.en.bin

I'm not sure, though; on the main page of the model @AeneasZhu linked we see this note:

Whisper.cpp Coming soon ...

EDIT: It looks like I was right and it is the file I linked above. Everything seems to be described here: https://github.com/ggerganov/whisper.cpp/pull/1424
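
If it helps, here is a minimal download sketch in Python. The Hugging Face /resolve/main/ URL pattern is standard, but whether the GGML file actually lives at exactly this path in the distil-whisper/distil-large-v2 repo is an assumption on my part:

```python
# Minimal sketch: fetch the GGML weights via the standard Hugging Face
# /resolve/main/ URL pattern. The exact path of the file is assumed.
import urllib.request

url = ("https://huggingface.co/distil-whisper/distil-large-v2"
       "/resolve/main/ggml-large-32-2.en.bin")
urllib.request.urlretrieve(url, "ggml-large-32-2.en.bin")
print("saved ggml-large-32-2.en.bin")
```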

Note that currently the model is English-only. They also mention degraded quality when the model is fed 30-second chunks (as whisper.cpp and Const-me do now); instead it should be fed 15-second chunks, and a fair amount of logic has to be added to the inference to work around problems at the segment borders.

Read about this here: https://github.com/huggingface/distil-whisper
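
To make the border problem concrete, here is a rough, hypothetical sketch of 15-second windowing with a small overlap. The `transcribe` argument is a placeholder for the model call, and the naive join at the end is exactly where the real merging logic would have to go; this is not the actual distil-whisper algorithm:

```python
# Illustrative only: chop 16 kHz mono audio into 15 s windows with a
# 1 s overlap and transcribe each window independently.
SAMPLE_RATE = 16_000
CHUNK_S, OVERLAP_S = 15, 1

def transcribe_chunks(samples, transcribe):
    step = (CHUNK_S - OVERLAP_S) * SAMPLE_RATE  # 14 s hop -> 1 s overlap
    size = CHUNK_S * SAMPLE_RATE
    texts = []
    for start in range(0, len(samples), step):
        chunk = samples[start:start + size]
        texts.append(transcribe(chunk))
    # Naive join; real logic must reconcile duplicated words in the overlaps.
    return " ".join(texts)
```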

The argument max_new_tokens controls the maximum number of generated tokens per-chunk. In the typical speech setting, we have no more than 3 words spoken per-second. Therefore, for a 15-second input, we have at most 45 words (approx 60 tokens). We set the maximum number of generated tokens per-chunk to 128 to truncate any possible hallucinations that occur at the end of the segment.
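
For reference, the distil-whisper README drives the model through the Transformers pipeline with exactly these limits; a condensed sketch (the audio file name is a placeholder):

```python
# Condensed from the distil-whisper README's Transformers example;
# "audio.mp3" is a placeholder input file.
import torch
from transformers import pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"

pipe = pipeline(
    "automatic-speech-recognition",
    model="distil-whisper/distil-large-v2",
    max_new_tokens=128,  # truncate possible end-of-segment hallucinations
    chunk_length_s=15,   # the window length the model is tuned for
    device=device,
)

print(pipe("audio.mp3")["text"])
```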