ggerganov closed this issue 10 months ago
What is a good way to check that this is working? I'd like a way to test it while implementing.
I think we need some documentation on how to use ggml, since its API is quite hard to understand. That way more people can get started quickly, just like with PyTorch. @ggerganov
I agree. Actually simple example programs would be even better as they are easier to maintain long term.
I just need to find the time...
When using beam search, we currently run the decoders sequentially:
https://github.com/ggerganov/whisper.cpp/blob/f1c9df58064e234b8bd5bd41a59530b675dd2ffe/whisper.cpp#L4416-L4444
This is multiple times slower than a batched evaluation. This inefficiency is the major factor preventing efficient use of beam search in `whisper.cpp`, and it often results in poor transcription quality.

Batched inference has already been demonstrated in `llama.cpp`:

https://github.com/ggerganov/llama.cpp/blob/bd34cdde38f8fd661890ddd5f57ca30bf279877b/examples/baby-llama/baby-llama.cpp#L768-L777

This can serve as a starting point for doing the same in `whisper.cpp` and achieving an efficient beam search implementation.