ggerganov closed this issue 10 months ago
What is a good way to check that this is working? I'd like a way to test it while implementing.
I think we need some documentation on how to use ggml, since its API is quite hard to understand. That way more people can get started quickly, just like with PyTorch. @ggerganov
I agree. Actually simple example programs would be even better as they are easier to maintain long term.
I just need to find the time...
When using beam search, we currently run the decoders sequentially:
https://github.com/ggerganov/whisper.cpp/blob/f1c9df58064e234b8bd5bd41a59530b675dd2ffe/whisper.cpp#L4416-L4444
This is multiple times slower than a batched evaluation. This inefficiency is the major factor preventing efficient use of beam search in `whisper.cpp`, and it often results in poor transcription quality.

Batched inference has already been demonstrated in `llama.cpp`:

https://github.com/ggerganov/llama.cpp/blob/bd34cdde38f8fd661890ddd5f57ca30bf279877b/examples/baby-llama/baby-llama.cpp#L768-L777

This can serve as a starting point for doing the same in `whisper.cpp` and achieving an efficient beam search implementation.