TabbyML / tabby

Self-hosted AI coding assistant
https://tabby.tabbyml.com/

Implement speculative decoding #732

Closed wsxiaoys closed 3 months ago

wsxiaoys commented 10 months ago

Code location: https://github.com/TabbyML/tabby/blob/main/crates/llama-cpp-bindings/src/engine.cc
Reference: https://github.com/ggerganov/llama.cpp/blob/master/examples/speculative/speculative.cpp#L47

Implement speculative decoding to speed up certain models.

Squadrick commented 9 months ago

I've started working on this. I'll put up a draft CL with the implementation to make sure I have the logic right. I might need some help deciding what interface changes TextInferenceEngine will need to make this possible.

Also, right now the decoding is entirely greedy; should I continue to use greedy decoding for the speculative model as well?

wsxiaoys commented 9 months ago

Thanks for claiming the feature!

> Also, right now the decoding is entirely greedy; should I continue to use greedy decoding for the speculative model as well?

Yes, greedy decoding should be fine for now.