ggerganov / llama.cpp

LLM inference in C/C++
MIT License

Implementation of speculative streaming #5620

Closed NickNickGo closed 5 months ago

NickNickGo commented 6 months ago

This might be of interest:

https://huggingface.co/papers/2402.11131

ggerganov commented 6 months ago

Yes, it is of interest. Tree-based decoding is already fully supported. The speculative streams and multi-stream attention layers should be possible to support, but I would need an actual model to test with. Not sure if they have released one yet.

github-actions[bot] commented 5 months ago

This issue was closed because it has been inactive for 14 days since being marked as stale.