Infini-AI-Lab / Sequoia

scalable and robust tree-based speculative decoding algorithm

Support for vLLM? #11

Open KexinFeng opened 2 months ago

KexinFeng commented 2 months ago

Hi,

I remember that support for vLLM was on your TODO list. Have you implemented it yet? Was the main challenge in this direction that tree verification with batch size > 1 is hard to make efficient? Thanks!
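
For background, "tree verification" means scoring an entire speculated token tree in a single forward pass, using an attention mask that lets each node attend only to itself and its ancestors. Below is a minimal sketch of such a mask; the parent-array encoding and helper name are illustrative assumptions, not Sequoia's actual data structures:

```python
import torch

def tree_attention_mask(parent: list[int]) -> torch.Tensor:
    """Build a boolean attention mask for a token tree.

    parent[i] is the index of node i's parent (-1 for the root).
    Node i may attend to itself and to every ancestor, so the whole
    tree can be verified in a single forward pass.
    """
    n = len(parent)
    mask = torch.zeros(n, n, dtype=torch.bool)
    for i in range(n):
        j = i
        while j != -1:          # walk up to the root
            mask[i, j] = True
            j = parent[j]
    return mask

# A root with two children; the first child has one child of its own:
#        0
#       / \
#      1   2
#      |
#      3
print(tree_attention_mask([-1, 0, 0, 1]).int())
# tensor([[1, 0, 0, 0],
#         [1, 1, 0, 0],
#         [1, 0, 1, 0],
#         [1, 1, 0, 1]])
```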

dreaming-panda commented 2 months ago

Currently we have not added support for vLLM; we are working on building a tensor parallelism system first. With batch size > 1 we need to solve some additional problems, e.g. the number of accepted tokens can differ across requests in the same batch, and communication time is not accounted for in the current implementation. After we build the tensor parallelism system, we will make it compatible with vLLM or other inference engines. Thank you!
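
To illustrate the raggedness problem described above: after one verification step, each request in a batch may accept a different number of speculated tokens, so per-request cache lengths diverge. A minimal numeric sketch (all tensors and numbers are made up for illustration; this is not Sequoia's code):

```python
import torch

accepted = torch.tensor([5, 2, 4])   # accepted tokens per request this step
spec_len = 6                          # tokens speculated per request

# Per-request KV caches now end at different positions, so the batch can no
# longer advance in lockstep: either pad every request back to a common
# length (wasting compute on pad positions) or do per-request bookkeeping,
# which paged engines like vLLM are designed to manage.
kv_lens = torch.tensor([100, 100, 100]) + accepted
print(kv_lens)                        # tensor([105, 102, 104]) -- ragged
pad_to = kv_lens.max()
waste = (pad_to - kv_lens).sum()
print(waste)                          # tensor(4) padded positions this step
```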