EricLBuehler / candle-vllm

Efficient platform for inference and serving of local LLMs, including an OpenAI-compatible API server.
MIT License

Support stream response #43

Open · guoqingbao opened 5 days ago

guoqingbao commented 5 days ago

Opening this issue to track the progress of the stream response feature in candle-vllm.

guoqingbao commented 5 days ago

Current progress:

66 tokens/s on an A100 for LLaMA2 7B (BF16)

Note: there is a problem with the candle-vllm release build in certain environments (abnormal CPU usage in the Rust tokio runtime), which we are trying to fix. Please use the debug build for now.

(attached demo: candle-vllm-demo)

guoqingbao commented 3 days ago

The stream generation hang has been addressed; refer to #42. Candle-vllm can now generate 71 tokens/s per request on an A100 for LLaMA2 7B (BF16) in release mode, which is very close to vLLM (PyTorch backend, 72 tokens/s).
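Since candle-vllm serves an OpenAI-compatible API, a streamed completion should arrive as incremental server-sent-event (`data: ...`) chunks. The sketch below shows one way a client might consume such a stream; the route (`/v1/chat/completions`), host, port, and model name are illustrative assumptions and are not confirmed anywhere in this issue.

```rust
// Minimal client sketch for consuming a streamed chat completion from an
// OpenAI-compatible server. Assumed Cargo.toml dependencies:
//   reqwest = { version = "0.11", features = ["json", "stream"] }
//   tokio = { version = "1", features = ["full"] }
//   futures-util = "0.3"
//   serde_json = "1"
use futures_util::StreamExt;
use std::io::Write as _;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = reqwest::Client::new();
    let body = serde_json::json!({
        "model": "llama2-7b", // placeholder model name
        "messages": [{ "role": "user", "content": "Hello!" }],
        "stream": true        // ask the server for server-sent-event chunks
    });

    let resp = client
        .post("http://localhost:2000/v1/chat/completions") // assumed host/port/route
        .json(&body)
        .send()
        .await?;

    // The OpenAI streaming format sends `data: {...}` lines, each carrying a
    // partial "delta" of the generated text, terminated by `data: [DONE]`.
    // For brevity this sketch assumes each SSE line arrives within a single
    // chunk; a production client would buffer across chunk boundaries.
    let mut stream = resp.bytes_stream();
    while let Some(chunk) = stream.next().await {
        let chunk = chunk?;
        for line in String::from_utf8_lossy(&chunk).lines() {
            if let Some(payload) = line.strip_prefix("data: ") {
                if payload.trim() == "[DONE]" {
                    println!();
                    return Ok(());
                }
                if let Ok(v) = serde_json::from_str::<serde_json::Value>(payload) {
                    if let Some(delta) = v["choices"][0]["delta"]["content"].as_str() {
                        print!("{delta}");
                        std::io::stdout().flush()?; // show tokens as they stream in
                    }
                }
            }
        }
    }
    Ok(())
}
```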