guoqingbao opened 5 days ago
Current progress:
66 tokens/s on A100 for LLaMa2 7B (BF16)
Note: there is a problem with the candle-vllm release build in certain environments (abnormal CPU usage from the Rust tokio runtime); we are trying to fix it. Please use the debug build for now.
The stream generation hang has been addressed; refer to #42. Candle-vllm can now generate 71 tokens/s per request on A100 for LLaMa2 7B (BF16) in release mode, which is very close to vLLM (PyTorch backend, 72 tokens/s).
Opening this issue to track the progress of the stream response feature in candle-vllm.
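For context, the sketch below is not candle-vllm's actual implementation, just a minimal illustration of one way a stream response endpoint can forward tokens over SSE, assuming an axum + tokio server; `generate_tokens` is a hypothetical placeholder for the model's generation loop. Running generation on a blocking thread (rather than inside the async runtime) is one way to keep it from interfering with tokio's worker threads.

```rust
use axum::{
    response::sse::{Event, Sse},
    routing::post,
    Router,
};
use futures::stream::Stream;
use tokio::sync::mpsc;
use tokio_stream::{wrappers::ReceiverStream, StreamExt};

// Hypothetical stand-in for the model's token generation loop.
fn generate_tokens(prompt: &str) -> Vec<String> {
    prompt.split_whitespace().map(|w| w.to_string()).collect()
}

async fn stream_completion(
    body: String,
) -> Sse<impl Stream<Item = Result<Event, std::convert::Infallible>>> {
    // Bounded channel so a slow client applies backpressure to generation.
    let (tx, rx) = mpsc::channel::<String>(32);

    // Run generation off the async runtime so it cannot occupy tokio workers.
    tokio::task::spawn_blocking(move || {
        for token in generate_tokens(&body) {
            if tx.blocking_send(token).is_err() {
                break; // client disconnected; stop generating
            }
        }
    });

    // Each generated token is forwarded as one SSE event.
    let stream = ReceiverStream::new(rx)
        .map(|tok| Ok::<Event, std::convert::Infallible>(Event::default().data(tok)));
    Sse::new(stream)
}

#[tokio::main]
async fn main() {
    let app = Router::new().route("/v1/completions", post(stream_completion));
    let listener = tokio::net::TcpListener::bind("0.0.0.0:2000").await.unwrap();
    axum::serve(listener, app).await.unwrap();
}
```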