EricLBuehler / candle-vllm

Efficient platform for inference and serving local LLMs, including an OpenAI-compatible API server.

Support chat serving for more models #44

Open · guoqingbao opened this issue 1 week ago

guoqingbao commented 1 week ago

Opening this issue to track the progress of model support in candle-vllm.

guoqingbao commented 1 week ago

The Phi3 model was added in PR #45.

Command line to run the Phi3 3.8B chat service:

cargo run --release -- --port 2000 --weight-path /home/phi3-3.8b/ phi3 --repeat-last-n 64

The pipeline uses mixed precision (F32 for RoPE/RMSNorm, BF16 for the other ops) for long-sequence generation (e.g., prompts over 2k tokens). Tested decoding speed on an A100: 99 tokens/s.
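
For context, the upcast-for-normalization part of that pattern looks roughly like the sketch below, written against the candle API. This is only an illustration of the mixed-precision idea, not the actual candle-vllm code; the learned scale weight is omitted and eps is a placeholder.

```rust
use candle_core::{DType, Tensor, D};

/// RMSNorm computed in F32 even when the surrounding model runs in BF16.
/// Upcasting avoids precision loss on long prompts; the result is cast
/// back to BF16 for the rest of the layer. (Learned scale weight omitted.)
fn rms_norm_f32(x_bf16: &Tensor, eps: f64) -> candle_core::Result<Tensor> {
    let x = x_bf16.to_dtype(DType::F32)?;             // BF16 -> F32
    let variance = x.sqr()?.mean_keepdim(D::Minus1)?; // mean of squares over the hidden dim
    let normed = x.broadcast_div(&(variance + eps)?.sqrt()?)?;
    normed.to_dtype(DType::BF16)                      // back to BF16 for the following ops
}
```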

You may run Phi3 7B by pointing --weight-path at its weights, since the pipeline loads each model from its corresponding config.json (I haven't tested Phi3 7B, but it should work in theory).
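
Once the service is up, it can be queried like any OpenAI-compatible endpoint. Below is a minimal, unverified Rust sketch using reqwest (blocking + json features) and serde_json; the /v1/chat/completions path follows the standard OpenAI API, the port matches the --port flag above, and the "phi3" model name and prompt are placeholders.

```rust
use serde_json::json;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = reqwest::blocking::Client::new();
    // Standard OpenAI-style chat completion request; the model name is an assumption.
    let body = json!({
        "model": "phi3",
        "messages": [
            {"role": "user", "content": "Explain the borrow checker in one sentence."}
        ],
        "max_tokens": 128
    });
    let resp: serde_json::Value = client
        .post("http://localhost:2000/v1/chat/completions")
        .json(&body)
        .send()?
        .json()?;
    println!("{}", resp["choices"][0]["message"]["content"]);
    Ok(())
}
```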

guoqingbao commented 6 days ago

The Qwen2 model was added in PR #46.

Command line to run the Qwen2 1.8B chat service:

cargo run --release -- --port 2000 --weight-path /home/qwen2-1.8b/ qwen2 --repeat-last-n 64

or

cargo run --release -- --port 2000 --model-id Qwen/Qwen1.5-1.8B-Chat qwen2 --repeat-last-n 64

Tested decoding speed on an A100: ~150 tokens/s.
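
For completeness, here is a streaming sketch against the Qwen2 service, assuming the server implements the standard OpenAI SSE streaming protocol (stream: true with data:-prefixed chunks). The endpoint path and "qwen2" model name are assumptions; if streaming is not supported, drop the stream field and parse a single JSON response as in the Phi3 example above.

```rust
use std::io::{BufRead, BufReader};
use serde_json::{json, Value};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = reqwest::blocking::Client::new();
    let body = json!({
        "model": "qwen2", // placeholder model name
        "messages": [{"role": "user", "content": "Write a haiku about GPUs."}],
        "stream": true
    });
    let resp = client
        .post("http://localhost:2000/v1/chat/completions")
        .json(&body)
        .send()?;
    // OpenAI-style SSE: each frame is `data: {...}`, terminated by `data: [DONE]`.
    for line in BufReader::new(resp).lines() {
        let line = line?;
        let Some(payload) = line.strip_prefix("data: ") else { continue };
        if payload.trim() == "[DONE]" {
            break;
        }
        let chunk: Value = serde_json::from_str(payload)?;
        if let Some(delta) = chunk["choices"][0]["delta"]["content"].as_str() {
            print!("{delta}");
        }
    }
    println!();
    Ok(())
}
```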