guoqingbao opened 5 months ago
The Phi3 model is added in PR #45
Command line to run the Phi3 3.8B chat service:
cargo run --release -- --port 2000 --weight-path /home/phi3-3.8b/ phi3 --repeat-last-n 64
It uses mixed precision (F32 for rope/rmsnorm and BF16 for the other layers) for long-sequence generation (e.g., prompts over 2k tokens). Tested speed on A100: 99 tokens/s for decoding.
You may also run Phi3 7B by pointing --weight-path at its weights, since the pipeline loads models using the corresponding config.json (I haven't tested Phi3 7B, but it should work in theory).
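For reference, the mixed-precision trick above amounts to up-casting only the numerically sensitive ops (rope/rmsnorm) to F32 while the rest stays in BF16. A minimal candle-style sketch of the rmsnorm part, assuming the usual candle_core tensor API (this mirrors candle-nn's RmsNorm with an internal dtype, not necessarily the exact code in #45):

use candle_core::{D, DType, Device, Tensor};

// Up-cast activations to F32 for the RMS norm, then cast back to BF16;
// the surrounding matmuls stay in BF16.
fn rms_norm_mixed(x: &Tensor, weight: &Tensor, eps: f64) -> candle_core::Result<Tensor> {
    let x32 = x.to_dtype(DType::F32)?;
    let hidden = x32.dim(D::Minus1)? as f64;
    let norm = (x32.sqr()?.sum_keepdim(D::Minus1)? / hidden)?;
    let normed = x32.broadcast_div(&(norm + eps)?.sqrt()?)?;
    normed
        .broadcast_mul(&weight.to_dtype(DType::F32)?)?
        .to_dtype(DType::BF16)
}

fn main() -> candle_core::Result<()> {
    let dev = Device::Cpu;
    let x = Tensor::randn(0f32, 1f32, (1, 8, 64), &dev)?.to_dtype(DType::BF16)?;
    let w = Tensor::ones(64, DType::F32, &dev)?;
    println!("{:?}", rms_norm_mixed(&x, &w, 1e-5)?.dtype()); // BF16
    Ok(())
}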
The Qwen2 model is added in PR #46
Command line to run the Qwen2 1.8B chat service:
cargo run --release -- --port 2000 --weight-path /home/qwen2-1.8b/ qwen2 --repeat-last-n 64
or
cargo run --release -- --port 2000 --model-id Qwen/Qwen1.5-1.8B-Chat qwen2 --repeat-last-n 64
Tested speed on A100: ~150 tokens/s for decoding
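Once one of these services is up, it can be queried over HTTP. A sketch of a client request, assuming the server exposes an OpenAI-compatible /v1/chat/completions route on the chosen port (the model name and payload fields below are just placeholders):

// Cargo deps assumed: reqwest = { version = "0.11", features = ["blocking", "json"] }, serde_json = "1"
use serde_json::json;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let body = json!({
        "model": "qwen2",  // placeholder; the served model is chosen by the launch command
        "messages": [{ "role": "user", "content": "Hello!" }],
        "max_tokens": 64
    });
    let resp: serde_json::Value = reqwest::blocking::Client::new()
        .post("http://localhost:2000/v1/chat/completions")
        .json(&body)
        .send()?
        .json()?;
    println!("{resp}");
    Ok(())
}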
Mistral, Yi, and StableLM are supported in #53 and #57
Example commands:
cargo run --release -- --port 2000 --weight-path /home/mistral_7b/ mistral --repeat-last-n 32 --penalty 1.1 --temperature 0.8
cargo run --release -- --port 2000 --weight-path /home/yi-6b/ yi --repeat-last-n 32
cargo run --release -- --port 2000 --weight-path /home/stablelm-zephyr-3b/ stable-lm --repeat-last-n 32
LLaMa3/LLaMa3.1 are supported in #67
Tested case:
cargo run --release -- --port 2000 --weight-path /home/Meta-Llama-3.1-8B-Instruct/ llama3 --repeat-last-n 64
65 tokens/s on A100 (BF16).
We have added support for quantized models; refer to #77
@guoqingbao nice work with #77!
I'm planning to parallelize the model loading process, specifically for in-situ quantization. The current strategy of loading model weights layer by layer (adopted from candle) is unnecessarily sequential and inefficient; a rough sketch of the parallel approach follows.
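The sketch below uses rayon to process independent layers concurrently; load_layer_tensors and quantize_layer are stand-in stubs for whatever the real per-layer loading/quantization steps are, not candle-vllm functions:

use rayon::prelude::*;

// Stand-in types for one transformer layer's weights before/after quantization.
struct LayerWeights(Vec<f32>);
struct QuantizedLayer(Vec<u8>);

// Hypothetical per-layer steps; the real pipeline would read safetensors and
// run candle's quantization routines here.
fn load_layer_tensors(_layer_id: usize) -> LayerWeights {
    LayerWeights(vec![0.0; 16])
}

fn quantize_layer(w: LayerWeights) -> QuantizedLayer {
    QuantizedLayer(w.0.iter().map(|x| x.to_bits() as u8).collect())
}

// Core idea: layers are independent, so loading + in-situ quantization can run
// on a thread pool instead of strictly one layer after another.
fn load_and_quantize_parallel(num_layers: usize) -> Vec<QuantizedLayer> {
    (0..num_layers)
        .into_par_iter()
        .map(|i| quantize_layer(load_layer_tensors(i)))
        .collect()
}

fn main() {
    println!("quantized {} layers", load_and_quantize_parallel(32).len());
}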
Opening this issue to track the progress of models supported in candle-vllm.