guoqingbao opened 5 months ago
The Phi3 model is added in PR #45
Command line to run the Phi3 3.8B chat service:
cargo run --release -- --port 2000 --weight-path /home/phi3-3.8b/ phi3 --repeat-last-n 64
It uses mixed precision (F32 for rope/rmsnorm and BF16 for the other layers) for long-sequence generation (e.g., prompts over 2k tokens). Tested speed on A100: 99 tokens/s for decoding.
You may also run Phi3 7B by pointing --weight-path at its weights, since the pipeline loads models using the corresponding config.json (I haven't tested Phi3 7B, but it should work in theory).
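For reference, the mixed-precision trick above amounts to up-casting only the numerically sensitive ops (rope/rmsnorm) to F32 while the rest stays in BF16. A minimal candle-style sketch of the rmsnorm part, assuming the usual candle_core tensor API (this mirrors candle-nn's RmsNorm with an internal dtype, not necessarily the exact code in #45):

use candle_core::{D, DType, Device, Tensor};

// Up-cast activations to F32 for the RMS norm, then cast back to BF16;
// the surrounding matmuls stay in BF16.
fn rms_norm_mixed(x: &Tensor, weight: &Tensor, eps: f64) -> candle_core::Result<Tensor> {
    let x32 = x.to_dtype(DType::F32)?;
    let hidden = x32.dim(D::Minus1)? as f64;
    let norm = (x32.sqr()?.sum_keepdim(D::Minus1)? / hidden)?;
    let normed = x32.broadcast_div(&(norm + eps)?.sqrt()?)?;
    normed
        .broadcast_mul(&weight.to_dtype(DType::F32)?)?
        .to_dtype(DType::BF16)
}

fn main() -> candle_core::Result<()> {
    let dev = Device::Cpu;
    let x = Tensor::randn(0f32, 1f32, (1, 8, 64), &dev)?.to_dtype(DType::BF16)?;
    let w = Tensor::ones(64, DType::F32, &dev)?;
    println!("{:?}", rms_norm_mixed(&x, &w, 1e-5)?.dtype()); // BF16
    Ok(())
}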
The Qwen2 model is added in PR #46
Command line to run the Qwen2 1.8B chat service:
cargo run --release -- --port 2000 --weight-path /home/qwen2-1.8b/ qwen2 --repeat-last-n 64
or
cargo run --release -- --port 2000 --model-id Qwen/Qwen1.5-1.8B-Chat qwen2 --repeat-last-n 64
Tested speed on A100: ~150 tokens/s for decoding
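Once one of these services is up, it can be queried over HTTP. A sketch of a client request, assuming the server exposes an OpenAI-compatible /v1/chat/completions route on the chosen port (the model name and payload fields below are just placeholders):

// Cargo deps assumed: reqwest = { version = "0.11", features = ["blocking", "json"] }, serde_json = "1"
use serde_json::json;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let body = json!({
        "model": "qwen2",  // placeholder; the served model is chosen by the launch command
        "messages": [{ "role": "user", "content": "Hello!" }],
        "max_tokens": 64
    });
    let resp: serde_json::Value = reqwest::blocking::Client::new()
        .post("http://localhost:2000/v1/chat/completions")
        .json(&body)
        .send()?
        .json()?;
    println!("{resp}");
    Ok(())
}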
Mistral, Yi, and StableLM are supported in #53 and #57
Example commands:
cargo run --release -- --port 2000 --weight-path /home/mistral_7b/ mistral --repeat-last-n 32 --penalty 1.1 --temperature 0.8
cargo run --release -- --port 2000 --weight-path /home/yi-6b/ yi --repeat-last-n 32
cargo run --release -- --port 2000 --weight-path /home/stablelm-zephyr-3b/ stable-lm --repeat-last-n 32
LLaMa3/LLaMa3.1 are supported in #67
Tested case:
cargo run --release -- --port 2000 --weight-path /home/Meta-Llama-3.1-8B-Instruct/ llama3 --repeat-last-n 64
65 tokens/s on A100 (BF16).
We have added support for quantized models; refer to #77
@guoqingbao nice work with #77!
I'm planning to parallelize the model loading process, specifically for in-situ quantization. The current strategy of loading model weights layer by layer (adopted from candle) is unnecessarily sequential and inefficient; a rough sketch of the parallel approach follows.
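The sketch below uses rayon to process independent layers concurrently; load_layer_tensors and quantize_layer are stand-in stubs for whatever the real per-layer loading/quantization steps are, not candle-vllm functions:

use rayon::prelude::*;

// Stand-in types for one transformer layer's weights before/after quantization.
struct LayerWeights(Vec<f32>);
struct QuantizedLayer(Vec<u8>);

// Hypothetical per-layer steps; the real pipeline would read safetensors and
// run candle's quantization routines here.
fn load_layer_tensors(_layer_id: usize) -> LayerWeights {
    LayerWeights(vec![0.0; 16])
}

fn quantize_layer(w: LayerWeights) -> QuantizedLayer {
    QuantizedLayer(w.0.iter().map(|x| x.to_bits() as u8).collect())
}

// Core idea: layers are independent, so loading + in-situ quantization can run
// on a thread pool instead of strictly one layer after another.
fn load_and_quantize_parallel(num_layers: usize) -> Vec<QuantizedLayer> {
    (0..num_layers)
        .into_par_iter()
        .map(|i| quantize_layer(load_layer_tensors(i)))
        .collect()
}

fn main() {
    println!("quantized {} layers", load_and_quantize_parallel(32).len());
}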
Opening this issue to track the progress of models supported in candle-vllm.