guoqingbao opened 1 week ago
The Phi3 model was added in PR #45.
Command line to run the Phi3 3.8B chat service:
cargo run --release -- --port 2000 --weight-path /home/phi3-3.8b/ phi3 --repeat-last-n 64
It uses mixed precision (F32 for rope/rmsnorm, BF16 for the other layers) to keep long-sequence generation stable (e.g., prompts over 2k tokens). Tested decoding speed on an A100: 99 tokens/s.
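To see why the accumulation in rmsnorm is kept in F32 rather than BF16, here is a toy sketch (not the PR's code; Python floats stand in for F32, and BF16 is emulated by truncating a float32 to 7 mantissa bits). With a long input, a naive BF16 running sum stalls once the squared terms drop below one BF16 ulp of the accumulator, while accumulating in full precision and casting only the output stays accurate:

```python
import struct

def to_bf16(x: float) -> float:
    """Truncate to bfloat16 precision: keep sign, 8 exponent bits, 7 mantissa bits."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    (y,) = struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))
    return y

def rmsnorm_full(xs, eps=1e-5):
    # Reference: everything in full precision (stand-in for F32).
    rms = (sum(v * v for v in xs) / len(xs) + eps) ** 0.5
    return [v / rms for v in xs]

def rmsnorm_mixed(xs, eps=1e-5):
    # Mixed precision: accumulate in full precision, cast only the output to BF16.
    rms = (sum(v * v for v in xs) / len(xs) + eps) ** 0.5
    return [to_bf16(v / rms) for v in xs]

def rmsnorm_bf16(xs, eps=1e-5):
    # Naive BF16: every intermediate is truncated; the running sum stops
    # growing once v*v falls below one BF16 ulp of the accumulator.
    ss = 0.0
    for v in xs:
        ss = to_bf16(ss + to_bf16(v * v))
    rms = to_bf16((to_bf16(ss / len(xs)) + eps) ** 0.5)
    return [to_bf16(v / rms) for v in xs]
```

On a 4096-element input the mixed-precision output differs from the full-precision reference only by the final BF16 rounding, while the all-BF16 version underestimates the RMS and is off by a much larger margin.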
You may run Phi3 7B by pointing --weight-path at different weights, since the pipeline loads models using the corresponding config.json (I haven't tested Phi3 7B, but it should work in theory).
The Qwen2 model was added in PR #46.
Command line to run the Qwen2 1.8B chat service:
cargo run --release -- --port 2000 --weight-path /home/qwen2-1.8b/ qwen2 --repeat-last-n 64
or
cargo run --release -- --port 2000 --model-id Qwen/Qwen1.5-1.8B-Chat qwen2 --repeat-last-n 64
Tested decoding speed on an A100: ~150 tokens/s.
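Once either service is up on the chosen port, it can be queried over HTTP. A minimal Python sketch, assuming the server exposes an OpenAI-compatible /v1/chat/completions endpoint (the path, field names, and response shape below follow the OpenAI API and are assumptions, not taken from the PRs):

```python
import json
import urllib.request

def build_chat_request(prompt: str, model: str = "phi3", max_tokens: int = 128) -> dict:
    # Request body in OpenAI chat-completion form (field names are an
    # assumption; check the server's actual API).
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0.7,
    }

def chat(prompt: str, base_url: str = "http://localhost:2000", **kwargs) -> str:
    # POST the request to the running candle-vllm service and extract the
    # first completion (response shape assumed OpenAI-compatible).
    data = json.dumps(build_chat_request(prompt, **kwargs)).encode("utf-8")
    req = urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=data,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

For example, `chat("Hello", model="qwen2")` against a server started with the Qwen2 command above.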
Opening this issue to track the progress of models supported in candle-vllm.