-
Hi, thanks for the great work!
What if I want to support a larger model, say one that exceeds a single GPU card's memory and therefore needs tensor parallelism (TP)? Is there a reason why qserve [doesn't support tp](https://github.com/mit-han-…
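For readers unfamiliar with TP, here is a minimal single-process sketch of what column-parallel sharding does (plain PyTorch, not qserve code; shapes are illustrative):

```python
import torch

# A linear layer's weight is split column-wise across two "devices",
# each shard computes a partial output, and the partial outputs are
# concatenated. In a real TP setup the shards live on different GPUs
# and the concatenation is an all-gather collective.
x = torch.randn(4, 512)        # batch of activations
w = torch.randn(512, 1024)     # full weight (too big for one card, in spirit)
w0, w1 = w.chunk(2, dim=1)     # column-parallel shards

y0 = x @ w0                    # would run on GPU 0
y1 = x @ w1                    # would run on GPU 1
y = torch.cat([y0, y1], dim=1) # "all-gather" along the hidden dim

assert torch.allclose(y, x @ w, atol=1e-4)
```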
-
People are curious about LLMs. It would be nice if we could go through the lifecycle that we expect other groups with large data corpora to go through. We have terabytes of GitHub data, the textual na…
-
Hi, can you please provide a guide or support for using local LLM models via Ollama, such as Llama 3.1 8B or 70B?
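For reference, Ollama exposes an OpenAI-compatible endpoint, so the standard `openai` client can talk to a local model. A minimal sketch, assuming `ollama pull llama3.1` has been run and the daemon is listening on its default port:

```python
from openai import OpenAI

# Point the client at Ollama's local OpenAI-compatible endpoint;
# the api_key is required by the client but ignored by Ollama.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

resp = client.chat.completions.create(
    model="llama3.1",  # or "llama3.1:70b" if you pulled the larger variant
    messages=[{"role": "user", "content": "Summarize tensor parallelism."}],
)
print(resp.choices[0].message.content)
```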
-
Currently, every GitHub project, and especially the ones that come under CNCF, uses independent processes for issue triage, bot replies, and so on. At a broad level, the following patterns arise where proj…
-
### Your current environment
docker with vllm/vllm-openai:v0.4.3 (latest)
### 🐛 Describe the bug
python3 -m vllm.entrypoints.openai.api_server --model ./Qwen1.5-72B-Chat/ --max-model-len 2400…
-
Is there any performance comparison data between ScaleLLM and vLLM?
-
Good job!
Hoping to see comparisons with different frameworks on some models, covering throughput, time to first token, and so on.
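For anyone wanting to run such a comparison themselves, here is a rough sketch that measures time-to-first-token and decode rate against any OpenAI-compatible server (the endpoint and model name are placeholders, and streamed chunks only approximate tokens):

```python
import time
from openai import OpenAI

# Works against vLLM, ScaleLLM, or any other OpenAI-compatible server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

start = time.perf_counter()
first_token_at = None
n_chunks = 0
stream = client.chat.completions.create(
    model="my-model",  # placeholder served model name
    messages=[{"role": "user", "content": "Explain KV caching."}],
    max_tokens=256,
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_at is None:
            first_token_at = time.perf_counter()
        n_chunks += 1
end = time.perf_counter()

if first_token_at is not None and n_chunks > 1:
    print(f"TTFT: {first_token_at - start:.3f}s")
    print(f"~{n_chunks / (end - first_token_at):.1f} chunks/s decode rate")
```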
-
Co-authored with @SolitaryThinker @Yard1 @rkooo567
We are landing multi-step scheduling (#7000) to amortize scheduling overhead for better inter-token latency (ITL) and throughput. Since the first version of multi-step…
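As a hedged illustration, in builds that include #7000 the feature is controlled by the `num_scheduler_steps` engine argument; assuming the `LLM` constructor forwards it to `EngineArgs`, usage looks roughly like:

```python
from vllm import LLM, SamplingParams

# num_scheduler_steps > 1 lets the engine run several decode steps per
# scheduler invocation, amortizing scheduling overhead (assumption: this
# keyword is forwarded to EngineArgs in builds that include #7000).
llm = LLM(model="facebook/opt-125m", num_scheduler_steps=8)
outputs = llm.generate(["The capital of France is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```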
-
### The vLLM docker image is
`intelanalytics/ipex-llm-serving-xpu-vllm-0.5.4-experimental:2.2.0b1`
### vLLM start command is
model="/llm/models/Qwen2-72B-Instruct/"
served_model_name="Qwen2-72B…
-
# Repo links
https://github.com/THUDM/ChatGLM-6B
https://github.com/mymusise/ChatGLM-Tuning
https://github.com/LianjiaTech/BELLE
## LLM quantization
https://zhuanlan.zhihu.com/p/616969812
- [SmoothQuant](htt…
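Since SmoothQuant is referenced above, a minimal sketch of its core scale-migration trick (alpha and shapes are illustrative, not taken from the paper's configurations):

```python
import torch

# SmoothQuant migrates quantization difficulty from activations to
# weights with a per-input-channel scale
#   s_j = max|X_j|**alpha / max|W_j|**(1 - alpha),
# then rewrites X' = X / s and W' = s * W so the layer output is unchanged.
alpha = 0.5
x = torch.randn(128, 512) * torch.rand(512) * 10  # activations with outlier channels
w = torch.randn(512, 1024)                        # weight (in_features x out_features)

act_max = x.abs().amax(dim=0)   # per-input-channel activation max
w_max = w.abs().amax(dim=1)     # per-input-channel weight max
s = act_max.pow(alpha) / w_max.pow(1 - alpha)

x_smooth = x / s                # outlier channels damped, easier to quantize
w_smooth = w * s.unsqueeze(1)   # scale folded into the weight offline

# The matrix product is preserved, so the layer's math is unchanged.
assert torch.allclose(x @ w, x_smooth @ w_smooth, rtol=1e-4, atol=1e-3)
```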