-
### Your current environment
```text
python3 collect_env.py
Collecting environment information...
PyTorch version: 2.3.1+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM use…
```
-
### Motivation
KV cache hit rate is probably the biggest performance factor for me, and I recently read:
https://research.character.ai/optimizing-inference/
> To solve this problem, we deve…
-
Phi-3-medium-128k-instruct was quantized with AutoAWQ.
The quant config:
> quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM" }
nothing changed in the quantize.py…
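For reference, a minimal sketch of the AutoAWQ flow with that quant config (the model path and output directory here are assumptions; adapt to your local quantize.py):

```python
# Sketch of the AutoAWQ quantization flow described above. The model path and
# output directory are assumptions, not taken from the report.
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

def quantize_awq(model_path="microsoft/Phi-3-medium-128k-instruct",
                 out_dir="phi-3-medium-128k-awq"):
    # Imports kept inside the function: AutoAWQ needs a CUDA GPU at runtime.
    from awq import AutoAWQForCausalLM
    from transformers import AutoTokenizer

    model = AutoAWQForCausalLM.from_pretrained(model_path)
    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
    model.quantize(tokenizer, quant_config=quant_config)  # 4-bit, group size 128
    model.save_quantized(out_dir)
    tokenizer.save_pretrained(out_dir)
```

Calling `quantize_awq()` on a CUDA machine produces an AWQ checkpoint in `out_dir`.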
-
vLLM uses paged memory and has kernels that perform the generation part of causal inference.
The computation pattern of the generation part - a single Q against the entire sequence length of KV - is very different f…
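To illustrate the pattern, a toy NumPy sketch of one decode step (a single query vector attending over the full cached KV, unlike prefill where many queries attend at once):

```python
import numpy as np

# Toy decode-step attention: one query token attends over the whole cached
# prefix. Shapes are illustrative, not tied to any particular model.
rng = np.random.default_rng(0)
d, seq_len = 64, 128

q = rng.standard_normal(d)             # single query vector (current token)
K = rng.standard_normal((seq_len, d))  # cached keys for the whole prefix
V = rng.standard_normal((seq_len, d))  # cached values for the whole prefix

scores = K @ q / np.sqrt(d)            # (seq_len,) one score per cached token
weights = np.exp(scores - scores.max())
weights /= weights.sum()               # softmax over the prefix
out = weights @ V                      # (d,) a single output token
```

The key point is the shape asymmetry: a `(d,)` query against `(seq_len, d)` caches, versus the `(seq_len, d) x (seq_len, d)` pattern of prefill.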
-
**Describe the bug**
I'm using the OpenAIGenerator to access a vLLM endpoint on RunPod. When using a base model like Mistral v0.3, which has not been instruction tuned and so does not have a chat templ…
-
### System Info
Ubuntu 22.04
one NVIDIA A800
driver: 470.141.10
CUDA: 12.3
TensorRT: 9.2.0.5
### Who can help?
_No response_
### Information
- [X] The official example scripts
- …
-
Hi,
Appreciate the great work!
If I want to test the performance of other models, how should I do that?
E.g., to test Llama 3 405B, what data format should I pass to your interface?
Thanks!
-
### Your current environment
The output of `python collect_env.py`
```text
Your output of `python collect_env.py` here
```
### 🐛 Describe the bug
Hello,
On a container env I …
-
### Motivation
In online RL training, vLLM can significantly accelerate the rollout stage. To achieve this, we need to sync weights from the main training process to the vLLM worker process, and then call the e…
-
vLLM does not support AWQ-quantized models yet.
Please add one more parameter, e.g. `--quantization awq`.
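For illustration, serving an AWQ checkpoint might then look like this (the entrypoint module, flag name, and model repo here are assumptions, not confirmed vLLM behavior):

```shell
# Hypothetical invocation once AWQ support lands; flag and model are assumed.
python -m vllm.entrypoints.openai.api_server \
    --model TheBloke/Llama-2-7B-AWQ \
    --quantization awq
```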