-
### Proposal to improve performance
I am using vllm version 0.6.3.post1 with four RTX 4090 GPUs to run inference on the qwen2-72B-chat-int4 model. A single request is served very quickly, but the perf…
-
### Report of performance regression
Using your benchmark
```
git clone https://github.com/vllm-project/vllm
cd vllm/benchmarks
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vi…
```
-
When I test with an Intel(R) Core(TM) Ultra 5 125H, why is the NPU so slow?
```
# install the NPU driver following this guide:
# https://github.com/intel/linux-npu-driver/blob/main/docs/overview.md
pip install optim…
```
-
### Your current environment
```
vllm 0.5.3.post1+gaudi117
```
Script with tensor_parallel_size=1:
```text
export PT_HPU_ENABLE_LAZY_COLLECTIVES=true
export VLLM_GRAPH_…
```
-
### Your current environment
4xH100.
### Model Input Dumps
_No response_
### 🐛 Describe the bug
When benchmarking the performance of vllm with `benchmark_serving.py`, it will generate different…
-
Explain and demonstrate the use of the TPOT library, which can find the best model with the best parameters for classification and regression tasks without much effort.
Please assign this…
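To make the idea concrete: TPOT automates the search over models and hyperparameters using genetic programming. As a rough conceptual sketch only (not TPOT's actual API or search space), the same "find the best pipeline" idea can be shown with a plain scikit-learn grid search; the parameter grid and dataset below are illustrative.

```python
# Conceptual stand-in for what TPOT automates: searching over pipeline
# hyperparameters and keeping the best-scoring configuration.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
# Illustrative grid; TPOT would instead evolve pipelines automatically.
grid = {"clf__C": [0.1, 1.0, 10.0]}
search = GridSearchCV(pipe, grid, cv=3)
search.fit(X_train, y_train)

print("best params:", search.best_params_)
print("test accuracy:", round(search.score(X_test, y_test), 3))
```

TPOT itself wraps this whole loop behind a single estimator (`TPOTClassifier` / `TPOTRegressor` with a `fit`/`score` interface), so the user never writes the search code by hand.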
-
When I try to run it on Windows through Docker, it gives this error. I have already updated Python to version 3.11.4, but the error persists. The docker…
-
### Anything you want to discuss about vllm.
I am profiling TTFT and TPOT on my machine. I could not explain the behavior of TTFT, so I opened this issue to seek advice.
Below figure shows the …
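For context, a minimal sketch of how these two metrics are commonly defined in LLM serving benchmarks: TTFT is the delay from request start to the first generated token, and TPOT is the mean inter-token latency over the remaining tokens. `token_times` below is a hypothetical list of absolute arrival times for each generated token.

```python
def ttft_and_tpot(request_start: float, token_times: list[float]) -> tuple[float, float]:
    """Compute TTFT and TPOT from per-token arrival timestamps (seconds)."""
    # Time To First Token: first arrival minus request start.
    ttft = token_times[0] - request_start
    if len(token_times) > 1:
        # Time Per Output Token: average gap between consecutive tokens
        # after the first one.
        tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    else:
        tpot = 0.0
    return ttft, tpot

ttft, tpot = ttft_and_tpot(0.0, [0.25, 0.30, 0.35, 0.40])
print(ttft, tpot)  # TTFT = 0.25 s, TPOT ≈ 0.05 s
```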
-
Use the TPOT feature-selection strategy on tsflex-generated features.
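A rough sketch of that pipeline, with stand-ins on both ends: the windowed features below are computed with pandas (standing in for the feature table tsflex would produce), and a scikit-learn `SelectKBest` step stands in for the feature-selection stage TPOT would choose. Column names, window sizes, and labels are all illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
signal = pd.Series(rng.normal(size=300))

# Window-based features; tsflex would normally emit a table like this.
feats = pd.DataFrame({
    "mean_w30": signal.rolling(30).mean(),
    "std_w30": signal.rolling(30).std(),
    "min_w30": signal.rolling(30).min(),
    "max_w30": signal.rolling(30).max(),
}).dropna()

# Toy labels that depend on one feature, so selection has a signal to find.
y = (feats["std_w30"] > feats["std_w30"].median()).astype(int)

# Stand-in for the selection stage of an AutoML pipeline.
selector = SelectKBest(f_classif, k=2).fit(feats, y)
selected = feats.columns[selector.get_support()].tolist()
print("selected features:", selected)
```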
-
I ran some tests to find better parameters to speed things up, and there hasn't been a significant change in TTFT (Time To First Token). Is my TTFT correct? I feel it might be a bit t…
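One way to sanity-check a reported TTFT is to measure it independently against a streaming endpoint: record the wall-clock gap between issuing the request and receiving the first streamed token. The sketch below uses a fake generator in place of a real streaming client (e.g. an OpenAI-compatible client with `stream=True`); the sleep durations are purely illustrative.

```python
import time

def measure_ttft(stream) -> float:
    """Wall-clock seconds from request start to the first streamed token."""
    start = time.perf_counter()
    for _token in stream:  # the first iteration completes when token 1 arrives
        return time.perf_counter() - start
    return float("inf")  # stream produced no tokens

def fake_stream():
    # Stand-in for a real streaming response; sleeps emulate generation latency.
    time.sleep(0.05)
    yield "Hello"
    time.sleep(0.01)
    yield " world"

ttft = measure_ttft(fake_stream())
print(f"TTFT ≈ {ttft * 1000:.1f} ms")
```

If this independent measurement roughly matches the benchmark's number, the TTFT figure itself is trustworthy and the investigation can move on to the serving parameters.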