-
### Proposal to improve performance
I am using vllm version 0.6.3.post1 with four 4090 GPUs to run inference on the qwen2-72B-chat-int4 model. The request speed is very fast for a single request, but the perf…
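For reference, a minimal sketch of how such a deployment is typically launched through vLLM's offline Python API; the model repository name and settings below are assumptions for illustration, not the reporter's exact command:

```python
# Hypothetical launch sketch for a 4-way tensor-parallel GPTQ int4 deployment.
# The model name and settings are illustrative assumptions, not the exact setup above.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2-72B-Instruct-GPTQ-Int4",  # assumed int4 GPTQ checkpoint
    tensor_parallel_size=4,                     # one shard per 4090
    quantization="gptq",
    gpu_memory_utilization=0.90,
)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Explain tensor parallelism in one paragraph."], params)
print(outputs[0].outputs[0].text)
```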
-
Currently, the model definition in trt-llm is mainly built manually through TensorRT's API or plugins. While this provides flexibility, an optional tracing-based (mainly ONNX) solution could enable s…
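For context, a minimal sketch of what a generic tracing-based flow looks like using plain PyTorch ONNX export; this is not TensorRT-LLM's builder API, and the toy module and shapes are made up for illustration:

```python
# Generic tracing-based export sketch: run a dummy input through the module,
# record the executed ops, and emit an ONNX graph that a downstream builder
# could lower to an optimized engine. Toy module, not a real trt-llm model.
import torch
import torch.nn as nn

class TinyBlock(nn.Module):
    """Toy stand-in for a model sub-module."""
    def __init__(self, dim: int = 64):
        super().__init__()
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        return x + self.ffn(x)  # residual feed-forward block

model = TinyBlock().eval()
dummy = torch.randn(1, 16, 64)  # (batch, seq_len, hidden)

torch.onnx.export(
    model,
    dummy,
    "tiny_block.onnx",
    input_names=["hidden_states"],
    output_names=["output"],
    dynamic_axes={"hidden_states": {0: "batch", 1: "seq_len"}},
)
```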
-
Explaining and demonstrating the use of the TPOT library, which can be used to find the best model with the best parameters for classification and regression tasks without much effort.
Please assign this…
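A minimal sketch of the kind of demonstration this asks for, using TPOT's classic scikit-learn-style interface (the dataset and search budget here are only for illustration):

```python
# Minimal TPOT sketch: evolutionary search for a classification pipeline.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from tpot import TPOTClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Tiny generations/population so the example finishes quickly; real runs use larger budgets.
tpot = TPOTClassifier(generations=5, population_size=20, random_state=42, verbosity=2)
tpot.fit(X_train, y_train)

print("Hold-out accuracy:", tpot.score(X_test, y_test))
tpot.export("best_pipeline.py")  # writes the best found pipeline as plain scikit-learn code
```

For regression tasks, `TPOTRegressor` follows the same pattern.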
-
I ran some tests to find better parameters to speed things up, and it appears that there hasn't been a significant change in TTFT (Time To First Token). Is my TTFT correct? I feel it might be a bit t…
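One way to sanity-check a reported TTFT is to time the first streamed chunk yourself; a rough sketch against an OpenAI-compatible endpoint (the URL, model name, and prompt below are placeholders, not your setup):

```python
# Rough TTFT sanity check: measure how long the first streamed chunk takes.
# The endpoint URL and model name are placeholders for illustration.
import time
import requests

url = "http://localhost:8000/v1/completions"
payload = {
    "model": "my-model",        # placeholder model name
    "prompt": "Hello, world!",
    "max_tokens": 64,
    "stream": True,
}

start = time.perf_counter()
ttft = None
with requests.post(url, json=payload, stream=True, timeout=120) as resp:
    for line in resp.iter_lines():
        if not line:
            continue
        # Server-sent events are prefixed with "data: "; stop at the first real chunk.
        if line.startswith(b"data: ") and line != b"data: [DONE]":
            ttft = time.perf_counter() - start
            break

print(f"Approximate TTFT: {ttft:.3f}s" if ttft else "No token received")
```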
-
### Report of performance regression
Using your benchmark
```
git clone https://github.com/vllm-project/vllm
cd vllm/benchmarks
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vi…
-
When I test with an Intel(R) Core(TM) Ultra 5 125H, why is the NPU so slow?
```
install the NPU driver following this: https://github.com/intel/linux-npu-driver/blob/main/docs/overview.md
pip install optim…
-
When I try to run it on Windows through Docker, it gives this error. However, I have updated the Python I'm running; it is currently version 3.11.4 and still presents this error. The docker…
-
### Your current environment
```
vllm 0.5.3.post1+gaudi117
```
Script with `tensor_parallel_size=1`:
```bash
export PT_HPU_ENABLE_LAZY_COLLECTIVES=true
export VLLM_GRAPH_…
-
### Your current environment
4xH100.
### Model Input Dumps
_No response_
### 🐛 Describe the bug
When benchmarking the performance of vllm with `benchmark_serving.py`, it will generate different…
-
Use the TPOT feature-selection strategy on tsflex-generated features.
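A rough sketch of how the two libraries could be wired together, assuming tsflex's `FeatureCollection`/`FeatureDescriptor` API on a datetime-indexed signal; the series name, window sizes, and labels are made up for illustration:

```python
# Sketch: compute tsflex features over sliding windows, then let TPOT search for a
# pipeline (including its built-in feature-selection steps) on top of them.
# Series name, windows, and labels are illustrative assumptions.
import numpy as np
import pandas as pd
from tsflex.features import FeatureCollection, FeatureDescriptor
from tpot import TPOTClassifier

# Synthetic datetime-indexed signal just so the sketch is runnable.
idx = pd.date_range("2024-01-01", periods=10_000, freq="100ms")
df = pd.DataFrame({"signal": np.random.randn(len(idx))}, index=idx)

fc = FeatureCollection(
    feature_descriptors=[
        FeatureDescriptor(function=np.mean, series_name="signal", window="30s", stride="10s"),
        FeatureDescriptor(function=np.std, series_name="signal", window="30s", stride="10s"),
    ]
)
features = fc.calculate(df, return_df=True).dropna()

# Illustrative (circular) labels; real labels come from the actual task.
y = (features.iloc[:, 0] > 0).astype(int).values

tpot = TPOTClassifier(generations=3, population_size=10, random_state=0, verbosity=2)
tpot.fit(features.values, y)
```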