-
### Your current environment
```text
The output of `python collect_env.py`
```
### How would you like to use vllm
I tried deploying `qwen2-vl-7b` using vLLM with the following commands:
```bash
VLLM_WORK…
```
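For reference, a minimal launch for this model might look like the sketch below. The flags and values shown are illustrative assumptions, not a reconstruction of the truncated command above:

```bash
# A minimal sketch: serve Qwen2-VL-7B through vLLM's OpenAI-compatible
# server. The context length and dtype values here are illustrative.
vllm serve Qwen/Qwen2-VL-7B-Instruct \
  --max-model-len 8192 \
  --dtype bfloat16
```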
-
CMake is not successful.
```
❯ cmake --version
cmake version 3.21.0
CMake suite maintained and supported by Kitware (kitware.com/cmake).
```
```
mkdir build
cd build
cmake -DCMAKE_INSTA…
```
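For comparison, a standard out-of-source configure/build/install sequence on CMake 3.21 looks like the sketch below; the install prefix is a placeholder, since the original `-DCMAKE_INSTA…` flag is truncated above:

```bash
# Minimal out-of-source CMake workflow; run from the project root.
# The install prefix is a placeholder, not the reporter's actual path.
cmake -S . -B build -DCMAKE_INSTALL_PREFIX="$HOME/.local"
cmake --build build --parallel
cmake --install build
```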
-
Hello! I use this simulator for LLM serving, but when I run the following command:
```shell
python3 -u main.py --model_name 'gpt3-6.7b' --npu_num 1 --npu_group 1 --npu_mem 24 --dataset 'dataset/share-gp…
```
-
## Description
`ignore_eos_token` is a commonly used additional parameter that helps standardize LLM benchmarks by forcing requests to generate a consistent output sequence length.
- Will this change the c…
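For context, a fixed-length benchmark request might look like the sketch below. The endpoint and field names follow vLLM's OpenAI-compatible server, where this option is spelled `ignore_eos`; the model name and token count are only examples:

```bash
# Sketch of a fixed-length benchmark request: generate exactly 256 tokens
# by ignoring EOS. Field names follow vLLM's OpenAI-compatible API; the
# model name is an example.
curl -s http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Llama-2-7b-chat-hf",
        "prompt": "Benchmark prompt",
        "max_tokens": 256,
        "ignore_eos": true
      }'
```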
-
I can't seem to get this extension to work with LM Studio. I've successfully used my server with other software, so I know the server works.
I have CORS enabled. I'm serving on the local network. I'v…
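As a sanity check, the server can be probed directly from another machine on the network. The sketch below assumes LM Studio's default OpenAI-compatible endpoint on port 1234; the host IP and Origin value are placeholders:

```bash
# Verify the server responds and that CORS headers come back.
# The IP address and Origin are placeholders for this particular setup.
curl -si http://192.168.1.100:1234/v1/models \
  -H "Origin: http://example.local" | grep -iE '^(HTTP|access-control)'
```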
-
Hello,
Similarly to #3, I've tried reproducing the `demo.py` benchmark on an H100 and an A6000 and I'm also seeing no speedup on these platforms at lower precisions.
It was mentioned this is du…
-
## Description
Model artifacts are in the (TRT-LLM) LMI model format:
```
aws s3 ls ***
                           PRE 1/
2024-10-25 14:59:…
```
-
## Description
djl-serving version: djl-inference:0.26.0-tensorrtllm0.7.1
models:
- meta-llama/Llama-2-7b-chat, see https://huggingface.co/meta-llama/Llama-2-7b-chat (used for this report)
- meta-lla…
-
/kind feature
**Describe the solution you'd like**
To autoscale LLM inference services, Knative's request-level metrics may not be the best scaling metrics, as LLM inference is performed at the toke…
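For comparison, the current request-based configuration looks like the sketch below; the annotation keys are Knative's standard autoscaling annotations, while the service name, image, and target value are illustrative. A token-level metric would slot in where `concurrency` appears:

```bash
# Today's Knative KPA scales on request concurrency; a token-level metric
# would replace "concurrency" below. Names, image, and target are examples.
kubectl apply -f - <<'EOF'
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: llm-inference
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/metric: "concurrency"
        autoscaling.knative.dev/target: "10"
    spec:
      containers:
        - image: example.com/llm-server:latest
EOF
```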
-
### System Info
- It worked when following https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/multimodal/README.md to run BLIP2-T5 XXL on a single A100 GPU
- However, I have only an A30 for servin…