-
### Your current environment
The output of `python collect_env.py`:
```
Collecting environment information...
PyTorch version: 2.6.0.dev20241008+cu124
Is debug build: False
CUDA used to build PyTorch:…
```
-
```
model = LLM(model=model_name, max_model_len=4096, trust_remote_code=True, gpu_memory_utilization=0.6, tensor_parallel_size=2)
  File "/lib/python3.10/site-packages/vllm/executor/multiproc…
```
-
### Anything you want to discuss about vllm.
flashinfer already supports sliding window attention (https://github.com/flashinfer-ai/flashinfer/issues/159), and we should update our code to pass the sliding wind…
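As a rough illustration of what passing the window through could look like, here is a hedged sketch; it assumes a flashinfer build whose `single_prefill_with_kv_cache` accepts a `window_left` argument (-1 meaning no window), which may differ in your installed version:

```python
# Hedged sketch: assumes single_prefill_with_kv_cache accepts a
# window_left argument (-1 = no sliding window) in your flashinfer build.
import torch
import flashinfer

seq_len, num_qo_heads, num_kv_heads, head_dim = 2048, 32, 8, 128
q = torch.randn(seq_len, num_qo_heads, head_dim, dtype=torch.float16, device="cuda")
k = torch.randn(seq_len, num_kv_heads, head_dim, dtype=torch.float16, device="cuda")
v = torch.randn(seq_len, num_kv_heads, head_dim, dtype=torch.float16, device="cuda")

# Each query attends only to the previous 1024 keys (sliding window).
out = flashinfer.single_prefill_with_kv_cache(q, k, v, causal=True, window_left=1024)
```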
-
Hi,
I have fine-tuned Qwen2-VL using LLaMA-Factory.
I successfully quantized the fine-tuned model as shown below:
```
from transformers import Qwen2VLProcessor
from auto_gptq import BaseQuantizeC…
```
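For reference, a complete sketch of a generic auto_gptq quantization flow; the model path and calibration text are placeholders, and Qwen2-VL's multimodal inputs may need handling beyond this text-only example:

```python
# Hedged sketch of a generic auto_gptq flow; the model path and
# calibration text are placeholders, and Qwen2-VL's vision inputs
# are not covered by this text-only example.
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

model_id = "path/to/finetuned-model"  # placeholder

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
quantize_config = BaseQuantizeConfig(bits=4, group_size=128, desc_act=False)

model = AutoGPTQForCausalLM.from_pretrained(
    model_id, quantize_config, trust_remote_code=True
)

# Calibration examples: tokenized text samples the quantizer runs through.
examples = [tokenizer("Example calibration sentence.", return_tensors="pt")]
model.quantize(examples)
model.save_quantized("path/to/quantized-model")
```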
-
### Describe the issue as clearly as possible:
When using vllm and outlines and running the server from a VM, the diskcache functionality does not seem to work correctly. Every time the server is …
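One possible workaround sketch: redirect outlines' on-disk cache to a path that is writable inside the VM; this assumes the installed outlines version reads the `OUTLINES_CACHE_DIR` environment variable:

```python
# Workaround sketch: point outlines' diskcache at a writable location.
# Assumes the installed outlines version honors OUTLINES_CACHE_DIR.
import os

os.environ["OUTLINES_CACHE_DIR"] = "/tmp/outlines_cache"  # placeholder path

import outlines  # import after setting the env var so the cache picks it up
```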
-
### 🚀 The feature, motivation and pitch
Currently we host vLLM wheels on AWS and ask users to install them via a long link:
`pip install https://vllm-wheels.s3.us-west-2.amazonaws.com/nightly/v…
-
### Your current environment
```
Name: vllm
Version: 0.6.3.post2.dev171+g890ca360
```
### Model Input Dumps
_No response_
### 🐛 Describe the bug
I used the interface from this vllm repository …
-
```
CUDA_VISIBLE_DEVICES=0,1 lm_eval --model vllm \
  --model_args pretrained=/home/jovyan/data-vol-1/models/meta-llama__Llama3.1-70B-Instruct,tensor_parallel_size=2,dtype=auto,gpu_memory_utilization=…
```
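The same evaluation can be driven from Python; here is a hedged sketch using lm-eval's `simple_evaluate` API (the task name and the truncated `gpu_memory_utilization` value are placeholders, and the exact signature can vary across lm-eval versions):

```python
# Hedged sketch of the equivalent lm-eval Python API call; the task and
# gpu_memory_utilization values are placeholders, not from the report.
import lm_eval

results = lm_eval.simple_evaluate(
    model="vllm",
    model_args=(
        "pretrained=/home/jovyan/data-vol-1/models/meta-llama__Llama3.1-70B-Instruct,"
        "tensor_parallel_size=2,dtype=auto,gpu_memory_utilization=0.8"
    ),
    tasks=["gsm8k"],  # placeholder task
)
print(results["results"])
```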
-
### Your current environment
The output of `python collect_env.py`
```text
Your output of `python collect_env.py` here
```
### Model Input Dumps
_No response_
### 🐛 Describe the bug
…
-
This issue describes the high-level directions for creating "LLM Engine V2". We want the design to be as transparent as possible, and we created this issue to track progress and solicit feedback.
Goal…