-
### Your current environment
```text
The output of `python collect_env.py`
```
### How would you like to use vllm
Hi,
I want to attach a LoRA adapter using a docker command:
docker run --runtime nv…
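A minimal sketch of what such a command might look like, assuming the official `vllm/vllm-openai` image and a local adapter directory mounted into the container; the paths and the adapter name are placeholders, not values from the original post:

```shell
# Hypothetical example: mount a local LoRA adapter into the container
# and register it with the OpenAI-compatible server via --lora-modules.
docker run --runtime nvidia --gpus all \
  -v /path/to/base-model:/models/base \
  -v /path/to/my-lora:/loras/my-lora \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model /models/base \
  --enable-lora \
  --lora-modules my-lora=/loras/my-lora
```

Requests can then select the adapter by passing `"model": "my-lora"` in the request body.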
-
This is my launch command:
python -m vllm.entrypoints.openai.api_server --served-model-name Qwen2-VL-7B-Instruct --model /home/wangll/llm/model_download_demo/models/Qwen/Qwen2-VL-7B-Instruct
The error message is as follows:
INFO 09-03 1…
-
I notice that there is no `--lora-modules` argument in `vllm.entrypoints.api_server`, which means I must include the LoRA local path when sending a request.
That's unrealistic, because the client does…
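For reference, the OpenAI-compatible entrypoint (as opposed to the plain `vllm.entrypoints.api_server`) does accept `--lora-modules`, so the adapter can be registered server-side once and selected by name in each request. A sketch, with placeholder paths and names:

```shell
# Server side: register the adapter at startup under a stable name.
python -m vllm.entrypoints.openai.api_server \
  --model /path/to/base-model \
  --enable-lora \
  --lora-modules my-lora=/path/to/my-lora

# Client side: select the adapter by its registered name,
# no local path needed in the request.
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "my-lora", "prompt": "Hello", "max_tokens": 16}'
```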
-
The question is regarding - https://huggingface.co/blog/agents#self-correcting-retrieval-augmented-generation
The sources of docs in the vector index are ['blog', 'optimum', 'datasets-server', 'dat…
-
`torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 18.00 MiB. GPU 0 has a total capacty of 44.53 GiB of which 15.25 MiB is free. Including non-PyTorch memory, this process has 44.51 G…
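The numbers in the truncated traceback already explain the failure: the requested allocation is larger than the memory still free on the device, and the process itself already occupies almost the whole card. A quick sanity check of those figures:

```python
# Figures taken from the OOM traceback above.
requested_mib = 18.00   # size of the failed allocation (MiB)
free_mib = 15.25        # remaining free memory on GPU 0 (MiB)
total_gib = 44.53       # total capacity of GPU 0 (GiB)
in_use_gib = 44.51      # memory already held by this process (GiB)

# The request exceeds what is left, so the allocator must fail.
assert requested_mib > free_mib

# Nearly the entire card is occupied before the allocation is attempted.
used_fraction = in_use_gib / total_gib
print(f"GPU already {used_fraction:.2%} full")
```

In vLLM terms this usually calls for lowering `--gpu-memory-utilization`, reducing `--max-model-len`, or freeing other processes holding memory on the GPU.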
-
Hardware: RTX 4090 + i9-14900F
OS: Ubuntu 22.04
Environment: Python 3.8, vllm 0.5.5, vllm-flash-attn 2.6.1, transformers 4.45.0.dev0
Problem description: After creating a Python 3.8 environment with conda, I installed vllm via pip; the model weights were also downloaded locally. I then ran the launch command "python -m vllm.entrypoin…
-
RTX 4090 (24 GB),
Qwen-7B-Chat
loads OK:
```
model_config = ModelConfig(lora_infos={
"lora_1": conf['lora_1'],
"lora_2": conf['lora_2'],
})
model = ModelFactory.from_huggingface(conf['b…
-
**Describe the bug**
The process stops after loading the model into memory and processing the dataset. I also tried an…
-
Just FYI, I think this is failing because of a LoRA with only certain blocks trained:
```
File "flux-fp8-api/flux_pipeline.py", line 163, in load_lora
self.model = lora_loading.apply_lora_to_…
-
/kind feature
**Describe the solution you'd like**
There are different directions:
- extend existing API for referencing multiple …