-
Hi, I want to use vLLM during evaluation, but when I set --vllm it shows an OOM error. My GPU is an A6000 and the model under evaluation is 7B. I can evaluate my model on mt-benchmark with vLLM. …
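A common way to work around this kind of OOM with vLLM is to reduce the fraction of GPU memory it pre-allocates and to cap the context length so the KV cache stays small. The sketch below uses the standard `vllm.LLM` constructor; the checkpoint name and the specific values are placeholders to adapt, not a fix verified against this harness's `--vllm` flag.
```python
from vllm import LLM, SamplingParams

# Sketch only: placeholder 7B checkpoint and conservative memory settings.
llm = LLM(
    model="meta-llama/Llama-2-7b-hf",   # hypothetical model under evaluation
    dtype="half",                        # fp16 weights instead of fp32
    gpu_memory_utilization=0.85,         # pre-allocate less of the A6000's VRAM
    max_model_len=4096,                  # cap sequence length -> smaller KV cache
)

outputs = llm.generate(
    ["What is the capital of France?"],
    SamplingParams(temperature=0.0, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```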
-
Chu has merged inference code for models quantized with QuIP# into vLLM (https://github.com/chu-tianxiang/vllm-gptq), but the inference code currently only supports tensor_parallel_size=1. The reason is "Ha…
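For reference, loading such a checkpoint in that fork presumably looks like ordinary vLLM usage with a `quantization` argument; the value `"quip"` below is an assumption about the fork's naming, and the model path is a placeholder.
```python
from vllm import LLM

# Assumption: the vllm-gptq fork selects the QuIP# kernels via quantization="quip";
# check the fork's README for the actual string. Single GPU only for now.
llm = LLM(
    model="path/to/quip-sharp-quantized-7b",  # hypothetical checkpoint
    quantization="quip",
    tensor_parallel_size=1,
)
```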
-
### System Info
Target: x86_64-unknown-linux-gnu
Cargo version: 1.75.0
Commit sha: N/A
Docker label: N/A
nvidia-smi:
```
+-----------------------------------------------------------------…
-
Opening this issue to track the progress of model support in candle-vllm.
-
Running vLLM according to the instructions. Docker segfaults at startup, so I'm running directly on the machine.
Starting the server with the following shell script. As you can see, I've tried to turn max…
-
Are there docs on best practices for using vLLM-hosted models?
I start the server with
python -m vllm.entrypoints.openai.api_server --model model_path
and try running it as
lm_eval --model lo…
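One alternative worth noting: lm-evaluation-harness also ships a `vllm` backend that loads the model in-process, so no HTTP server is needed at all. A minimal sketch, assuming lm-eval >= 0.4 and that the backend accepts these `model_args` keys:
```python
import lm_eval

# In-process vLLM backend; the model path and model_args keys are assumptions
# to adapt, not a verified recipe for the setup described above.
results = lm_eval.simple_evaluate(
    model="vllm",
    model_args="pretrained=model_path,dtype=half,gpu_memory_utilization=0.8",
    tasks=["hellaswag"],
    batch_size="auto",
)
print(results["results"])
```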
-
I want to try DSPy with a local LLM served by vLLM. I followed the instructions at https://dspy-docs.vercel.app/docs/deep-dive/language_model_clients/local_models/HFClientVLLM. The model was down…
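For context, the setup described in those docs boils down to pointing `HFClientVLLM` at a running vLLM server. A minimal sketch, assuming a DSPy version that still ships `dspy.HFClientVLLM` and a server already listening on localhost:8000:
```python
import dspy

# Hypothetical model name and default server location; adjust to your deployment.
vllm_lm = dspy.HFClientVLLM(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    port=8000,
    url="http://localhost",
)
dspy.settings.configure(lm=vllm_lm)

# Smoke test: one completion through the configured client.
print(vllm_lm("Say hello in one short sentence."))
```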
-
### Motivation
Speculative decoding can speed up generation by more than 2x. This degree of speedup is an important feature for a production-grade LM deployment library, and it seems the methods are s…
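To make the source of the speedup concrete, here is a toy, model-free sketch of the greedy draft-and-verify idea: a cheap draft model proposes k tokens, the target model checks them (in a real engine this verification is one batched forward pass), and the longest agreeing prefix is accepted. Real implementations accept or reject tokens probabilistically; the two lambdas below are dummy stand-ins, not actual models.
```python
from typing import Callable, List

def speculative_step(prefix: List[int],
                     draft_next: Callable[[List[int]], int],
                     target_next: Callable[[List[int]], int],
                     k: int = 4) -> List[int]:
    # 1) Draft model proposes k tokens autoregressively (cheap).
    draft, ctx = [], list(prefix)
    for _ in range(k):
        t = draft_next(ctx)
        draft.append(t)
        ctx.append(t)

    # 2) Target model verifies the k positions: accept until the first mismatch
    #    (correcting it), or emit one bonus token if everything was accepted.
    accepted, ctx = [], list(prefix)
    for t in draft:
        expected = target_next(ctx)
        if expected != t:
            accepted.append(expected)
            break
        accepted.append(t)
        ctx.append(t)
    else:
        accepted.append(target_next(ctx))

    return prefix + accepted

# Dummy "models": the draft agrees with the target except at every 5th position.
target = lambda ctx: (len(ctx) * 7) % 100
draft = lambda ctx: target(ctx) if len(ctx) % 5 else (target(ctx) + 1) % 100
print(speculative_step([1, 2, 3], draft, target, k=4))  # -> [1, 2, 3, 21, 28, 35]
```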
-
### 🥰 Feature Description
Can vLLM be supported?
### 🧐 Proposed Solution
Can vLLM be supported?
### 📝 Additional Information
_No response_
-
Currently, vLLM's `vllm.worker.worker.Worker` is replaced with `openrlhf.trainer.ray.vllm_worker_wrap.WorkerWrap` on the fly as a monkey patch.
The monkey patch could be avoided by making `init_process_gro…
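For readers unfamiliar with the pattern, this is the general shape of such a patch (illustrative only, not OpenRLHF's actual code; the two import paths are taken from the issue text):
```python
# Swap the class attribute on the module before any vLLM engine is constructed,
# so every worker Ray spawns is an instance of WorkerWrap instead of Worker.
import vllm.worker.worker
from openrlhf.trainer.ray.vllm_worker_wrap import WorkerWrap

vllm.worker.worker.Worker = WorkerWrap
```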