-
### Your current environment
Referring to issue #5181, "The Offline Inference Embedding Example Fails": the method LLM.encode() works only for embedding models. Is there any way to get the ou…
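A minimal sketch of the embedding path the issue describes. The model name is an assumption (any embedding model vLLM supports would do), and running it requires a GPU-enabled vLLM install; the cosine helper is plain Python added for illustration.

```python
# Sketch: LLM.encode() works only when the loaded model is an embedding
# model. The model name below is an assumption for illustration.
import math

def cosine_similarity(a, b):
    """Pure-Python cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def embed(prompts):
    from vllm import LLM  # imported lazily: needs a GPU environment
    llm = LLM(model="intfloat/e5-mistral-7b-instruct")  # assumed model
    outputs = llm.encode(prompts)  # fails for non-embedding models
    return [o.outputs.embedding for o in outputs]

if __name__ == "__main__":
    vecs = embed(["first sentence", "second sentence"])
    print(cosine_similarity(vecs[0], vecs[1]))
```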
-
Currently I start the OpenAI API server with the command:
```
python3 -m vllm.entrypoints.openai.api_server --model Llama3-8B-Instruct --dtype auto --host 0.0.0.0 --port 8051 --gpu-memory-utilization 0.8 --en…
```
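A server started this way speaks the OpenAI-compatible API, so it can be queried over HTTP. A hedged stdlib-only sketch, assuming the host/port from the command above and that the model name passed to `--model` is what the request must reference:

```python
# Sketch: query the OpenAI-compatible endpoint exposed by the command above.
# Endpoint path and payload shape follow the OpenAI chat-completions API.
import json
import urllib.request

def build_chat_request(model, user_message, max_tokens=128):
    """Build the JSON body for a /v1/chat/completions request."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "max_tokens": max_tokens,
    }

def send(base_url, body):
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

if __name__ == "__main__":
    # Host and port taken from the launch command; the server must be running.
    body = build_chat_request("Llama3-8B-Instruct", "Hello!")
    print(send("http://0.0.0.0:8051", body))
```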
-
Hi team,
I would like to use the LogitsPostProcessor in the [C++ Executor API](https://github.com/NVIDIA/TensorRT-LLM/blob/main/cpp/include/tensorrt_llm/executor/executor.h) to control the generatio…
-
### Your current environment
PyTorch version: 2.3.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
OS: Ubuntu Jammy Jellyfish (development branch…
-
### Your current environment
Collecting environment information...
PyTorch version: 2.2.1+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
OS: Ubunt…
-
Hi,
I just came from axolotl and I'm impressed! I get 10x faster phi3-mini-4k, and I can run the adapter in vLLM (I couldn't with axolotl; vLLM says the lm_head module is not supported).
**Questi…
-
### Your current environment
Hello,
when the Python Wheel is installed according to your documentation:
https://docs.vllm.ai/en/latest/getting_started/installation.html#install-with-pip
The imag…
ch9hn, updated 3 months ago
-
### Your current environment
```
(vllm-gptq) root@k8s-master01:/workspace/home/lich/QuIP-for-all# pip3 list | grep aphrodite
aphrodite-engine 0.5.3 /workspace/home/lich/aphrodite-eng…
```
-
Some users may need to send batch requests with several prompt/schema pairs. This is possible with the vLLM server integration using `aiohttp`, and we should document it.
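A hedged sketch of what that documentation might show: firing one request per prompt/schema pair concurrently with `aiohttp`. The server URL, model name, and the `guided_json` parameter name are assumptions for illustration, not confirmed by this issue; the payload builder is pure Python.

```python
# Sketch: batch prompt/schema pairs against a vLLM OpenAI-compatible server.
# URL, model name, and `guided_json` field are assumptions for illustration.
import asyncio
import json

def build_batch(pairs, model):
    """Turn (prompt, schema) pairs into one request body each."""
    return [
        {"model": model, "prompt": prompt, "guided_json": schema}
        for prompt, schema in pairs
    ]

async def post_all(base_url, bodies):
    import aiohttp  # third-party; imported lazily so the builder stays pure
    async with aiohttp.ClientSession() as session:
        async def post(body):
            async with session.post(f"{base_url}/v1/completions",
                                    json=body) as r:
                return await r.json()
        # Send all requests concurrently and gather the JSON responses.
        return await asyncio.gather(*(post(b) for b in bodies))

if __name__ == "__main__":
    pairs = [("Name a city.", {"type": "object"}),
             ("Name a color.", {"type": "object"})]
    results = asyncio.run(post_all("http://localhost:8000",
                                   build_batch(pairs, "Llama3-8B-Instruct")))
    print(json.dumps(results, indent=2))
```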
rlouf, updated 4 months ago
-
I just found a recent blog (https://vllm.ai/) and repo (https://github.com/vllm-project/vllm) that implement paged attention. I tested it out and it provides massive throughput and memory efficiency improv…