-
### Your current environment
The output of `python collect_env.py`
```text
Collecting environment information...
PyTorch version: 2.4.0+cu121
Is debug build: False
CUDA used to build PyTor…
-
### Your current environment
```text
PyTorch version: 2.2.1+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
OS: Ubuntu 22.04.3 LTS (x86_64)
GCC ve…
-
### Anything you want to discuss about vllm.
Currently we run the full CI test matrix on every single commit in pull requests. vLLM's CI cost has been doubling each week as we add more tests a…
-
### Feature request
Allow passing a 2D attention mask in `model.forward`.
### Motivation
With this feature, it would be much easier to avoid cross-context contamination during pretraining and super…
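To make the request concrete, here is a minimal sketch of the kind of 2D mask the feature would accept. `packed_attention_mask` is a hypothetical helper (not an existing transformers API): given per-token document ids for a packed sequence, it builds a block-diagonal causal mask so tokens from one document cannot attend to tokens from another.

```python
import numpy as np

def packed_attention_mask(doc_ids):
    """Hypothetical helper: build a 2D attention mask for a packed sequence.

    doc_ids[i] is the document token i belongs to. Token i may attend to
    token j only if both tokens are in the same document and j <= i
    (causal). Returns a boolean matrix of shape (n, n), suitable for
    passing where a 2D attention mask is accepted.
    """
    ids = np.asarray(doc_ids)
    n = len(ids)
    same_doc = ids[:, None] == ids[None, :]          # block-diagonal part
    causal = np.tril(np.ones((n, n), dtype=bool))    # lower-triangular part
    return same_doc & causal

# Two documents packed into one 5-token sequence: [A, A, B, B, B].
mask = packed_attention_mask([0, 0, 1, 1, 1])
# Token 2 (first token of doc B) cannot see tokens 0-1 of doc A,
# which is exactly the cross-context contamination being avoided.
```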
-
- [ ] [Guide to choosing quants and engines : r/LocalLLaMA](https://www.reddit.com/r/LocalLLaMA/comments/1anb2fz/comment/kprbduc/)
-
### Feature request
I would like to request [llama.cpp](https://github.com/ggerganov/llama.cpp) as a new model backend in the transformers library.
### Motivation
llama.cpp offers:
1) Exce…
-
The following program encodes that same ASCII string using a naive approach and using the actual `UTF8.encode()`. The naive approach is about 3× faster. Could UTF8 be optimized to provide better pe…
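The original program is truncated, so here is a rough Python analogue of the comparison being described: a naive byte-per-code-point encoder versus the runtime's built-in UTF-8 encoder, timed on a pure-ASCII string. (The 3× figure above is from the original runtime; in CPython the built-in encoder is the fast path, so the point here is the shape of the benchmark, not the ratio.)

```python
import timeit

ASCII_TEXT = "hello, world! " * 1000  # pure-ASCII input

def naive_encode(s):
    """Naive approach: assume ASCII and map each code point to one byte.

    Only valid for code points < 128; for ASCII input it produces the
    same bytes as a real UTF-8 encoder.
    """
    return bytes(ord(c) for c in s)

# Sanity check: for ASCII input both paths must agree byte-for-byte.
assert naive_encode(ASCII_TEXT) == ASCII_TEXT.encode("utf-8")

naive_t = timeit.timeit(lambda: naive_encode(ASCII_TEXT), number=200)
builtin_t = timeit.timeit(lambda: ASCII_TEXT.encode("utf-8"), number=200)
print(f"naive: {naive_t:.4f}s  builtin: {builtin_t:.4f}s")
```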
-
### Your current environment
```text
Collecting environment information...
PyTorch version: 2.3.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
…
-
1. Is speculative decoding faster than faster-whisper?
2. Is there going to be support anytime soon for speculative decoding in faster-whisper?
Both of these questions are asked with a purely…
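For context on question 1, here is a toy sketch of one greedy speculative-decoding round (hypothetical `draft_next`/`target_next` callables, not faster-whisper API). A cheap draft model proposes `k` tokens; the target model checks them and the longest agreeing prefix is accepted, plus one corrected token. The speedup comes from the target model verifying all `k` positions in a single batched forward pass rather than `k` sequential ones; this loop only models the acceptance logic.

```python
def speculative_step(draft_next, target_next, prefix, k=4):
    """One greedy speculative-decoding round (toy acceptance logic).

    draft_next / target_next map a token list to the next token.
    Returns the tokens accepted this round: the agreeing prefix of the
    draft proposal, then the target's correction at the first mismatch.
    """
    # Draft model proposes k tokens autoregressively (cheap).
    proposed, ctx = [], list(prefix)
    for _ in range(k):
        t = draft_next(ctx)
        proposed.append(t)
        ctx.append(t)

    # Target model verifies each proposed position (in a real system,
    # one batched forward pass over all k positions).
    accepted, ctx = [], list(prefix)
    for t in proposed:
        expected = target_next(ctx)
        if expected != t:
            accepted.append(expected)  # correct the mismatch and stop
            break
        accepted.append(t)
        ctx.append(t)
    return accepted

# Toy "models" over a fixed string: the draft agrees on 3 of 4 steps.
target = lambda ctx: "abcdefg"[len(ctx)]
draft = lambda ctx: "abcdeXg"[len(ctx)]
out = speculative_step(draft, target, list("ab"), k=4)
```

Whether this beats faster-whisper in practice depends on how often the draft model agrees with the target, which is exactly what the question is asking.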
-
### Describe the bug
[ModelCloud/internlm-2.5-7b-chat-gptq-4bit](https://huggingface.co/ModelCloud/internlm-2.5-7b-chat-gptq-4bit) and my code:
```python
from vllm import LLM, SamplingParams
# Sample prompts.
p…