-
### Your current environment
vLLM 0.6.1
### Model Input Dumps
CUDA_VISIBLE_DEVICES=7 python3 -m vllm.entrypoints.openai.api_server --port 8010 \
--served-model-name qwen2-7b \
--model /mn…
-
### Your current environment
```text
Collecting environment information...
PyTorch version: 2.3.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
…
-
I am trying to decode some SSTable files with the following two steps:
Indexing step:
hadoop jar hadoop-sstable-0.1.4.jar com.fullcontact.sstable.index.SSTableIndexIndexer /data//cassandra-data
Decodi…
-
#### Description:
I am experiencing an unexpected spike in GPU memory usage when loading the `Meta-Llama-3.1-8B-Instruct-AWQ-INT4` model using the vLLM framework. Initially, the GPU memory usage is…
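One common cause of an apparent spike is vLLM's KV-cache preallocation: at startup the engine reserves GPU memory up to `--gpu-memory-utilization` (0.9 by default), which shows up as a sudden jump right after the weights finish loading. Lowering that fraction is one way to test whether the spike is the preallocation rather than the quantized weights themselves; the path and values below are illustrative, not from the original report:

```shell
# Sketch: capping vLLM's GPU memory preallocation (illustrative values).
# vLLM reserves KV-cache blocks up to --gpu-memory-utilization at startup,
# so a lower fraction and a shorter --max-model-len shrink the "spike".
python3 -m vllm.entrypoints.openai.api_server \
    --model Meta-Llama-3.1-8B-Instruct-AWQ-INT4 \
    --quantization awq \
    --gpu-memory-utilization 0.6 \
    --max-model-len 8192
```

If memory usage still jumps with a small fraction, the growth is more likely in the weights or activation workspace than in the KV cache.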
-
1. Is speculative decoding faster than faster-whisper?
2. Is there going to be support anytime soon for speculative decoding in faster-whisper?
Both of these questions are asked with a purely…
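For context on question 1, the speedup claim rests on how speculative decoding works: a cheap draft model proposes several tokens, and the large target model verifies them in a single pass, so accepted tokens cost roughly one target forward per round instead of one each. A toy greedy sketch with deterministic stand-in "models" (not the faster-whisper or vLLM API):

```python
def speculative_decode(draft, target, prompt, k=4, max_new=8):
    """Toy greedy speculative decoding.

    draft/target are callables mapping a token sequence to the next
    token (stand-ins for a small and a large model).  The draft
    proposes k tokens; the target keeps the longest prefix it agrees
    with, then contributes one token of its own, so each round costs
    one "target pass" instead of one per generated token.
    """
    seq = list(prompt)
    while len(seq) - len(prompt) < max_new:
        # 1. Draft model proposes k tokens autoregressively (cheap).
        ctx = list(seq)
        proposal = []
        for _ in range(k):
            tok = draft(ctx)
            proposal.append(tok)
            ctx.append(tok)
        # 2. Target verifies: accept the longest agreeing prefix.
        accepted = []
        ctx = list(seq)
        for tok in proposal:
            if target(ctx) != tok:
                break
            accepted.append(tok)
            ctx.append(tok)
        # 3. Target always emits one token itself (a correction when
        #    the draft was wrong, a bonus token when it was right).
        accepted.append(target(seq + accepted))
        seq.extend(accepted)
    return seq[len(prompt):len(prompt) + max_new]
```

The output is identical to decoding with the target alone; only the number of target-model calls changes, which is where the speedup comes from.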
-
I want to develop some features based on SGLang to improve the performance of srt.
1. A new scheduler for ControllerMulti that can more accurately identify the resource utilization of each instance a…
-
### 🚀 The feature, motivation and pitch
FlexAttention was proposed as a performant attention implementation leveraging `torch.compile` with easy APIs for adding support for complex attention varian…
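FlexAttention's central abstraction is a user-supplied `score_mod` callable applied to each attention logit before the softmax, so variants like causal masking, ALiBi, or sliding windows each become a few lines. The shape of that hook can be sketched in plain Python (a toy, not the actual `torch.nn.attention.flex_attention` API):

```python
import math

def attention(q, k, v, score_mod):
    """Single-head attention over Python lists.  score_mod(score,
    q_idx, kv_idx) edits each logit before softmax, mirroring the
    shape of FlexAttention's score_mod hook (toy code, not torch)."""
    out = []
    for i, qi in enumerate(q):
        scores = [score_mod(sum(a * b for a, b in zip(qi, kj)), i, j)
                  for j, kj in enumerate(k)]
        m = max(scores)                       # stabilize the softmax
        w = [math.exp(s - m) for s in scores]
        z = sum(w)
        out.append([sum(w[j] * v[j][d] for j in range(len(v))) / z
                    for d in range(len(v[0]))])
    return out

def causal(score, q_idx, kv_idx):
    """Causal masking as a score_mod: future positions get -inf."""
    return score if kv_idx <= q_idx else float("-inf")
```

Swapping in a sliding-window or relative-bias `score_mod` changes only that one function, which is exactly the flexibility the pitch refers to; the real implementation compiles the hook into a fused kernel via `torch.compile`.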
-
### System Info
Transformers version: 4.42.0
Python version: 3.10.14
### Who can help?
@sanchit-gandhi
### Information
- [ ] The official example scripts
- [X] My own modified scripts
### …
-
Dear members of the Ray team,
I am working with DRL algorithms using rllib. I am configuring and testing multiple experiments using the Tune API (tune.run()) as well as the different implemented DR…
-
### Your current environment
vLLM version: v0.6.0 (CPU)
CPU: AMD EPYC 9654
### 🐛 Describe the bug
The vLLM v0.6.0 (CPU) server failed to start when VLLM_CPU_OMP_THREADS_BIND was set, as shown below:
…
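For reference, the CPU backend expects `VLLM_CPU_OMP_THREADS_BIND` to be a CPU core list for pinning the OpenMP worker threads. A minimal launch sketch; the core range and cache size are illustrative choices for a many-core EPYC machine, and the model path is a placeholder:

```shell
# Sketch: binding the CPU backend's OpenMP threads to cores 0-31
# (illustrative range; prefer physical cores on a single NUMA node).
export VLLM_CPU_KVCACHE_SPACE=40          # KV-cache size in GiB
export VLLM_CPU_OMP_THREADS_BIND=0-31     # cores for the OpenMP threads
python3 -m vllm.entrypoints.openai.api_server --model <model-path>
```

If the server starts without the binding but fails with it, the failure is likely in how the core list is parsed or applied, which narrows the bug down.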