-
### Your current environment
```text
2024-05-07 01:43:26 (981 KB/s) - ‘collect_env.py’ saved [24877/24877]
Collecting environment information...
PyTorch version: 2.2.1+cu121
Is debug build: F…
-
Greetings, @cipher982!
Currently we are working on the OpenVINO inference framework, and such benchmarks are critical for understanding the gaps and differences between our framework and Transformers / TGI …
-
### 🚀 The feature, motivation and pitch
[Parallel/Jacobi decoding](https://arxiv.org/abs/2305.10427) improves inference efficiency by breaking the sequential nature of conventional auto-regressive …
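The core idea can be illustrated with a toy fixed-point iteration. This is a minimal sketch, not vLLM's or the paper's implementation: `next_token` is a hypothetical stand-in for a greedy language model (here just a running sum mod 10), and the point is only that all guessed positions update simultaneously per sweep instead of one token per model call.

```python
def next_token(prefix):
    # Hypothetical stand-in for an LM's greedy next-token choice:
    # deterministically maps a prefix to one token.
    return sum(prefix) % 10

def greedy_decode(prompt, n_new):
    """Conventional autoregressive decoding: n_new sequential calls."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(next_token(seq))
    return seq[len(prompt):]

def jacobi_decode(prompt, n_new, max_iters=50):
    """Decode n_new tokens in parallel by fixed-point (Jacobi) iteration.

    Start from arbitrary guesses; each sweep recomputes every new token
    from the current guess of its prefix, all positions at once. One
    sweep corresponds to one parallel model call. At the fixed point the
    result matches greedy autoregressive decoding.
    """
    guesses = [0] * n_new
    for _ in range(max_iters):
        seq = list(prompt) + guesses
        updated = [next_token(seq[:len(prompt) + i]) for i in range(n_new)]
        if updated == guesses:  # fixed point: no token changed
            return guesses
        guesses = updated
    return guesses
```

Because at least one more prefix token becomes correct per sweep, convergence is guaranteed within `n_new + 1` sweeps, and often far fewer — which is where the speedup over purely sequential decoding comes from.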
-
### Your current environment
3MIO:~/vllm$ python collect_env.py
Collecting environment information...
PyTorch version: 2.1.2+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM u…
-
### Your current environment
```text
Collecting environment information...
PyTorch version: 2.3.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
…
-
A question about Qwen:
Are the model files on ModelScope identical to the model files on Hugging Face?
Running the inference demo:
Qwen-VL# python web_demo_mm.py
produces the following message:
assert generation_config.chat_format == 'chatml', _ERROR_BAD_CHAT_FORMAT
AssertionError: We det…
-
Currently, multi-LoRA supports only the Llama and Mistral architectures. We should extend this functionality to all architectures.
The Yi, Qwen, Phi, and Mixtral architectures seem to be the most demanded r…
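For context, the mechanism being extended can be sketched as a linear layer carrying one shared base weight plus per-adapter low-rank deltas selected per request. This is an illustrative toy with NumPy, assuming a hypothetical `MultiLoRALinear` class; vLLM's real multi-LoRA path uses batched GPU kernels and is architecture-specific, which is exactly why each new architecture needs explicit support.

```python
import numpy as np

class MultiLoRALinear:
    """Toy multi-LoRA linear layer: y = W x + B_a (A_a x) for adapter a."""

    def __init__(self, d_in, d_out, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(size=(d_out, d_in))  # shared base weight
        self.adapters = {}                       # name -> (A, B)

    def add_adapter(self, name, A, B):
        # A: (rank, d_in) down-projection, B: (d_out, rank) up-projection.
        self.adapters[name] = (A, B)

    def forward(self, x, adapter=None):
        y = self.W @ x
        if adapter is not None:
            A, B = self.adapters[adapter]
            y = y + B @ (A @ x)  # low-rank delta, applied per request
        return y
```

The per-request `adapter` argument is the key design point: many adapters share one set of base weights in memory, so serving N fine-tunes costs roughly one base model plus N small (A, B) pairs.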
-
On my RX 6800 I get `RuntimeError: FlashAttention only supports AMD MI200 GPUs or newer.` for some reason. I Googled that GPU and it seems to be RDNA2 like mine, but for enterprise. Is this not…
-
### Your current environment
```text
Collecting environment information...
/data/miniconda3_new/envs/vllm/lib/python3.10/site-packages/transformers/utils/hub.py:124: FutureWarning: Using `TRANSFORM…
-
When I call the API '/v1/chat/completions' on the API Server to access the vllm_worker server, it returns incomplete results, but vLLM's own API returns complete results and the model_work server returns comp…