-
**What would you like to be added**:
LeaderWorkerSet should support heterogeneous resource requirements across Workers.
**Why is this needed**:
In the use case of disaggregated serving there m…
-
We sincerely appreciate your constructive feedback on our paper; we will update the revised version to address your concerns. Our responses are below.
Q1: Comparis…
-
### Your current environment
docker image: vllm/vllm-openai:0.4.2
Model: https://huggingface.co/alpindale/c4ai-command-r-plus-GPTQ
GPUs: RTX8000 * 2
### 🐛 Describe the bug
The model works f…
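For reference, a minimal sketch of how the model can be loaded under this setup with vLLM's offline `LLM` API (the prompt and sampling values are arbitrary placeholders; `tensor_parallel_size=2` matches the two RTX8000s):
```python
# Minimal sketch: load the GPTQ checkpoint across both GPUs.
# The prompt and sampling values are arbitrary placeholders.
from vllm import LLM, SamplingParams

llm = LLM(
    model="alpindale/c4ai-command-r-plus-GPTQ",
    quantization="gptq",        # GPTQ-quantized weights
    tensor_parallel_size=2,     # shard across the two RTX8000s
)
params = SamplingParams(temperature=0.7, max_tokens=64)
print(llm.generate(["Hello!"], params)[0].outputs[0].text)
```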
-
I get a coredump when decoding from multiple threads. It crashes in the Rust function `tokenizers_decode` (rust/src/lib.rs:199); here is the core backtrace.
Why doesn't it support multi-threading? I think dec…
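For illustration, the access pattern that triggers this looks roughly like the sketch below, expressed with the Python bindings for concreteness (the model name and inputs are placeholders; the actual crash is in the Rust `tokenizers_decode` entry point):
```python
# Sketch of the concurrent-decode pattern; model and inputs are placeholders.
from concurrent.futures import ThreadPoolExecutor
from tokenizers import Tokenizer

tok = Tokenizer.from_pretrained("bert-base-uncased")
ids = tok.encode("hello world").ids

def worker(_):
    # Every thread decodes through the same shared tokenizer instance.
    return tok.decode(ids)

with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(worker, range(1000)))
print(results[0])
```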
-
[X] I have checked the [documentation](https://docs.ragas.io/) and related resources and couldn't resolve my bug.
**Describe the bug**
I am unable to create a test dataset using Ollama models; it…
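For reference, the wiring looks roughly like this sketch (assuming ragas 0.1.x's `TestsetGenerator.from_langchain` and the `langchain_community` Ollama wrappers; the model tags and document loader are placeholders):
```python
# Sketch: generate a ragas test set backed by Ollama models.
# Assumes ragas 0.1.x and langchain_community; placeholders throughout.
from langchain_community.chat_models import ChatOllama
from langchain_community.document_loaders import TextLoader
from langchain_community.embeddings import OllamaEmbeddings
from ragas.testset.generator import TestsetGenerator

generator_llm = ChatOllama(model="llama3")   # writes the questions
critic_llm = ChatOllama(model="llama3")      # filters/critiques them
embeddings = OllamaEmbeddings(model="llama3")

documents = TextLoader("my_docs.txt").load()
generator = TestsetGenerator.from_langchain(generator_llm, critic_llm, embeddings)
testset = generator.generate_with_langchain_docs(documents, test_size=10)
print(testset.to_pandas())
```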
-
### Your current environment
```text
The output of `python collect_env.py`
```
### 🐛 Describe the bug
Recently, we have seen reports of `AsyncEngineDeadError`, including:
- [ ] #5060
…
-
Envoy supports sending the full request body to the external authorization server via the with_request_body filter configuration. Do you think it would be possible to expose such a feature on the Securit…
-
Hi there, I've been following this work for a few months and find it a really amazing idea to run LLMs over the Internet. I'm also trying to improve Petals' model-inference performance in…
-
This is on an M3 MacBook Pro.
1. I'm following the guide. I already had Ollama set up and running, serving a llama3 variant that I tested; it's listening in the first terminal window.
2. I configured …
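In case it helps with debugging, a quick sanity check against the server from step 1 can be done like this (a sketch; it assumes Ollama's default port 11434 and a `llama3` model tag, so adjust to your variant):
```python
# Sketch: confirm the local Ollama server answers before configuring anything else.
# Assumes the default port 11434 and a "llama3" model tag.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3", "prompt": "Say hi", "stream": False},
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["response"])
```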
-
Hey,
Currently, Ollama saves models locally in a cache. To maintain different versions of LLMs or fine-tuned ones, and also for extensive monitoring, it would be good to provide integration with M…