-
## v0.3.0 openai.api_server fails for Mixtral-8x7B: FileNotFoundError
### Description
* vLLM v0.3.0 openai.api_server fails for Mixtral-8x7B: FileNotFoundError
* vLLM v0.2.7 openai.api_server w…
-
Microsoft has claimed that "Splitwise" is supported in vLLM; see
https://www.microsoft.com/en-us/research/blog/splitwise-improves-gpu-usage-by-splitting-llm-inference-phases/
![image](https://githu…
-
### Your current environment
```text
The output of `python collect_env.py`
```
### How would you like to use vllm
I want to run inference of a [specific model](put link here). I don't know how …
-
Hi, I am trying to evaluate the model RLHFlow/LLaMA3-iterative-DPO-final with MT-Bench. I use the inference environment described in the README and follow the scripts from https://github.com/lm-sys/FastChat/tree/ma…
-
I used [Skypilot docs](https://skypilot.readthedocs.io/en/latest/examples/docker-containers.html) and [Mistral docs](https://docs.mistral.ai/self-deployment/skypilot/) to create this YAML:
```
res…
-
## Description
Do you intend to add [Attention Sinks](https://github.com/huggingface/transformers/commit/633215ba58fe5114d8c8d32e415a04600e010701) streaming as an alternative to the current impleme…
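To make the request concrete, here is a minimal sketch of the attention-sink cache policy the linked commit is about: keep the first few KV-cache entries (the "sinks") plus a sliding window of the most recent entries, evicting everything in the middle. The function name and parameters below are illustrative, not vLLM's or transformers' actual API.

```python
def evict_kv_cache(cache, n_sink=4, window=8):
    """Attention-sink eviction sketch: retain the first `n_sink` entries
    (the attention sinks) plus the most recent `window` entries, and
    drop the middle so the cache stays bounded during streaming."""
    if len(cache) <= n_sink + window:
        return list(cache)  # nothing to evict yet
    return list(cache[:n_sink]) + list(cache[-window:])

# Token positions 0..19 with 4 sinks and a window of 8:
# positions 0-3 survive as sinks, plus the last 8 positions 12-19.
kept = evict_kv_cache(list(range(20)), n_sink=4, window=8)
```

The point of keeping the initial tokens is that streaming models attend heavily to them; evicting them degrades quality, while evicting only the middle keeps memory constant.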
-
Hello!
Thank you for releasing this extensive code base!
I was wondering: is there any way to avoid Ray when running some of the attacks, like PAIR, on a single node? (Ray is unusable on my end.)
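As a possible workaround, single-node fan-out can often be done with the standard library instead of Ray. The sketch below is an assumption about the workload shape: `run_attack` is a hypothetical stand-in for one attack iteration, not a function from this codebase.

```python
from concurrent.futures import ThreadPoolExecutor

def run_attack(prompt):
    # Hypothetical stand-in for a single attack run (e.g. one PAIR trial).
    return f"result-for-{prompt}"

def run_all(prompts, max_workers=4):
    # Ray-free single-node parallelism: a local thread pool maps the
    # prompts across workers and preserves input order in the results.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_attack, prompts))
```

If the per-task work is CPU-bound Python rather than I/O or GPU calls, `ProcessPoolExecutor` is the drop-in alternative.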
…
-
One feature that would be great is LangChain support for agents or chains. Even if it were only a LangServe RemoteRunnable, it would be awesome to be able to leverage LangChain agents, tools, etc.
-
As per the title, the completions API is invoked with max_tokens = 0, which, if properly interpreted by the server, will cause it not to generate anything (according to the [API documentation](https:/…
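To illustrate the expected semantics, here is a toy decode loop (not vLLM's actual implementation) showing how a server honoring `max_tokens = 0` would return an empty completion rather than erroring or generating unboundedly:

```python
def complete(prompt, max_tokens):
    """Toy completion loop: max_tokens=0 must yield an empty completion,
    not an error and not an unlimited generation."""
    if max_tokens < 0:
        raise ValueError("max_tokens must be >= 0")
    out = []
    for step in range(max_tokens):  # range(0) runs zero iterations
        out.append(f"tok{step}")    # stand-in for sampling one token
    return " ".join(out)

# complete("hello", 0) returns "" — the server generates nothing.
```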
-
We're having trouble running inference efficiently at scale. By default we process the audio parts one by one, but is there any support for batch inference to speed th…
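In the absence of built-in batching, one common pattern is to group the segments yourself and hand each group to a single model call. The sketch below assumes a hypothetical `transcribe_batch` callable that accepts a list of audio segments and returns one transcript per segment; it is not an API from any specific library.

```python
def chunked(items, batch_size):
    # Yield fixed-size batches; the final batch may be smaller.
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

def transcribe_all(segments, transcribe_batch, batch_size=8):
    # `transcribe_batch` is a hypothetical model call taking a list of
    # segments and returning a list of transcripts in the same order.
    results = []
    for batch in chunked(segments, batch_size):
        results.extend(transcribe_batch(batch))
    return results
```

Batching amortizes per-call overhead and lets the model pad and process several inputs in one forward pass, which is where most of the speedup at scale comes from.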