-
Currently, vLLM's `vllm.worker.worker.Worker` is replaced on the fly with `openrlhf.trainer.ray.vllm_worker_wrap.WorkerWrap` as a monkey patch.
The monkey patch is avoidable by making `init_process_gro…
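For reference, a minimal sketch of how this kind of on-the-fly class swap looks; the actual OpenRLHF patch differs in details, this only illustrates the mechanism:

```python
# Sketch of the on-the-fly replacement (not the exact OpenRLHF code):
# the worker class is swapped at import time so that vLLM instantiates
# WorkerWrap instead of its own Worker.
import vllm.worker.worker
from openrlhf.trainer.ray.vllm_worker_wrap import WorkerWrap

# After this assignment, any code path that resolves vllm.worker.worker.Worker
# (e.g. vLLM's worker spawning) gets the wrapped class instead.
vllm.worker.worker.Worker = WorkerWrap
```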
-
Hi, I have tried to load the Phi-3 Medium (128k) model, but it fails with the current version of vLLM here. Is this a version update issue? When I try the Phi-3 Mini 128k, it at least tries …
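For context, a rough sketch of the kind of load I am attempting; the Hugging Face model id and flags below are my assumptions, not taken from a specific script:

```python
# Rough reproduction sketch (model id and flags are assumptions, not from the report).
from vllm import LLM, SamplingParams

llm = LLM(
    model="microsoft/Phi-3-medium-128k-instruct",  # assumed HF model id
    trust_remote_code=True,                        # Phi-3 ships custom code on the Hub
)
outputs = llm.generate(["Hello, Phi-3!"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```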
-
### Motivation
Speculative decoding can speed up generation more than 2x. This degree of speedup is an important feature for a production-grade LM deployment library, and it seems the methods are s…
-
Environment:
Deployed a vLLM OpenAI API server based on Qwen2 72B.
Command:
llmuses perf --url 'http://127.0.0.1:8000/v1/chat/completions' --parallel 4 --model '/share/modelscope/hub/qwen/Qwen2-72B-Instruct-FP8' --log-eve…
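For reference, a single request against the same endpoint the benchmark hits looks roughly like this; the payload fields follow the OpenAI chat/completions schema and the values are illustrative:

```python
# Sketch of one request against the endpoint used by the benchmark.
import requests

resp = requests.post(
    "http://127.0.0.1:8000/v1/chat/completions",
    json={
        "model": "/share/modelscope/hub/qwen/Qwen2-72B-Instruct-FP8",
        "messages": [{"role": "user", "content": "Hello"}],
        "max_tokens": 64,
    },
    timeout=60,
)
print(resp.status_code, resp.json())
```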
-
Please add one or more params to control logs from the RESTful API server, namely in the `mii.serve()` function.
For reference, see the `-log-` config params in vLLM: https://docs.vllm.ai/en/latest/servin…
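Something along these lines is what I have in mind; the keyword arguments below are purely hypothetical and do not exist in MII today, they only illustrate the kind of control being requested:

```python
# Hypothetical API sketch -- these log-control kwargs do not exist in MII,
# they only illustrate the requested feature.
import mii

mii.serve(
    "mistralai/Mistral-7B-v0.1",   # example model name
    disable_log_requests=True,     # hypothetical: don't log every request body
    log_level="WARNING",           # hypothetical: set REST server verbosity
)
```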
-
### Model/Pipeline/Scheduler description
Lumina-T2X is a text-to-any generation model. Our model can generate content in multiple modalities, most notably images. Currently, our image ge…
-
## User Story: Implement Backend Prometheus Metrics
**As a** backend operator
**I want** to have Prometheus metrics for observability of the vLLM backend
**So that** I can monitor the performance, h…
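A minimal sketch of the kind of instrumentation this story implies, using the `prometheus_client` library; the metric names and labels are illustrative, not an agreed naming scheme:

```python
# Illustrative sketch with prometheus_client; metric names/labels are examples only.
import time
from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter(
    "vllm_backend_requests_total", "Total requests handled by the vLLM backend", ["status"]
)
LATENCY = Histogram(
    "vllm_backend_request_latency_seconds", "End-to-end request latency in seconds"
)

def handle_request():
    start = time.monotonic()
    try:
        ...  # call into the vLLM backend here
        REQUESTS.labels(status="ok").inc()
    except Exception:
        REQUESTS.labels(status="error").inc()
        raise
    finally:
        LATENCY.observe(time.monotonic() - start)

if __name__ == "__main__":
    start_http_server(9090)  # expose /metrics for Prometheus to scrape
```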
-
Hi There,
I found that OpenAI() takes base_url as a mandatory argument for initialization, as mentioned in this vLLM documentation.
[https://docs.vllm.ai/en/latest/getting_started/quickstart.html#usi…
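The initialization pattern I am referring to is roughly the following; the model name is an example and the api_key value is a placeholder, only checked if the server was started with an API key:

```python
# Pointing the OpenAI client at a vLLM OpenAI-compatible server.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # vLLM OpenAI-compatible server
    api_key="EMPTY",                      # placeholder; only enforced if the server requires a key
)
completion = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",  # example; use whatever model the server loaded
    messages=[{"role": "user", "content": "Hello!"}],
)
print(completion.choices[0].message.content)
```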
-
When running the vLLM server for Functionary v2.5 Small, vLLM throws an error because it does not support the Functionary tokenizer. I've reverted back to v2.4 for now, but thought I should bring this i…
-
Hi,
I remember that vLLM support was on your TODO list. Have you achieved it now? Was the main challenge in this direction that tree verification with batch size > 1 is hard to make efficient? Thanks…