-
### Your current environment
My vLLM version is:
pip show vllm
Name: vllm
Version: 0.3.3+git3380931.abi0.dtk2404.torch2.1
Summary: A high-throughput and memory-efficient inference and serving eng…
-
**Is your feature request related to a problem? Please describe.**
I'm running some benchmarks with the Python SDK and profiling the code to understand more about its execution. Here's the profile report…
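For anyone wanting to reproduce this kind of report, a minimal sketch of how such a profile can be collected with the standard library's `cProfile`/`pstats` (the `run_benchmark` function here is a hypothetical stand-in for the actual SDK call being measured):

```python
import cProfile
import io
import pstats

def run_benchmark():
    # Hypothetical stand-in for the SDK call being profiled.
    return sum(i * i for i in range(10_000))

profiler = cProfile.Profile()
profiler.enable()
run_benchmark()
profiler.disable()

# Render the top entries sorted by cumulative time, as in a typical report.
buf = io.StringIO()
pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats(10)
print(buf.getvalue())
```

Sorting by `cumulative` surfaces the call paths where most wall time is spent, which is usually the first thing to look at in a serving benchmark.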
-
The `README.md` says "more efficient batch inference resulting in large-v2 with *60-70x REAL TIME speed (not provided in this repo)".
Will this eventually be integrated into this repo, too? That w…
-
![image](https://github.com/user-attachments/assets/c1a77b7e-049f-4d87-9ac9-ff71098462d1)
Thank you for your insightful work on γ-MoD. I have a question regarding Figure 4 in your paper.
Could y…
-
Hi! Hello! Peace be upon you!
Let's bring the documentation to all the Arabic-speaking community 🌏 (currently 0 out of 267 complete)
Would you like to translate? Please follow the 🤗 [TRANSLATING guid…
-
## Background
The `bandrobot` test, one of the demos in ONA, aims to test the multistep event inferencing/subgoaling of the ONA reasoner (via NAL-7 & NAL-8 temporal/procedural inference).
…
-
Add Stan PPL integration to use Stan models with Blackjax inference algorithms
With the [BridgeStan](https://roualdes.github.io/bridgestan/latest/) library, we can efficiently access log density an…
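To make the interface concrete, here is a minimal sketch of the contract this integration relies on: a `log_density_gradient(theta)` pair, which BridgeStan exposes for a compiled Stan model, driving a gradient-based MCMC kernel. The function below is a toy standard-normal stand-in (not the real BridgeStan API, which requires compiling a Stan program), and the one-step MALA kernel is illustrative of what a Blackjax-style sampler builds on top of it:

```python
import math
import random

def log_density_gradient(theta):
    # Toy stand-in for BridgeStan's log-density/gradient pair:
    # a standard-normal log density, log p(theta) = -theta^2 / 2 (+ const).
    return -0.5 * theta * theta, -theta

def mala_step(theta, step, rng):
    # One Metropolis-adjusted Langevin step using only the (lp, grad) interface.
    lp, grad = log_density_gradient(theta)
    prop = theta + 0.5 * step * grad + math.sqrt(step) * rng.gauss(0.0, 1.0)
    lp_p, grad_p = log_density_gradient(prop)
    # Forward/reverse proposal log-densities for the Metropolis correction.
    fwd = -((prop - theta - 0.5 * step * grad) ** 2) / (2 * step)
    rev = -((theta - prop - 0.5 * step * grad_p) ** 2) / (2 * step)
    if math.log(rng.random()) < lp_p - lp + rev - fwd:
        return prop
    return theta

rng = random.Random(0)
theta, samples = 0.0, []
for _ in range(5000):
    theta = mala_step(theta, 0.5, rng)
    samples.append(theta)

mean = sum(samples) / len(samples)  # should be close to 0 for N(0, 1)
```

Because the kernel only ever touches `log_density_gradient`, swapping the toy function for a BridgeStan model handle is exactly the kind of drop-in this feature request is about.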
-
- Description:
- The autoregressive decoding mode of LLMs means that tokens can only be decoded serially, which limits inference speed. Speculative decoding techniques can be used to decode L…
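The idea above can be sketched with toy deterministic models: a cheap "draft" model proposes several tokens ahead, and the expensive "target" model verifies them, accepting the longest agreeing prefix plus one corrected token. Both models here are hypothetical stand-ins, not any real LLM API:

```python
def draft_model(context, k):
    # Hypothetical fast drafter: right for the first 2 guesses, then wrong.
    last = context[-1]
    return [(last + i + 1) % 10 if i < 2 else 0 for i in range(k)]

def target_model(context):
    # Hypothetical target model: the "correct" next token is last + 1 (mod 10).
    return (context[-1] + 1) % 10

def speculative_decode(context, num_tokens, k=4):
    out = list(context)
    while len(out) < len(context) + num_tokens:
        proposal = draft_model(out, k)
        accepted = []
        for tok in proposal:
            # Verify each drafted token against the target model.
            if target_model(out + accepted) == tok:
                accepted.append(tok)
            else:
                break
        if len(accepted) < len(proposal):
            # On rejection, emit the target's own token so progress is made.
            accepted.append(target_model(out + accepted))
        out.extend(accepted)
    return out[: len(context) + num_tokens]

print(speculative_decode([0], 6))  # → [0, 1, 2, 3, 4, 5, 6]
```

The output is identical to decoding serially with the target model alone; the speedup comes from verifying a whole drafted block per target-model pass instead of one token at a time.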
-
### Description
The Kibana team has requested that we add pagination and sorting options to the `GET _inference/_all` API so these operations can be handled efficiently in the backend. Currently, they have ad…
-
### Has this been supported or requested before?
- [X] I have checked [the GitHub README](https://github.com/QwenLM/Qwen2.5).
- [X] I have checked [the Qwen documentation](https://qwen.readthedocs…