-
Hi,
Is there any way we can change the decoding logic?
For example: support for speculative sampling or other methods.
-
### Feature request
Hello, our models are deployed with TGI (v1.4.3), and we also want to use lorax. But I find that the TGI version lorax is based on is very different from TGI v1.4.3.
We …
-
So far, the .pre-commit-config.yml file only calls black and isort. I unintentionally used https://github.pie.apple.com/aiml-oss/ml-recurrent-drafter/blob/main/.pre-commit-config.yaml and discovered th…
-
They claim lookahead decoding provides a 1.5~2x decoding speedup without a speculative model.
Blog post: https://lmsys.org/blog/2023-11-21-lookahead-decoding/
Twitter thread: https://twitter.com/l…
-
Hi, I am trying to implement speculative decoding from [Accelerating Large Language Model Decoding with Speculative Sampling](https://arxiv.org/abs/2302.01318), and below is the code snippet:
`…
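For context, a minimal sketch of the per-token accept/reject step the paper describes, assuming the drafted token ids, the draft-model probabilities `q_probs`, and the target-model probabilities `p_probs` are already available (all names here are illustrative, not taken from the snippet above):

```python
import torch

def speculative_accept(draft_tokens, q_probs, p_probs):
    """Accept/reject drafted tokens as in Chen et al. (2023).

    draft_tokens: (k,) token ids proposed by the draft model
    q_probs:      (k, vocab) draft-model probabilities at each drafted position
    p_probs:      (k + 1, vocab) target-model probabilities (one extra row for the bonus token)
    Returns the accepted token ids plus one resampled or bonus token.
    """
    accepted = []
    for i, tok in enumerate(draft_tokens):
        p_i, q_i = p_probs[i, tok], q_probs[i, tok]
        # Accept the drafted token with probability min(1, p(x)/q(x)).
        if torch.rand(()) < torch.clamp(p_i / q_i, max=1.0):
            accepted.append(int(tok))
            continue
        # On the first rejection, resample from the residual max(0, p - q),
        # renormalized, and stop: this keeps the overall output distribution
        # identical to sampling from the target model alone.
        residual = torch.clamp(p_probs[i] - q_probs[i], min=0.0)
        accepted.append(int(torch.multinomial(residual / residual.sum(), 1)))
        return accepted
    # Every draft was accepted: take one extra (bonus) token from the target model.
    accepted.append(int(torch.multinomial(p_probs[len(draft_tokens)], 1)))
    return accepted
```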
-
### Motivation
**PD disaggregation** is said to provide a 2-4x throughput improvement. See [DistServe](https://hao-ai-lab.github.io/blogs/distserve/) for reference. They are planning to integrate th…
-
### Report of performance regression
Hi, I use this:
```
server_vllm.py \
--model "/data/models_temp/functionary-small-v2.4/" \
--served-model-name "functionary" \
--dtype=bfloat16 \
-…
-
### 🚀 The feature, motivation and pitch
`MLPSpeculator`-based speculative decoding was recently added in https://github.com/vllm-project/vllm/pull/4947, but the initial integration only covers sing…
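For readers unfamiliar with the feature, a rough sketch of how an `MLPSpeculator` draft is wired into the offline `LLM` API is shown below. The engine-argument names and checkpoint ids are assumptions based on vLLM releases from around that PR, not something taken from the issue itself; verify them against the version you run.

```python
from vllm import LLM, SamplingParams

# Hedged sketch only: speculative_model, num_speculative_tokens, and
# use_v2_block_manager reflect my understanding of vLLM around that release,
# and both checkpoint ids below are placeholders.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",        # target model (placeholder id)
    speculative_model="ibm-fms/llama3-8b-accelerator",   # MLPSpeculator weights (placeholder id)
    num_speculative_tokens=4,
    use_v2_block_manager=True,
)

params = SamplingParams(temperature=0.0, max_tokens=64)
out = llm.generate(["Speculative decoding works by"], params)
print(out[0].outputs[0].text)
```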
-
OctoAI uses vLLM as the benchmark baseline to demonstrate how fast they are: https://octo.ai/blog/acceleration-is-all-you-need-techniques-powering-octostacks-10x-performance-boost
| Single User Throughput | Mu…
-
## Related issues
#1364 #1361 #1333
## Description
We propose an inference implementation refactoring that mainly involves `Pipeline Split` and `Struct Simplification`, and this results in some …