-
We're excited to present in this roadmap document the features we're currently working on and planning to support. Your feedback is highly valued, so please don't hesitate to comment or reach out if y…
-
I want to know how we can run speculative decoding (assisted generation) to increase tokens/sec for a Llama 2-based model with optimum.neuron on inf2. Similar to what transformers have done f…
-
### 🚀 The feature, motivation and pitch
**TLDR:** We would like an option we can enable to continuously stream the `UsageInfo` when using the streaming completions API. This solves a number of "accou…
-
Speculative sampling is a technique for accelerating autoregressive text generation. It involves generating multiple possible outputs or continuations of a sequence using a cheap draft model, and t…
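A minimal sketch of the draft-then-verify loop described above, assuming toy `draft_probs`/`target_probs` distributions over a tiny vocabulary as hypothetical stand-ins for real model logits:

```python
# Sketch of speculative sampling: a cheap draft model proposes k tokens,
# the target model verifies each with an accept/reject rule that preserves
# the target distribution. Toy distributions stand in for real models.
import random

VOCAB = list(range(8))

def draft_probs(seq):
    # hypothetical cheap draft distribution: uniform over the vocabulary
    return {t: 1.0 / len(VOCAB) for t in VOCAB}

def target_probs(seq):
    # hypothetical target distribution: favors token (last + 1) mod |V|
    favored = (seq[-1] + 1) % len(VOCAB) if seq else 0
    p = {t: 0.5 / (len(VOCAB) - 1) for t in VOCAB}
    p[favored] = 0.5
    return p

def sample(probs):
    r = random.random()
    acc = 0.0
    for t, p in probs.items():
        acc += p
        if r <= acc:
            return t
    return t  # float-rounding fallback

def speculative_step(seq, k=4):
    """Draft k tokens, then accept/reject each against the target model."""
    draft, s = [], list(seq)
    for _ in range(k):
        t = sample(draft_probs(s))
        draft.append(t)
        s.append(t)
    accepted, s = [], list(seq)
    for t in draft:
        q = draft_probs(s)[t]
        p = target_probs(s)[t]
        if random.random() < min(1.0, p / q):  # accept with prob min(1, p/q)
            accepted.append(t)
            s.append(t)
        else:
            # on rejection, resample from the normalized residual max(p - q, 0)
            resid = {v: max(target_probs(s)[v] - draft_probs(s)[v], 0.0)
                     for v in VOCAB}
            z = sum(resid.values()) or 1.0
            accepted.append(sample({v: r / z for v, r in resid.items()}))
            break
    # (a full implementation also samples one extra token from the target
    #  when every draft token is accepted)
    return accepted
```

Each call returns between 1 and k tokens, so a single target-model verification pass can emit several tokens at once.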
-
Hi,
Thanks for this great repo. Since the main branch has vLLM 0.5.0, I was wondering whether speculative decoding will also be supported in sglang. Right now I am getting the following error:
laun…
-
Hi, thanks for your awesome work!
I'm trying to implement https://github.com/SafeAILab/EAGLE with high-performance kernels. I read [this blog](https://flashinfer.ai/2024/02/02/introduce-flashinfer.…
-
Prompt lookup decoding (PLD) is a variant of speculative decoding that replaces the draft model with a prefix lookup in the current sequence, resulting in a 2-4x throughput boost for input-grounded ta…
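A minimal sketch of the prefix-lookup draft step PLD relies on; the n-gram size and draft length here are illustrative parameters, not the reference implementation's defaults:

```python
def prompt_lookup_draft(input_ids, ngram_size=3, num_draft=5):
    """Draft tokens by matching the trailing n-gram earlier in the
    sequence and copying the tokens that followed it, so no draft
    model is needed. Works well for input-grounded tasks where the
    output largely repeats spans of the prompt."""
    if len(input_ids) < ngram_size:
        return []
    tail = input_ids[-ngram_size:]
    # search backwards, excluding the trailing occurrence itself
    for start in range(len(input_ids) - ngram_size - 1, -1, -1):
        if input_ids[start:start + ngram_size] == tail:
            return input_ids[start + ngram_size:start + ngram_size + num_draft]
    return []
```

The drafted tokens are then verified in a single forward pass of the target model, exactly as with a model-based draft.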
-
Hello all,
Thanks for your great work here. We are implementing speculative decoding at mistral.rs, and were in the final stages of testing when we discovered some incredibly strange behavior. Spec…
-
### Some context
I am using AMD MI100 GPUs and I can get ~33 tokens/second for Llama 2 70B using
- compile
- tensor parallelism of 8
- int8 quantization
```
time torchrun --standalone --npr…
```
-
Tree attention mask is already supported in huggingface/transformers: https://github.com/huggingface/transformers/pull/27539
It will be very helpful for the speculative decoding applications. More se…
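As an illustration of the kind of mask involved (a sketch, not the transformers implementation), a tree attention mask can be built from a parent array so that each speculated token attends only to itself and its ancestors, keeping sibling branches of the speculation tree isolated:

```python
import numpy as np

def tree_attention_mask(parents):
    """Build a boolean attention mask for a token tree.

    `parents[i]` is the index of node i's parent (-1 for a root).
    Node i may attend to node j iff j is i or an ancestor of i.
    """
    n = len(parents)
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        j = i
        while j != -1:  # walk up to the root, marking each ancestor
            mask[i, j] = True
            j = parents[j]
    return mask
```

For example, with `parents = [-1, 0, 0, 1]` node 3 attends to nodes 0, 1, and 3, while node 2 (a sibling branch) cannot see node 1.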