-
### 🚀 The feature, motivation and pitch
It apparently outperforms Mixtral at a smaller size, with a longer context length and multilingual support.
https://github.com/mistralai/mistral-inference/#deployment for Docke…
-
### 🚀 The feature, motivation and pitch
**TLDR:** We would like an option we can enable to continuously stream the `UsageInfo` when using the streaming completions API. This solves a number of "acc…
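A minimal sketch of what this could look like from the client side, assuming an OpenAI-compatible vLLM server and a per-chunk `usage` field; the `stream_options` flag and endpoint below are illustrative assumptions, not a confirmed vLLM interface:

```python
import json
import requests

# Hypothetical sketch: if the server attached UsageInfo to every streamed
# chunk (not only the final one), a client could track token counts live.
resp = requests.post(
    "http://localhost:8000/v1/completions",  # assumed OpenAI-compatible server
    json={
        "model": "my-model",  # placeholder model name
        "prompt": "Hello",
        "stream": True,
        # hypothetical option enabling per-chunk usage reporting
        "stream_options": {"include_usage": True},
    },
    stream=True,
)

for line in resp.iter_lines():
    if not line or not line.startswith(b"data: "):
        continue
    payload = line[len(b"data: "):]
    if payload == b"[DONE]":
        break
    chunk = json.loads(payload)
    # Under this proposal, every chunk would carry a running UsageInfo.
    usage = chunk.get("usage")
    if usage is not None:
        print("tokens so far:", usage["completion_tokens"])
```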
-
### 🚀 The feature, motivation and pitch
Speculative decoding allows emitting multiple tokens per sequence by speculating future tokens, scoring their likelihood using the LLM, and then accepting each…
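For reference, a minimal sketch of the standard acceptance rule from the speculative-sampling papers (Leviathan et al. / Chen et al.); the tensor shapes and function name are illustrative, not vLLM's internals:

```python
import torch

def accept_speculative(draft_tokens, draft_probs, target_probs):
    """Accept each drafted token with probability min(1, p_target / p_draft),
    stopping at the first rejection.

    draft_tokens: (k,) proposed token ids
    draft_probs:  (k, vocab) proposal distribution per position
    target_probs: (k, vocab) target-model distribution per position
    """
    accepted = []
    for i, tok in enumerate(draft_tokens):
        p = target_probs[i, tok]
        q = draft_probs[i, tok]
        if torch.rand(()) < torch.clamp(p / q, max=1.0):
            accepted.append(int(tok))
        else:
            # On rejection, resample from the normalized residual
            # max(0, p_target - p_draft), which keeps the output
            # distribution identical to the target model's.
            residual = torch.clamp(target_probs[i] - draft_probs[i], min=0)
            accepted.append(int(torch.multinomial(residual / residual.sum(), 1)))
            break
    return accepted
```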
-
### Your current environment
I don't know how to run it inside Docker.
### 🐛 Describe the bug
Simply run the following command:
`docker run --runtime nvidia --gpus all -v ~/.cache/huggingface:/root/.ca…
-
## Overview
Speculative decoding allows a speedup for memory-bound LLMs by using a fast proposal method to propose tokens that are verified in a single forward pass by the larger LLM. Papers report 2…
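As a back-of-envelope check on where such speedups come from, the expected number of tokens emitted per target-model forward pass follows the closed form from Leviathan et al. (2023), given a per-token acceptance rate `alpha` and draft length `k`:

```python
# Expected tokens produced per target-model verification pass:
#   (1 - alpha**(k + 1)) / (1 - alpha)     (Leviathan et al., 2023)
def expected_tokens_per_step(alpha: float, k: int) -> float:
    return (1 - alpha ** (k + 1)) / (1 - alpha)

# e.g. alpha = 0.8 with k = 4 drafted tokens gives ~3.36 tokens per pass,
# which is roughly where 2-3x speedups for memory-bound decoding come from.
print(expected_tokens_per_step(0.8, 4))  # 3.3616
```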
-
### 🚀 The feature, motivation and pitch
Please consider adding support for GPTQ- and AWQ-quantized Mixtral models.
I guess that after #4012 it's technically possible.
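A sketch of what usage could look like, assuming vLLM's existing `quantization` flag would extend to MoE models once support lands; the model repo below is a placeholder, not a tested checkpoint:

```python
from vllm import LLM, SamplingParams

# Sketch only: assumes Mixtral MoE layers gain AWQ support.
llm = LLM(
    model="TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ",  # placeholder AWQ repo
    quantization="awq",
    tensor_parallel_size=2,
)
out = llm.generate(["Hello"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```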
### Alternatives
_No r…
-
The code is as follows:
```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

max_model_len, tp_size = 131072, 1
model_name = "/models/codegeex4-all-9b"
tokenizer = AutoTokenizer.from_pr…
```
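A guess at how the truncated snippet typically continues, using standard vLLM and transformers APIs; the prompt and sampling settings are assumptions, not from the original report:

```python
# Sketch of the usual continuation; settings below are assumptions.
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
llm = LLM(
    model=model_name,
    tensor_parallel_size=tp_size,
    max_model_len=max_model_len,
    trust_remote_code=True,
)
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "write a quicksort"}],
    tokenize=False,
    add_generation_prompt=True,
)
outputs = llm.generate([prompt], SamplingParams(temperature=0.2, max_tokens=256))
print(outputs[0].outputs[0].text)
```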
-
## 🚀 Feature
Please add Medusa decoding to mlc-llm in C++; we urgently need it to speed up LLM decoding on mobile devices.
Reference: https://github.com/FasterDecoding/Medusa/tree/main
Medusa adds …
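To illustrate the idea (sketched in Python for brevity, though the request is for C++): Medusa attaches a few small extra heads to the base model's last hidden state, where head *i* drafts the token *i*+1 positions ahead, and the drafts are then verified by the base model in a single forward pass, as in speculative decoding. The module below is a simplified sketch, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class MedusaHeads(nn.Module):
    """Sketch: N lightweight heads over the final hidden state,
    one per lookahead position (the paper uses residual blocks;
    simplified here)."""

    def __init__(self, hidden_size: int, vocab_size: int, num_heads: int = 4):
        super().__init__()
        self.heads = nn.ModuleList(
            [
                nn.Sequential(
                    nn.Linear(hidden_size, hidden_size),
                    nn.SiLU(),
                    nn.Linear(hidden_size, vocab_size),
                )
                for _ in range(num_heads)
            ]
        )

    def forward(self, last_hidden: torch.Tensor) -> list[torch.Tensor]:
        # last_hidden: (batch, hidden) -> one logits tensor per lookahead step
        return [head(last_hidden) for head in self.heads]
```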
-
### Your current environment
```text
Collecting environment information...
PyTorch version: 2.3.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
…
-
## Function Calling
- Frontend
- Add `tools` argument in `sgl.gen`. See also guidance [tools](https://github.com/guidance-ai/guidance/blob/d1bbe1c698cbb201f89556d71193993e78c0686b/README.md?plai…
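A hypothetical sketch of what the proposed `tools` argument could look like; the exact API shape here is an assumption modeled on guidance's `tools=`, not SGLang's actual interface:

```python
import sgl

def get_weather(city: str) -> str:
    """Stub tool for illustration only."""
    return f"Sunny in {city}"

@sgl.function
def assistant(s, question):
    s += sgl.user(question)
    # Proposed: pass callable tools so the runtime can emit and execute
    # tool calls during generation, as guidance does.
    s += sgl.assistant(sgl.gen("answer", tools=[get_weather]))
```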