-
Apoorv, do you have plans for a paper or a technical report on prompt lookup decoding?
I know you've indicated that people should cite your GitHub repo, but it would be nice to have something out …
-
### Your current environment
I don't know how to run it inside a Docker container.
### 🐛 Describe the bug
Simply run the following command:
`docker run --runtime nvidia --gpus all -v ~/.cache/huggingface:/root/.ca…
-
### 🚀 The feature, motivation and pitch
Speculative decoding allows emitting multiple tokens per sequence by speculating future tokens, scoring their likelihood using the LLM, and then accepting each…
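For context, here is a minimal greedy sketch of that score-and-accept step; the model call, shapes, and function name are illustrative assumptions, not any particular framework's API:

```python
import torch

def verify_draft(target_model, input_ids, draft_tokens):
    """Score draft tokens with the target model in one forward pass and keep
    the longest prefix that matches the target model's greedy choice."""
    # Run the target model over the prompt plus all draft tokens at once.
    candidate = torch.cat([input_ids, draft_tokens], dim=-1)
    logits = target_model(candidate).logits  # [1, seq_len, vocab]

    accepted = []
    for i, tok in enumerate(draft_tokens[0]):
        # Logits at position (prompt_len + i - 1) predict the token at (prompt_len + i).
        pos = input_ids.shape[-1] + i - 1
        target_tok = logits[0, pos].argmax()
        if target_tok.item() == tok.item():
            accepted.append(tok.item())          # draft token agrees: accept it
        else:
            accepted.append(target_tok.item())   # mismatch: take the target token and stop
            break
    else:
        # All draft tokens accepted; emit one extra token from the final position.
        accepted.append(logits[0, -1].argmax().item())
    return accepted
```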
-
## Overview
Speculative decoding allows a speedup for memory-bound LLMs by using a fast proposal method to propose tokens that are verified in a single forward pass by the larger LLM. Papers report 2…
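As an illustration of one such fast proposal method, here is a minimal prompt-lookup-style drafting sketch; `input_ids` is assumed to be a `[1, seq_len]` token tensor and the function name is hypothetical:

```python
def prompt_lookup_propose(input_ids, ngram_size=3, num_draft=5):
    """Propose draft tokens by matching the trailing n-gram of the sequence
    against an earlier occurrence and copying the tokens that followed it."""
    ids = input_ids[0].tolist()          # assume a [1, seq_len] tensor of token ids
    pattern = ids[-ngram_size:]
    # Scan backwards from the most recent position, excluding the trailing n-gram itself.
    for start in range(len(ids) - ngram_size - 1, -1, -1):
        if ids[start:start + ngram_size] == pattern:
            draft = ids[start + ngram_size : start + ngram_size + num_draft]
            if draft:
                return draft             # speculative tokens to hand to the verifier
    return []                            # no match: fall back to ordinary decoding
```

The drafted tokens would then be verified by the larger LLM in a single forward pass, as described above.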
-
I have some questions about the structure of custom mask for lookahead and verify branches [as described in the blog](https://lmsys.org/blog/2023-11-21-lookahead-decoding/#lookahead-and-verify-in-the…
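For reference, here is a generic sketch of how a per-branch mask can be laid out, where each candidate branch attends to the committed prefix and causally to itself but never to other branches; this is only an assumption-laden illustration of branch masking in general, not the exact layout from the lookahead decoding blog:

```python
import torch

def branch_verify_mask(prefix_len, branch_lens):
    """Boolean attention mask (True = may attend) for a committed prefix
    followed by several independent candidate branches."""
    total = prefix_len + sum(branch_lens)
    mask = torch.zeros(total, total, dtype=torch.bool)
    # The prefix uses ordinary causal attention.
    mask[:prefix_len, :prefix_len] = torch.tril(
        torch.ones(prefix_len, prefix_len, dtype=torch.bool))
    pos = prefix_len
    for blen in branch_lens:
        rows = slice(pos, pos + blen)
        mask[rows, :prefix_len] = True   # every branch token sees the whole prefix
        mask[rows, rows] = torch.tril(   # causal attention inside the branch only
            torch.ones(blen, blen, dtype=torch.bool))
        pos += blen
    return mask
```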
-
## 🚀 Feature
Please add Medusa decoding to mlc-llm in C++; we urgently need it to speed up LLM decoding on mobile devices.
See: https://github.com/FasterDecoding/Medusa/tree/main
Medusa adds …
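As a rough sketch of the idea (in Python rather than C++, with illustrative layer sizes and no residual connections, so not the exact architecture from the Medusa repo): Medusa attaches extra decoding heads to the base model's last hidden state, and head k drafts the token k+1 positions ahead:

```python
import torch.nn as nn

class MedusaHeads(nn.Module):
    """Extra decoding heads over the base model's final hidden state.
    Head k predicts the token k+1 positions ahead, so several future tokens
    can be drafted from a single forward pass of the base model."""
    def __init__(self, hidden_size, vocab_size, num_heads=4):
        super().__init__()
        self.heads = nn.ModuleList([
            nn.Sequential(
                nn.Linear(hidden_size, hidden_size),
                nn.SiLU(),
                nn.Linear(hidden_size, vocab_size, bias=False),
            )
            for _ in range(num_heads)
        ])

    def forward(self, last_hidden_state):
        # last_hidden_state: [batch, hidden_size] for the final position.
        # Returns one logits tensor per lookahead offset.
        return [head(last_hidden_state) for head in self.heads]
```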
-
### Feature request
I would like to request [llama.cpp](https://github.com/ggerganov/llama.cpp) as a new model backend in the transformers library.
### Motivation
llama.cpp offers:
1) Exce…
-
Hi FlexFlow team,
I used the methods mentioned in #1099 to test the latency (GPU: RTX-4090), but I got a confusing result:
1) LLaMA-7B + 1 SSM (llama-160M), latency: 25.1 s
2) LLaMA-7B (without SSMs), la…
-
### Your current environment
```text
Collecting environment information...
PyTorch version: 2.3.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
OS…
-
### Your current environment
**But this is my host environment; the Engine is running on the latest official Docker image.**
```text
Collecting environment information...
PyTorch version: N/A
Is debug …