ggerganov / llama.cpp

LLM inference in C/C++
MIT License

Support speculative decoding in `server` example #5877

Open mscheong01 opened 4 months ago

mscheong01 commented 4 months ago

Feature Description

Provide speculative decoding support in the `server` example.

Motivation

I noticed this topic has popped up in several comments (1, 2, 3), but it seems we haven't officially opened an issue for it. I'm creating this one to provide a space for focused discussion on how to implement the feature and to actually get it started.

Possible Implementation

Perhaps move the speculative sampling implementation into `common` or `sampling` so that the `server` example can reuse it?
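
For reference, here is a minimal sketch of what such a shared helper's core loop could look like. The `Model` interface, `speculative_step`, and the greedy accept/reject rule are illustrative assumptions, not the actual llama.cpp API; a real implementation in `common` would operate on `llama_context` objects and evaluate the proposed tokens with batched `llama_decode` calls.

```cpp
// Hypothetical sketch of one round of (greedy) speculative decoding.
// `Model` is a stand-in for whatever abstraction a shared helper would
// receive -- NOT the real llama.cpp API.
#include <cstdint>
#include <vector>

using token = int32_t;

// Stand-in interface: greedily predicts the next token given a prefix.
struct Model {
    virtual token predict(const std::vector<token> & prefix) = 0;
    virtual ~Model() = default;
};

// One round of speculative decoding:
// 1. the small draft model proposes up to `n_draft` tokens,
// 2. the target model verifies them and keeps the longest matching prefix,
// 3. the target model's own prediction replaces the first mismatch.
std::vector<token> speculative_step(Model & target, Model & draft,
                                    std::vector<token> prefix, int n_draft) {
    // 1. draft proposals
    std::vector<token> proposals;
    std::vector<token> draft_prefix = prefix;
    for (int i = 0; i < n_draft; ++i) {
        const token t = draft.predict(draft_prefix);
        proposals.push_back(t);
        draft_prefix.push_back(t);
    }

    // 2. verification: in a real implementation all proposed positions are
    //    evaluated by the target model in a single batch, which is where
    //    the speed-up comes from.
    std::vector<token> accepted;
    for (const token t : proposals) {
        const token expected = target.predict(prefix);
        if (expected != t) {
            accepted.push_back(expected); // 3. correction on first mismatch
            break;
        }
        accepted.push_back(t);
        prefix.push_back(t);
    }
    return accepted;
}
```

If something along these lines lived in `common`, both the standalone `speculative` example and the `server` example could share the same accept/reject logic, which seems to be the main refactoring question here.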

vietanh125 commented 2 months ago

Any updates on this?

mscheong01 commented 2 months ago

@vietanh125 Not yet, but contributions are welcome 😃

ggerganov commented 2 months ago

There is ongoing related work in https://github.com/ggerganov/llama.cpp/pull/6828, though I haven't had time to look into the details yet.