ggerganov / llama.cpp

LLM inference in C/C++
MIT License

Support speculative decoding in `server` example #5877

Open mscheong01 opened 4 months ago

mscheong01 commented 4 months ago

Feature Description

Provide speculative decoding support in the `server` example.

Motivation

I noticed this topic has popped up in several comments (1, 2, 3), but it seems we haven't officially opened an issue for it. I'm creating this one to provide a space for focused discussion on how to implement the feature and to actually get it started.

Possible Implementation

Perhaps move the speculative sampling implementation into `common` or `sampling` so that the `server` example can reuse it?
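
For reference, here is a minimal sketch of what such a shared helper's core loop could look like. The `Model` interface, `speculative_step`, and the greedy accept/reject rule are illustrative assumptions, not the actual llama.cpp API; a real implementation in `common` would operate on `llama_context` objects and evaluate the proposed tokens with batched `llama_decode` calls.

```cpp
// Hypothetical sketch of one round of (greedy) speculative decoding.
// `Model` is a stand-in for whatever abstraction a shared helper would
// receive -- NOT the real llama.cpp API.
#include <cstdint>
#include <vector>

using token = int32_t;

// Stand-in interface: greedily predicts the next token given a prefix.
struct Model {
    virtual token predict(const std::vector<token> & prefix) = 0;
    virtual ~Model() = default;
};

// One round of speculative decoding:
// 1. the small draft model proposes up to `n_draft` tokens,
// 2. the target model verifies them and keeps the longest matching prefix,
// 3. the target model's own prediction replaces the first mismatch.
std::vector<token> speculative_step(Model & target, Model & draft,
                                    std::vector<token> prefix, int n_draft) {
    // 1. draft proposals
    std::vector<token> proposals;
    std::vector<token> draft_prefix = prefix;
    for (int i = 0; i < n_draft; ++i) {
        const token t = draft.predict(draft_prefix);
        proposals.push_back(t);
        draft_prefix.push_back(t);
    }

    // 2. verification: in a real implementation all proposed positions are
    //    evaluated by the target model in a single batch, which is where
    //    the speed-up comes from.
    std::vector<token> accepted;
    for (const token t : proposals) {
        const token expected = target.predict(prefix);
        if (expected != t) {
            accepted.push_back(expected); // 3. correction on first mismatch
            break;
        }
        accepted.push_back(t);
        prefix.push_back(t);
    }
    return accepted;
}
```

If something along these lines lived in `common`, both the standalone `speculative` example and the `server` example could share the same accept/reject logic, which seems to be the main refactoring question here.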

vietanh125 commented 2 months ago

Any updates on this?

mscheong01 commented 2 months ago

@vietanh125 Not yet, but contributions are welcome 😃

ggerganov commented 2 months ago

There is ongoing related work in https://github.com/ggerganov/llama.cpp/pull/6828, though I haven't had time to look into the details yet.