PygmalionAI / aphrodite-engine

PygmalionAI's large-scale inference engine
https://pygmalion.chat
GNU Affero General Public License v3.0

feat: Speculative Decoding using a draft model #432

Closed · AlpinDale closed this 3 weeks ago

AlpinDale commented 3 weeks ago

Thanks to the great work from AnyScale, speculative decoding now works properly. Example usage:
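For context, speculative decoding uses a small draft model to propose several tokens ahead, which the large target model then verifies, so accepted tokens cost far less than one target forward pass each. A minimal toy sketch of the draft-and-verify loop (not the engine's actual implementation; both "models" here are illustrative next-token functions over ints):

```python
def draft_model(ctx):
    # Cheap proposer: usually right, but drifts every 4th token.
    return ctx[-1] + 1 if len(ctx) % 4 else ctx[-1] + 2

def target_model(ctx):
    # Ground-truth model: always increments by 1.
    return ctx[-1] + 1

def speculative_step(ctx, k=5):
    # 1) The draft model proposes k tokens autoregressively.
    proposed = []
    draft_ctx = list(ctx)
    for _ in range(k):
        tok = draft_model(draft_ctx)
        proposed.append(tok)
        draft_ctx.append(tok)
    # 2) The target model checks the proposals and accepts the longest
    #    matching prefix, then emits one token of its own.
    accepted = []
    verify_ctx = list(ctx)
    for tok in proposed:
        if tok == target_model(verify_ctx):
            accepted.append(tok)
            verify_ctx.append(tok)
        else:
            break
    accepted.append(target_model(verify_ctx))  # bonus token from the target
    return accepted

ctx = [0]
while len(ctx) < 12:
    ctx.extend(speculative_step(ctx, k=5))
print(ctx[:12])  # → [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
```

Each step here accepts three drafted tokens plus one target token, which mirrors why a well-matched draft model (like the 68M Llama above) speeds up the 7B target.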

API Server:

aphrodite run meta-llama/Llama-2-7b-chat-hf --speculative-model JackFram/llama-68m --num-speculative-tokens 5 --use-v2-block-manager
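Once the server is up, it can be queried like any OpenAI-compatible endpoint. A hypothetical client sketch, assuming the server exposes `/v1/completions` on `localhost:2242` (host, port, and endpoint path are assumptions; check `aphrodite run --help` for the actual defaults):

```python
import json
import urllib.request

def complete(prompt, url="http://localhost:2242/v1/completions"):
    # Assumed OpenAI-style completions payload; the model name must match
    # the one the server was launched with.
    payload = {
        "model": "meta-llama/Llama-2-7b-chat-hf",
        "prompt": prompt,
        "max_tokens": 64,
        "temperature": 0.8,
    }
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["text"]
```

Speculative decoding is transparent to the client: requests and responses look identical, only latency changes.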

With the LLM class:

from aphrodite import LLM, SamplingParams

MODEL = "meta-llama/Llama-2-7b-chat-hf"
SPEC_MODEL = "JackFram/llama-68m"

prompts = [
    "Once upon a time,",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# The draft model proposes up to 5 tokens per step; the target model verifies them.
llm = LLM(
    model=MODEL,
    speculative_model=SPEC_MODEL,
    num_speculative_tokens=5,
    use_v2_block_manager=True,
)

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)