Closed. AlpinDale closed this 3 weeks ago.
Thanks to the great work from AnyScale, speculative decoding now works properly. Example usage:

API server:

```sh
aphrodite run meta-llama/Llama-2-7b-chat-hf --speculative-model JackFram/llama-68m --num-speculative-tokens 5 --use-v2-block-manager
```
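The server started by the command above speaks an OpenAI-compatible HTTP API. As a minimal sketch of how a client might call it, assuming the standard `/v1/completions` endpoint and a locally chosen host/port (both are assumptions here, adjust to your deployment):

```python
import json
import urllib.request

# Request body for the OpenAI-compatible completions endpoint.
# The model name matches the one loaded by the server command above;
# the sampling fields mirror the SamplingParams used in the LLM example.
payload = {
    "model": "meta-llama/Llama-2-7b-chat-hf",
    "prompt": "Once upon a time,",
    "max_tokens": 64,
    "temperature": 0.8,
    "top_p": 0.95,
}

def build_request(url: str, body: dict) -> urllib.request.Request:
    """Build a POST request with a JSON body for the completions endpoint."""
    data = json.dumps(body).encode("utf-8")
    return urllib.request.Request(
        url,
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Host and port are placeholders; point this at wherever the server runs.
req = build_request("http://localhost:2242/v1/completions", payload)
# Once the server is up, send it with urllib.request.urlopen(req).
```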
With the `LLM` class:

```python
from aphrodite import LLM, SamplingParams

MODEL = "meta-llama/Llama-2-7b-chat-hf"
SPEC_MODEL = "JackFram/llama-68m"

prompts = [
    "Once upon a time,",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(
    model=MODEL,
    speculative_model=SPEC_MODEL,
    num_speculative_tokens=5,
    use_v2_block_manager=True,
)
outputs = llm.generate(prompts, sampling_params)
```