Paper: https://arxiv.org/abs/2406.17276
We propose OPT-Tree, an adaptive and scalable draft tree structure for speculative decoding that supports any autoregressive draft model. With OPT-Tree, more than 10 tokens can be generated in a single decoding step. In the example below, blue tokens are drafted by llama-2-chat-7b and verified by llama-2-chat-70b in a single decoding step, while red tokens are generated by llama-2-chat-70b alone.
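For intuition, here is a toy sketch of plain greedy speculative decoding with a linear draft chain (draft several tokens cheaply, verify them with the target model, accept the longest agreeing prefix, then append one target token). This is the baseline idea, not the adaptive tree that OPT-Tree builds; the two next-token functions are hypothetical stand-ins for real draft/target models.

```python
def draft_next(ctx):
    # hypothetical cheap draft model: deterministic toy pattern
    return (ctx[-1] + 1) % 5

def target_next(ctx):
    # hypothetical target model: agrees with the draft most of the
    # time, but disagrees whenever the context length is a multiple of 7
    return (ctx[-1] + 1) % 5 if len(ctx) % 7 else 0

def speculative_step(ctx, k=5):
    # 1) draft k tokens autoregressively with the cheap model
    draft, cur = [], list(ctx)
    for _ in range(k):
        t = draft_next(cur)
        draft.append(t)
        cur.append(t)
    # 2) verify: accept the longest prefix on which the target agrees
    #    (a real implementation scores all k positions in one forward pass)
    accepted, cur = [], list(ctx)
    for t in draft:
        if target_next(cur) != t:
            break
        accepted.append(t)
        cur.append(t)
    # 3) always append one token from the target (the correction/bonus token)
    accepted.append(target_next(cur))
    return ctx + accepted

seq = [0]
for _ in range(3):
    seq = speculative_step(seq)
# when the models agree, each step accepts several tokens at once
```

Each call to `speculative_step` costs one target verification pass but can emit up to k+1 tokens, which is where the speedup comes from.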
pip install -r requirements.txt
export CUDA_VISIBLE_DEVICES=0  # multiple GPUs are also supported
python demo_opt_classic.py
export CUDA_VISIBLE_DEVICES=0  # multiple GPUs are also supported
python demo_opt_eagle.py
export CUDA_VISIBLE_DEVICES=0,1,2,3
python -m evaluation.eval_opt_classic \
--draft-model-path JackFram/llama-68m \
--base-model-path meta-llama/Llama-2-7b-chat-hf \
--bench-name mt_bench \
--answer-file ./mt_classic_opt.jsonl \
--temperature 0 \
--nodes 60 \
--threshold 0.5 \
--max_depth 10
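The `--nodes`, `--threshold`, and `--max_depth` flags bound the size and shape of the draft tree. As an illustrative assumption (not the repository's actual implementation of the OPT-Tree construction), one way such a tree can be grown is best-first by joint draft probability, stopping at the node budget, pruning branches below the threshold, and capping the depth:

```python
import heapq

def grow_draft_tree(root_probs, expand, nodes=60, threshold=0.5, max_depth=10):
    """Toy best-first draft-tree growth (hypothetical sketch).

    root_probs: {token: prob} from the draft model at the root.
    expand(path): {token: prob} for the position after `path`.
    Keeps at most `nodes` tree nodes, prunes branches whose joint
    probability falls below `threshold`, and stops at `max_depth`.
    """
    tree = []   # accepted nodes: (path, joint_prob)
    heap = []   # max-heap via negated joint probability
    for tok, p in root_probs.items():
        heapq.heappush(heap, (-p, (tok,)))
    while heap and len(tree) < nodes:
        neg_p, path = heapq.heappop(heap)
        joint = -neg_p
        if joint < threshold:   # prune low-probability branches
            continue
        tree.append((path, joint))
        if len(path) < max_depth:
            for tok, p in expand(path).items():
                heapq.heappush(heap, (-joint * p, path + (tok,)))
    return tree
```

With a sharply peaked draft distribution the tree degenerates into a deep chain along the likely continuation; with a flatter distribution it spends its node budget on width instead, which is the trade-off these flags control.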
EAGLE draft models can be downloaded from https://github.com/SafeAILab/EAGLE.
export CUDA_VISIBLE_DEVICES=0,1,2,3
python -m evaluation.eval_opt_eagle \
--ea-model-path yuhuili/EAGLE-llama2-chat-7B \
--base-model-path meta-llama/Llama-2-7b-chat-hf \
--bench-name mt_bench \
--answer-file ./mt_eagle_opt.jsonl \
--temperature 0 \
--nodes 60 \
--threshold 0.5 \
--max_depth 10
@misc{wang2024opttreespeculativedecodingadaptive,
title={OPT-Tree: Speculative Decoding with Adaptive Draft Tree Structure},
author={Jikai Wang and Yi Su and Juntao Li and Qinrong Xia and Zi Ye and Xinyu Duan and Zhefeng Wang and Min Zhang},
year={2024},
eprint={2406.17276},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2406.17276},
}