OPT-Tree: Speculative Decoding with Adaptive Draft Tree Structure

Paper: https://arxiv.org/abs/2406.17276

Contents

- Introduction
- Installation
- Demo
- Evaluation on datasets
- Citation

Introduction

OPT-Tree

We propose an adaptive and scalable draft tree structure for speculative decoding that supports any autoregressive draft model. With OPT-Tree, more than 10 tokens can be generated in a single decoding step. An example is shown below: blue tokens are drafted by llama-2-chat-7b and verified by llama-2-chat-70b in a single decoding step, while red tokens are generated by llama-2-chat-70b.
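The adaptive idea can be sketched in a few lines: instead of a fixed-shape draft tree, greedily keep the nodes whose token paths have the highest joint draft probability. This is a toy illustration only, with assumed names (`build_draft_tree`, and a callable `expand` standing in for the draft model), not the paper's exact algorithm:

```python
import heapq

def build_draft_tree(root_probs, expand, num_nodes=8, threshold=0.1):
    """Toy sketch: grow a draft tree by always expanding the frontier node
    with the highest joint draft probability, until num_nodes nodes are kept
    or no candidate clears the probability threshold."""
    tree = []  # list of (token path, joint probability), best-first
    # Negate probabilities so heapq's min-heap acts as a max-heap.
    frontier = [(-p, (tok,)) for tok, p in root_probs.items() if p >= threshold]
    heapq.heapify(frontier)
    while frontier and len(tree) < num_nodes:
        neg_p, path = heapq.heappop(frontier)
        tree.append((path, -neg_p))
        # expand(path) returns the draft model's next-token distribution.
        for tok, p in expand(path).items():
            joint = -neg_p * p  # joint probability of the extended path
            if joint >= threshold:
                heapq.heappush(frontier, (-joint, path + (tok,)))
    return tree

# Toy draft distribution: every context yields the same two candidates.
draft = lambda path: {0: 0.6, 1: 0.3}
tree = build_draft_tree({0: 0.7, 1: 0.2}, draft, num_nodes=4)
```

Note how the kept nodes form an irregular tree (a deep chain under the likely token, a shallow branch under the unlikely one). The `--nodes`, `--threshold`, and `--max_depth` arguments in the evaluation commands below play analogous roles for the actual method.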

Installation

```shell
pip install -r requirements.txt
```

Demo

With independent draft models

```shell
export CUDA_VISIBLE_DEVICES=0  # multiple GPUs are also supported
python demo_opt_classic.py
```

With EAGLE draft models

```shell
export CUDA_VISIBLE_DEVICES=0  # multiple GPUs are also supported
python demo_opt_eagle.py
```

Evaluation on datasets

With independent draft models

```shell
export CUDA_VISIBLE_DEVICES=0,1,2,3
python -m evaluation.eval_opt_classic \
         --draft-model-path JackFram/llama-68m \
         --base-model-path meta-llama/Llama-2-7b-chat-hf \
         --bench-name mt_bench \
         --answer-file ./mt_classic_opt.jsonl \
         --temperature 0 \
         --nodes 60 \
         --threshold 0.5 \
         --max_depth 10
```

With EAGLE draft models

EAGLE draft models can be downloaded from https://github.com/SafeAILab/EAGLE.

```shell
export CUDA_VISIBLE_DEVICES=0,1,2,3
python -m evaluation.eval_opt_eagle \
         --ea-model-path yuhuili/EAGLE-llama2-chat-7B \
         --base-model-path meta-llama/Llama-2-7b-chat-hf \
         --bench-name mt_bench \
         --answer-file ./mt_eagle_opt.jsonl \
         --temperature 0 \
         --nodes 60 \
         --threshold 0.5 \
         --max_depth 10
```

Citation

```
@misc{wang2024opttreespeculativedecodingadaptive,
      title={OPT-Tree: Speculative Decoding with Adaptive Draft Tree Structure},
      author={Jikai Wang and Yi Su and Juntao Li and Qinrong Xia and Zi Ye and Xinyu Duan and Zhefeng Wang and Min Zhang},
      year={2024},
      eprint={2406.17276},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2406.17276},
}
```