Infini-AI-Lab / Sequoia

scalable and robust tree-based speculative decoding algorithm

How to benchmark for speedup and acceptance rate? #12

Open singularity-s0 opened 3 months ago

singularity-s0 commented 3 months ago

Sorry for asking a possibly obvious question, but it would help if the documentation made this clear: how do we benchmark the speedup and the acceptance rate?

cyLi-Tiger commented 3 months ago

+1. How do we benchmark the speedup? I ran the example code and didn't see an obvious acceleration. How can we reproduce the 4.04x speedup of Llama2-7b on an A100?

dreaming-panda commented 3 months ago

> +1. How do we benchmark the speedup? I ran the example code and didn't see an obvious acceleration. How can we reproduce the 4.04x speedup of Llama2-7b on an A100?

To run Sequoia:

CUDA_VISIBLE_DEVICES=0 python testbed_greedy.py --model JackFram/llama-68m --target meta-llama/Llama-2-7b-hf --T 0.6 --P 1.0 --start 0 --end 200 --M 384 --growmap ../A100_growmaps/68m_7b/growmaps/A100-C4-68m-7b-greedy.pt --Mode greedy --dataset c4

To run the baseline:

CUDA_VISIBLE_DEVICES=0 python testbed_greedy.py --model JackFram/llama-68m --target meta-llama/Llama-2-7b-hf --T 0.6 --P 1.0 --start 0 --end 200 --M 384 --growmap ../A100_growmaps/68m_7b/growmaps/A100-C4-68m-7b-greedy.pt --Mode baseline --dataset c4

Since the framework is implemented in Huggingface, the baseline should run at around 23 ms to 25 ms per token, and Sequoia should run at around 6 ms to 7 ms per token.
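For a rough sanity check, those per-token latencies already imply the reported speedup; here is a minimal sketch using the midpoints of the ranges quoted above (exact figures will vary by run):

```python
# Rough speedup estimate from the per-token latencies quoted above.
baseline_ms_per_token = 24.0  # midpoint of the 23 ms - 25 ms baseline range
sequoia_ms_per_token = 6.5    # midpoint of the 6 ms - 7 ms Sequoia range

speedup = baseline_ms_per_token / sequoia_ms_per_token
print(f"estimated speedup: {speedup:.2f}x")  # ~3.69x, close to the reported ~4x
```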

singularity-s0 commented 3 months ago

Thanks for the response. How about the acceptance rate? And what do decoding step and large model step mean in the output?

dreaming-panda commented 3 months ago

decoding step is the number of tokens generated. large model step is the number of times the large model runs verification. decoding step / large model step therefore reflects how many tokens are accepted, on average, per verification with Sequoia's tree.
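Put differently, here is a small sketch with made-up counts (not output from the scripts) illustrating that relation:

```python
# Hypothetical counters, just to illustrate the relation described above.
decoding_steps = 1200     # total tokens generated
large_model_steps = 300   # verification passes by the target model

tokens_per_verification = decoding_steps / large_model_steps
print(f"{tokens_per_verification:.2f} tokens generated per verification step")  # 4.00
```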

The acceptance rate needs to be measured independently with:

python test_accept.py --model JackFram/llama-68m --target meta-llama/Llama-2-7b-hf --T 0.6 --P 1.0 --start 0 --end 200 --M 288 --W 32 --ALG stochastic --dataset cnn
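For context, in speculative decoding the acceptance rate is generally the fraction of drafted tokens that the target model accepts; a generic sketch with hypothetical counts (test_accept.py may report it differently, e.g. per draft position):

```python
# Generic acceptance-rate computation for speculative decoding.
# Hypothetical counts; not output from test_accept.py.
drafted_tokens = 2880    # tokens proposed by the draft model
accepted_tokens = 2016   # tokens the target model accepted

acceptance_rate = accepted_tokens / drafted_tokens
print(f"acceptance rate: {acceptance_rate:.2%}")  # 70.00%
```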

singularity-s0 commented 3 months ago

Thank you. This answers all my questions.

briskerkazoos commented 3 months ago

After testing both the baseline and greedy modes on the C4 dataset on an A100, I get the following results:

Baseline: total time: 110.10318s, latency: 0.02298s, decoding step: 4791
Greedy: total time: 144.56247s, latency: 0.00813s, decoding step: 17778, large model step: 4605, decoding step / large model step: 3.8605863192182412

It seems that many more tokens are generated in greedy mode than in baseline mode. Although the per-token latency matches expectations, I wonder if it is unfair to compare latency when the two runs generate different numbers of tokens. Would it be better to fix the sequence length and compare total generation time instead?

> decoding step / large model step therefore reflects how many tokens are accepted, on average, per verification with Sequoia's tree.

Just to make sure I understand this correctly, if all drafts are wrong, then decoding step / large model step = 1. And if decoding step / large model step = 2, it means that on average, the drafting model gets 1 token correct per draft. Is this right?
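On the fairness point above: since both runs report per-token latency, one way to compare them on equal footing is to normalize to a fixed number of tokens. A small sketch using the numbers from the results above:

```python
# Normalize both runs to the same number of generated tokens.
baseline_s_per_token = 110.10318 / 4791    # baseline run: ~0.02298 s/token
greedy_s_per_token = 144.56247 / 17778     # Sequoia greedy run: ~0.00813 s/token

n_tokens = 4791  # e.g. match the baseline's token count
print(f"baseline: {baseline_s_per_token * n_tokens:.2f}s for {n_tokens} tokens")
print(f"sequoia : {greedy_s_per_token * n_tokens:.2f}s for {n_tokens} tokens")
print(f"ratio   : {baseline_s_per_token / greedy_s_per_token:.2f}x")  # ~2.83x on this run
```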

dreaming-panda commented 3 months ago

Your understanding is correct. We only allow the baseline to generate 32 tokens because, in some experiments such as Vicuna-33B, running the baseline can take a lot of time.

You can change this manually if you want; what you need to modify is the inner_decoding_step < 32 condition in testbed.py. We also plan to update the code in the coming weeks and will address this.
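For example, if the baseline loop in testbed.py has roughly this shape (a sketch only; the surrounding code is assumed, and only the inner_decoding_step < 32 condition is confirmed above), raising the bound lets the baseline generate longer sequences:

```python
# Sketch of the kind of change described above; not the actual testbed.py code.
MAX_BASELINE_TOKENS = 256  # was 32; raise to let the baseline generate more tokens

inner_decoding_step = 0
while inner_decoding_step < MAX_BASELINE_TOKENS:
    # ... one autoregressive decoding step of the target model ...
    inner_decoding_step += 1
```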