baoqianmagik opened 1 month ago
@baberabb I need help! Thanks!
I use the API to evaluate:
lm_eval --model local-completions --tasks arc_easy --model_args model=/mnt/nvme0n1/ckpt/llama/Meta-Llama-3.1-8B-Instruct,base_url=http://127.0.0.1:9001/v1/completions,num_concurrent=1,max_retries=3,tokenized_requests=False
@haileyschoelkopf I need help! Thanks!
Hi! For loglikelihood-based tasks such as arc_easy, we simply do one forward pass of the model to get the logits, so speculative decoding isn't really relevant. You should try a generation-based task such as gsm8k or ifeval.
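For example, something like the following (a sketch reusing the same endpoint and model path as the arc_easy command above, with only the task changed):

```bash
# Same local-completions endpoint as above; only --tasks changes to a
# generation-based task such as gsm8k.
lm_eval --model local-completions \
  --tasks gsm8k \
  --model_args model=/mnt/nvme0n1/ckpt/llama/Meta-Llama-3.1-8B-Instruct,base_url=http://127.0.0.1:9001/v1/completions,num_concurrent=1,max_retries=3,tokenized_requests=False
```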
Thanks! I ran the gsm8k test and got a result. I have two more questions. 1. My model is Meta-Llama-3.1-8B-Instruct and I got the result below. Is this result reasonable? I obtained it on a single H100 GPU.
2. When I enable spec decode using vLLM, my test command is:
lm_eval --model local-completions --tasks gsm8k --model_args model=/mnt/nvme0n1/ckpt/llama/Meta-Llama-3.1-8B-Instruct,base_url=http://127.0.0.1:9001/v1/completions,num_concurrent=16,max_retries=3,tokenized_requests=False
I noticed that in the lm_eval command I only specified Meta-Llama-3.1-8B-Instruct and did not specify a draft model. Does lm_eval not require specifying a draft model? @baberabb Hoping for your reply, thanks!
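For reference, the draft model is a vLLM server-side setting rather than an lm_eval one: lm_eval only talks to the OpenAI-compatible endpoint and has no notion of speculative decoding. A minimal sketch of the server launch, using the same flags that appear in the vLLM command later in this thread:

```bash
# Speculative decoding is configured where the server is started;
# lm_eval just queries the resulting endpoint.
CUDA_VISIBLE_DEVICES=0 python -m vllm.entrypoints.openai.api_server \
  --model /mnt/nvme0n1/ckpt/llama/Meta-Llama-3.1-8B-Instruct \
  --speculative_model /mnt/nvme0n1/ckpt/llama/Meta-Llama-3.2-1B-Instruct \
  --num_speculative_tokens 4 \
  --host 0.0.0.0 --port 9001
```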
I use lm-evaluation-harness to test vLLM accuracy. 1. When spec decode is not enabled, I got the results below. vLLM command:
CUDA_VISIBLE_DEVICES=0 python -m vllm.entrypoints.openai.api_server --model /mnt/nvme0n1/ckpt/llama/Meta-Llama-3.1-8B-Instruct --trust-remote-code --tensor-parallel-size 1 --chat-template ./examples/template_chatml.jinja --host 0.0.0.0 --port 9001
Then I use the API to evaluate:
lm_eval --model local-completions --tasks arc_easy --model_args model=/mnt/nvme0n1/ckpt/llama/Meta-Llama-3.1-8B-Instruct,base_url=http://127.0.0.1:9001/v1/completions,num_concurrent=1,max_retries=3,tokenized_requests=False
(result screenshots attached for num_concurrent = 1, 8, 16, 32)
2. When spec decode is enabled, I got the results below. vLLM command:
CUDA_VISIBLE_DEVICES=0 python -m vllm.entrypoints.openai.api_server --model /mnt/nvme0n1/ckpt/llama/Meta-Llama-3.1-8B-Instruct --trust-remote-code -tp 1 --speculative_model /mnt/nvme0n1/ckpt/llama/Meta-Llama-3.2-1B-Instruct --num_speculative_tokens 4 --seed 42 --use-v2-block-manager --chat-template ./examples/template_chatml.jinja --host 0.0.0.0 --port 9001
Then I use the API to evaluate:
lm_eval --model local-completions --tasks arc_easy --model_args model=/mnt/nvme0n1/ckpt/llama/Meta-Llama-3.1-8B-Instruct,base_url=http://127.0.0.1:9001/v1/completions,num_concurrent=1,max_retries=3,tokenized_requests=False
(result screenshots attached for num_concurrent = 1, 8, 16, 32)
Has anyone done such experiments? Does vLLM's speculative decoding affect output accuracy?
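One way to check is to run the identical lm_eval command against both server configurations and compare the saved scores. A minimal sketch, assuming the same model path and port as above (--output_path is lm_eval's flag for writing result JSON to disk):

```bash
# 1) Start vLLM without speculative decoding, then:
lm_eval --model local-completions --tasks gsm8k \
  --model_args model=/mnt/nvme0n1/ckpt/llama/Meta-Llama-3.1-8B-Instruct,base_url=http://127.0.0.1:9001/v1/completions,num_concurrent=16,max_retries=3,tokenized_requests=False \
  --output_path results/baseline

# 2) Restart vLLM with --speculative_model ... --num_speculative_tokens 4 and
#    rerun the exact same command with --output_path results/spec_decode,
#    then compare the gsm8k metrics in the two result files.
```

If the two runs differ noticeably, it is worth ruling out run-to-run sampling nondeterminism (seed, temperature) before attributing the gap to speculative decoding.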