baoqianmagik opened 1 month ago
@baberabb I need help! Thanks!
I use the API to evaluate:
lm_eval --model local-completions --tasks arc_easy --model_args model=/mnt/nvme0n1/ckpt/llama/Meta-Llama-3.1-8B-Instruct,base_url=http://127.0.0.1:9001/v1/completions,num_concurrent=1,max_retries=3,tokenized_requests=False
@haileyschoelkopf I need help! Thanks!
Hi! For loglikelihood-based tasks such as arc_easy, we simply do one forward pass of the model to get the logits, so speculative decoding isn't really relevant. You should try a generation-based task such as gsm8k or ifeval.
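For example, something like the following (a sketch reusing the same endpoint and model path as the arc_easy command above, with only the task changed):

```bash
# Same local-completions endpoint as above; only --tasks changes to a
# generation-based task such as gsm8k.
lm_eval --model local-completions \
  --tasks gsm8k \
  --model_args model=/mnt/nvme0n1/ckpt/llama/Meta-Llama-3.1-8B-Instruct,base_url=http://127.0.0.1:9001/v1/completions,num_concurrent=1,max_retries=3,tokenized_requests=False
```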
Thanks! I ran the gsm8k test and got a result. I have two more questions. 1. My model is Meta-Llama-3.1-8B-Instruct and I got the result below. Is this result reasonable? I obtained it on a single H100 GPU.
2. When I enable spec decode using vLLM, my test command is:
lm_eval --model local-completions --tasks gsm8k --model_args model=/mnt/nvme0n1/ckpt/llama/Meta-Llama-3.1-8B-Instruct,base_url=http://127.0.0.1:9001/v1/completions,num_concurrent=16,max_retries=3,tokenized_requests=False
I noticed that in the lm_eval command I only specified Meta-Llama-3.1-8B-Instruct and did not specify a draft model. Does lm_eval not require specifying a draft model? @baberabb Hoping for your reply, thanks!
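For reference, the draft model is a vLLM server-side setting rather than an lm_eval one: lm_eval only talks to the OpenAI-compatible endpoint and has no notion of speculative decoding. A minimal sketch of the server launch, using the same flags that appear in the vLLM command later in this thread:

```bash
# Speculative decoding is configured where the server is started;
# lm_eval just queries the resulting endpoint.
CUDA_VISIBLE_DEVICES=0 python -m vllm.entrypoints.openai.api_server \
  --model /mnt/nvme0n1/ckpt/llama/Meta-Llama-3.1-8B-Instruct \
  --speculative_model /mnt/nvme0n1/ckpt/llama/Meta-Llama-3.2-1B-Instruct \
  --num_speculative_tokens 4 \
  --host 0.0.0.0 --port 9001
```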
I use lm-evaluation-harness to test vLLM accuracy. 1. When spec decode is not enabled, I got the results below. vLLM command:
CUDA_VISIBLE_DEVICES=0 python -m vllm.entrypoints.openai.api_server --model /mnt/nvme0n1/ckpt/llama/Meta-Llama-3.1-8B-Instruct --trust-remote-code --tensor-parallel-size 1 --chat-template ./examples/template_chatml.jinja --host 0.0.0.0 --port 9001
Then I use the API to evaluate:
lm_eval --model local-completions --tasks arc_easy --model_args model=/mnt/nvme0n1/ckpt/llama/Meta-Llama-3.1-8B-Instruct,base_url=http://127.0.0.1:9001/v1/completions,num_concurrent=1,max_retries=3,tokenized_requests=False
(result screenshots attached for num_concurrent = 1, 8, 16, 32)
2. When spec decode is enabled, I got the results below. vLLM command:
CUDA_VISIBLE_DEVICES=0 python -m vllm.entrypoints.openai.api_server --model /mnt/nvme0n1/ckpt/llama/Meta-Llama-3.1-8B-Instruct --trust-remote-code -tp 1 --speculative_model /mnt/nvme0n1/ckpt/llama/Meta-Llama-3.2-1B-Instruct --num_speculative_tokens 4 --seed 42 --use-v2-block-manager --chat-template ./examples/template_chatml.jinja --host 0.0.0.0 --port 9001
Then I use the API to evaluate:
lm_eval --model local-completions --tasks arc_easy --model_args model=/mnt/nvme0n1/ckpt/llama/Meta-Llama-3.1-8B-Instruct,base_url=http://127.0.0.1:9001/v1/completions,num_concurrent=1,max_retries=3,tokenized_requests=False
(result screenshots attached for num_concurrent = 1, 8, 16, 32)
Has anyone done such experiments? Does vLLM's speculative decoding affect output accuracy?
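One way to check is to run the identical lm_eval command against both server configurations and compare the saved scores. A minimal sketch, assuming the same model path and port as above (--output_path is lm_eval's flag for writing result JSON to disk):

```bash
# 1) Start vLLM without speculative decoding, then:
lm_eval --model local-completions --tasks gsm8k \
  --model_args model=/mnt/nvme0n1/ckpt/llama/Meta-Llama-3.1-8B-Instruct,base_url=http://127.0.0.1:9001/v1/completions,num_concurrent=16,max_retries=3,tokenized_requests=False \
  --output_path results/baseline

# 2) Restart vLLM with --speculative_model ... --num_speculative_tokens 4 and
#    rerun the exact same command with --output_path results/spec_decode,
#    then compare the gsm8k metrics in the two result files.
```

If the two runs differ noticeably, it is worth ruling out run-to-run sampling nondeterminism (seed, temperature) before attributing the gap to speculative decoding.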