QAZWSX0827 opened 1 month ago
Hi, is there an answer to the question?
Additionally, I tested the difference between incremental decoding and speculative decoding.
For incremental decoding, I used the following code:
import flexflow.serve as ff

ff.init(
    num_gpus=1,
    memory_per_gpu=56000,
    zero_copy_memory_per_node=120000,
    tensor_parallelism_degree=1,
    pipeline_parallelism_degree=1
)
# Specify the LLM
# llm = ff.LLM("meta-llama/Llama-2-7b-hf")
llm = ff.LLM("/public/home/wutong/meta-llama/Llama-2-7b-hf")
# Specify a list of SSMs (just one in this case)
ssms = []
# ssm = ff.SSM("JackFram/llama-68m")
ssm = ff.SSM("/public/home/wutong/JackFram/llama-68m")
ssms.append(ssm)
# Create the sampling configs
generation_config = ff.GenerationConfig(
    do_sample=False, temperature=0.9, topp=0.8, topk=1
)
# Compile the SSMs for inference and load the weights into memory
for ssm in ssms:
    ssm.compile(generation_config)
# Compile the LLM for inference and load the weights into memory
llm.compile(
    generation_config,
    max_requests_per_batch=16,
    max_seq_length=256,
    max_tokens_per_batch=128,
    ssms=ssms
)
llm.start_server()
result = llm.generate("Here are some travel tips for Tokyo:\n")
# result = llm.generate("Give three tips for staying healthy.")
llm.stop_server()  # This invocation is optional
For speculative decoding, I used the following code:
import flexflow.serve as ff

# Initialize the FlexFlow runtime. ff.init() takes a dictionary or the path to a JSON file with the configs
ff.init(
    num_gpus=1,
    memory_per_gpu=56000,
    zero_copy_memory_per_node=120000,
    tensor_parallelism_degree=1,
    pipeline_parallelism_degree=1
)
# Create the FlexFlow LLM
# llm = ff.LLM("meta-llama/Llama-2-7b-hf")
llm = ff.LLM("/public/home/wutong/meta-llama/Llama-2-7b-hf")
# Create the sampling configs
generation_config = ff.GenerationConfig(
    do_sample=True, temperature=0.9, topp=0.8, topk=1
)
# Compile the LLM for inference and load the weights into memory
llm.compile(
    generation_config,
    max_requests_per_batch=16,
    max_seq_length=256,
    max_tokens_per_batch=128
)
# Generation begins!
llm.start_server()
result = llm.generate("Here are some travel tips for Tokyo:\n")
# result = llm.generate("Give three tips for staying healthy.")
llm.stop_server()  # This invocation is optional
When testing with the prompt "Here are some travel tips for Tokyo:\n", both runs produced the same result. However, with the prompt "Give three tips for staying healthy.", the results differed.
The result for "Incremental decoding" was:
Final output: <s> Give three tips for staying healthy.
Avoid alcohol, cigarettes, and drugs.
Drink at least 8 glasses of water a day.
Exercise for at least 30 minutes a day.
Name three things that you can do to keep your heart healthy.
Name three things that you can do to keep your brain healthy.
Name three things that you can do to keep your lungs healthy.
Name three things that you can do to keep your kidneys healthy.
Name three things that you can do to keep your li
The result for "Speculative decoding" was:
Final output: <s> Give three tips for staying healthy.
Give three tips for staying healthy. Give three tips for staying healthy. Give three tips for staying healthy. Give three tips for staying healthy. Give three tips for staying healthy. Give three tips for staying healthy. Give three tips for staying healthy. Give three tips for staying healthy. Give three tips for staying healthy. Give three tips for staying healthy. Give three tips for staying healthy. Give three tips for staying healthy. Give three tips for staying healthy
Is this normal?
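One thing worth noting before comparing outputs: the two snippets above do not use identical GenerationConfig settings. A quick diff of the arguments (written as plain Python dicts here, not FlexFlow objects) shows the runs differ in do_sample, so they are not decoding the same way to begin with:

```python
# Sampling configs as written in the two snippets above, as plain dicts.
incremental_cfg = {"do_sample": False, "temperature": 0.9, "topp": 0.8, "topk": 1}
speculative_cfg = {"do_sample": True, "temperature": 0.9, "topp": 0.8, "topk": 1}

# Keys whose values differ between the two runs.
diff = {k for k in incremental_cfg if incremental_cfg[k] != speculative_cfg[k]}
print(diff)  # {'do_sample'}
```

With do_sample=True, repeated runs can legitimately produce different outputs, so matching do_sample across the two tests would make the comparison cleaner.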
Hello, FlexFlow team!
Thank you for your outstanding work! I am attempting to reproduce the experimental results from the paper "SpecInfer: Accelerating Generative Large Language Model Serving with Tree-based Speculative Inference and Verification" on a single H100. However, I encountered some issues and would like to understand how these results compare with the vllm framework. The details are as follows:
Dataset: We used the first ten prompts from alpaca.json, one of the five datasets provided by the team.
Model: LLM: meta-llama/Llama-2-7b-hf SSM: jackfram/llama-68m (As I am unable to access Hugging Face directly, I downloaded the model parameters locally.)
Parameter Settings:
For SpecInfer: max_requests_per_batch = 16, max_seq_length = 256, max_tokens_per_batch = 128, temperature = 0.8, top_p = 0.95
For vllm: temperature = 0.8, top_p = 0.95, max_tokens = 256
Environment Configuration: For SpecInfer, I installed version v24.1.0 from source. For vllm, I used pip install vllm.
During the testing of SpecInfer, I referred to the code in issue #1377. My run_specinfer.py script is as follows:
Command-line execution:
For testing vllm, I referred to the code in issue #995. My run_vllm.py script is as follows:
Command-line execution:
The logs obtained from SpecInfer are in resultOfSpec.txt; the logs obtained from vllm are in resultOfvllm.txt. According to the team's previous issues, the latency (in microseconds) for each prompt represents the computation time. Therefore, I summed the latencies of the ten prompts:
1189722.0 + 1190138.0 + 1318237.0 + 1598564.0 + 1734440.0 + 2855074.0 + 2855302.0 + 3304062.0 + 3902707.0 + 4895604.0 = 24,843,850 microseconds = 24.84385 s
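The total above can be double-checked with a few lines of Python (values copied from the SpecInfer log):

```python
# Per-prompt latencies from the SpecInfer log, in microseconds.
latencies_us = [
    1189722.0, 1190138.0, 1318237.0, 1598564.0, 1734440.0,
    2855074.0, 2855302.0, 3304062.0, 3902707.0, 4895604.0,
]
total_us = sum(latencies_us)
total_s = total_us / 1_000_000  # convert microseconds to seconds
print(total_us, total_s)  # 24843850.0 24.84385
```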
(I also used time.time() in Python to measure the time required for vllm; the result is 3.26208758354187 s.)
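For reference, a time.time() measurement like this wraps the entire batched generate call, roughly as in the sketch below (a dummy workload stands in for the actual vllm call, which is an assumption about how the measurement was taken). Note that end-to-end wall-clock time over a batch is not directly comparable to a sum of per-request latencies, because batched requests overlap in time and their individual latencies double-count the same wall-clock interval.

```python
import time

def time_call(fn):
    """Run fn once and return (result, elapsed wall-clock seconds)."""
    start = time.time()
    result = fn()
    return result, time.time() - start

# Stand-in workload; in the real vllm run this would be something like
# lambda: llm.generate(prompts, sampling_params).
result, elapsed = time_call(lambda: sum(range(10000)))
print(elapsed >= 0.0)  # True
```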
My test results seem unusual. Could you please advise if there are any errors in my testing method? Additionally, any further details on reproducing the paper's results would be greatly appreciated.