flexflow / FlexFlow

FlexFlow Serve: Low-Latency, High-Performance LLM Serving
https://flexflow.readthedocs.io

Questions about the measurement of the latency #1454

Open QAZWSX0827 opened 1 month ago

QAZWSX0827 commented 1 month ago

Hello, FlexFlow team!

Thank you for your outstanding work! I am attempting to reproduce the experimental results from the paper "SpecInfer: Accelerating Generative Large Language Model Serving with Tree-based Speculative Inference and Verification" on a single H100. However, I encountered some issues and would like to understand how the results compare with the vLLM framework. The details are as follows:

Dataset: We used the first ten prompts from alpaca.json, one of the five prompt datasets provided by the team (a small preparation sketch follows below).

Models: LLM: meta-llama/Llama-2-7b-hf; SSM: JackFram/llama-68m. (As I am unable to access Hugging Face directly, I downloaded the model weights locally.)

Parameter Settings:
For SpecInfer: max_requests_per_batch = 16, max_seq_length = 256, max_tokens_per_batch = 128, temperature = 0.8, top_p = 0.95.
For vLLM: temperature = 0.8, top_p = 0.95, max_tokens = 256.

Environment Configuration: For SpecInfer, I installed version v24.1.0 from source. For vLLM, I used pip install vllm.
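For reference, a minimal sketch of preparing the ten-prompt subset could look like this, assuming alpaca.json is a JSON list of prompt strings (the format expected by llm.generate(prompts=...)); the file names are only illustrative:

import json

# Load the full Alpaca prompt list and keep only the first ten prompts
with open('alpaca.json', 'r') as f:
    all_prompts = json.load(f)

subset = all_prompts[:10]

# Write the subset to the file passed to both SpecInfer and vLLM
with open('test.json', 'w') as f:
    json.dump(subset, f, indent=2)

print(f'Wrote {len(subset)} prompts to test.json')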

During the testing of SpecInfer, I referred to the code in issue #1377. My run_specinfer.py script is as follows:

import flexflow.serve as ff
import argparse
import json
import os

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--num_gpus', default=1, type=int)
    parser.add_argument('--memory_per_gpu', default=60000, type=int)
    parser.add_argument('--zero_copy_memory_per_node', default=60000, type=int)
    parser.add_argument('--tensor_parallelism_degree', default=1, type=int)
    parser.add_argument('--pipeline_parallelism_degree', default=1, type=int)
    parser.add_argument('--llm', default='facebook/opt-125m', type=str)
    parser.add_argument('--ssm', default='facebook/opt-125m', type=str)
    parser.add_argument('--prompts_file', default='prompts/Alpaca.json', type=str)
    parser.add_argument('--max_requests_per_batch', default=16, type=int)
    parser.add_argument('--max_seq_length', default=256, type=int)
    parser.add_argument('--max_tokens_per_batch', default=128, type=int)
    args = parser.parse_args()

    os.environ['TRANSFORMERS_OFFLINE'] = '1'

    ff.init(num_gpus=args.num_gpus,
            memory_per_gpu=args.memory_per_gpu,
            zero_copy_memory_per_node=args.zero_copy_memory_per_node,
            tensor_parallelism_degree=args.tensor_parallelism_degree,
            pipeline_parallelism_degree=args.pipeline_parallelism_degree
        )

    #pdb.set_trace()

    # Specify the LLM
    llm = ff.LLM(args.llm)

    # Specify a list of SSMs (just one in this case)
    ssms=[]
    if args.ssm != '':
        ssm_names = args.ssm.split(',')
        for ssm_name in ssm_names:
            ssm = ff.SSM(ssm_name)
            ssms.append(ssm)

    # Create the sampling configs
    generation_config = ff.GenerationConfig(
        do_sample=False, temperature=0.8, topp=0.95, topk=1
    )

    # Compile the SSMs for inference and load the weights into memory
    for ssm in ssms:
        ssm.compile(generation_config,
                    max_requests_per_batch=args.max_requests_per_batch,
                    max_seq_length=args.max_seq_length,
                    max_tokens_per_batch=args.max_tokens_per_batch)

    # Compile the LLM for inference and load the weights into memory
    llm.compile(generation_config, 
                ssms=ssms,
                max_requests_per_batch=args.max_requests_per_batch,
                max_seq_length=args.max_seq_length,
                max_tokens_per_batch=args.max_tokens_per_batch
               )

    # load prompts
    with open(args.prompts_file, 'r') as f:
        prompts = json.load(f)

    llm.start_server()
    result = llm.generate(prompts=prompts)
    llm.stop_server()  # This invocation is optional

Command-line execution:

python3 run_specinfer.py --num_gpus 1 --memory_per_gpu 60000 --zero_copy_memory_per_node 60000 \
--tensor_parallelism_degree 1 --pipeline_parallelism_degree 1 --max_requests_per_batch 16 \
--max_seq_length 256 --max_tokens_per_batch 128 --llm ~/meta-llama/Llama-2-7b-hf \
--ssm ~/JackFram/llama-68m --prompts_file ~/prompts/test.json > resultOfspec.txt

For testing vLLM, I referred to the code in issue #995. My run_vllm.py script is as follows:

from vllm import LLM, SamplingParams
import time
import json

# prompts = ["What's the best way to cook an egg?\n"] * 10
prompts_file = "/home/wutong/prompts/test.json"

with open(prompts_file, 'r') as f:
    prompts = json.load(f)
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=256)

# llm = LLM(model="decapoda-research/llama-7b-hf", tokenizer='hf-internal-testing/llama-tokenizer')
llm = LLM(model="/home/wutong/meta-llama/Llama-2-7b-hf/")
start_time = time.time()
outputs = llm.generate(prompts, sampling_params)
print("--- %s seconds ---" % (time.time() - start_time))

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

Command-line execution:

python3 run_vllm.py > resultOfvllm.txt

The logs obtained from SpecInfer are attached as resultOfSpec.txt, and the logs obtained from vLLM are attached as resultOfvllm.txt.

According to the team's previous issues, the latency (in microseconds) reported for each prompt represents its computation time. I therefore summed the latencies of the ten prompts:

1189722.0 + 1190138.0 + 1318237.0 + 1598564.0 + 1734440.0 + 2855074.0 + 2855302.0 + 3304062.0 + 3902707.0 + 4895604.0 = 24,843,850 microseconds = 24.84385 s

(I also used time.time() in Python to measure the time required for vLLM, and the result is 3.26208758354187 s.)
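For a like-for-like comparison, a minimal sketch of measuring wall-clock time around llm.generate() for SpecInfer, mirroring the vLLM measurement, would look like this (assuming the same llm and prompts objects as in run_specinfer.py above):

import time

# Wall-clock timing around the batched generate() call, mirroring the vLLM measurement.
# Assumes `llm` and `prompts` are set up exactly as in run_specinfer.py above.
llm.start_server()

start_time = time.time()
result = llm.generate(prompts=prompts)
print('--- %s seconds (wall clock, %d prompts) ---' % (time.time() - start_time, len(prompts)))

llm.stop_server()  # This invocation is optional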

My test results seem unusual. Could you please advise if there are any errors in my testing method? Additionally, any further details on reproducing the paper's results would be greatly appreciated.

QAZWSX0827 commented 1 month ago

Hi, is there an answer to the question?

QAZWSX0827 commented 3 weeks ago

Additionally, I tested the difference between Incremental decoding and Speculative decoding:

For Incremental decoding, I used the following code:

import flexflow.serve as ff

ff.init(
        num_gpus=1,
        memory_per_gpu=56000,
        zero_copy_memory_per_node=120000,
        tensor_parallelism_degree=1,
        pipeline_parallelism_degree=1
    )

# Specify the LLM
# llm = ff.LLM("meta-llama/Llama-2-7b-hf")
llm = ff.LLM("/public/home/wutong/meta-llama/Llama-2-7b-hf")
# Specify a list of SSMs (just one in this case)
ssms=[]
# ssm = ff.SSM("JackFram/llama-68m")
ssm = ff.SSM("/public/home/wutong/JackFram/llama-68m")
ssms.append(ssm)

# Create the sampling configs
generation_config = ff.GenerationConfig(
    do_sample=False, temperature=0.9, topp=0.8, topk=1
)

# Compile the SSMs for inference and load the weights into memory
for ssm in ssms:
    ssm.compile(generation_config)

# Compile the LLM for inference and load the weights into memory
llm.compile(generation_config,
            max_requests_per_batch = 16,
            max_seq_length = 256,
            max_tokens_per_batch = 128,
            ssms=ssms)

llm.start_server()
result = llm.generate("Here are some travel tips for Tokyo:\n")
# result = llm.generate("Give three tips for staying healthy.")
llm.stop_server() # This invocation is optional

For Speculative decoding, I used the following code:

import flexflow.serve as ff

# Initialize the FlexFlow runtime. ff.init() takes a dictionary or the path to a JSON file with the configs
ff.init(
        num_gpus=1,
        memory_per_gpu=56000,
        zero_copy_memory_per_node=120000,
        tensor_parallelism_degree=1,
        pipeline_parallelism_degree=1
    )

# Create the FlexFlow LLM
# llm = ff.LLM("meta-llama/Llama-2-7b-hf")
llm = ff.LLM("/public/home/wutong/meta-llama/Llama-2-7b-hf")
# Create the sampling configs
generation_config = ff.GenerationConfig(
    do_sample=True, temperature=0.9, topp=0.8, topk=1
)

# Compile the LLM for inference and load the weights into memory
llm.compile(generation_config,
            max_requests_per_batch = 16,
            max_seq_length = 256,
            max_tokens_per_batch = 128)

# Generation begins!
llm.start_server()
result = llm.generate("Here are some travel tips for Tokyo:\n")
# result = llm.generate("Give three tips for staying healthy.")
llm.stop_server() # This invocation is optional
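Note that the two scripts above also differ in their GenerationConfig (do_sample=False versus do_sample=True); a minimal sketch of a controlled comparison would reuse one config for both runs (the shared_config name is just a placeholder, and the model paths are the same local ones as above):

import flexflow.serve as ff

# Hypothetical shared sampling config: pass this same object to ssm.compile(...)
# and llm.compile(...) in both scripts above, so any output difference comes from
# the decoding mode rather than from different sampling settings.
shared_config = ff.GenerationConfig(
    do_sample=False, temperature=0.9, topp=0.8, topk=1
)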

When testing with the prompt "Here are some travel tips for Tokyo:\n", both runs produced the same output. However, when testing with the prompt "Give three tips for staying healthy.", they produced different outputs.

The result for "Incremental decoding" was:

Final output: <s> Give three tips for staying healthy.
Avoid alcohol, cigarettes, and drugs.
Drink at least 8 glasses of water a day.
Exercise for at least 30 minutes a day.
Name three things that you can do to keep your heart healthy.
Name three things that you can do to keep your brain healthy.
Name three things that you can do to keep your lungs healthy.
Name three things that you can do to keep your kidneys healthy.
Name three things that you can do to keep your li

The result for "Speculative decoding" was:

Final output: <s> Give three tips for staying healthy.
Give three tips for staying healthy. Give three tips for staying healthy. Give three tips for staying healthy. Give three tips for staying healthy. Give three tips for staying healthy. Give three tips for staying healthy. Give three tips for staying healthy. Give three tips for staying healthy. Give three tips for staying healthy. Give three tips for staying healthy. Give three tips for staying healthy. Give three tips for staying healthy. Give three tips for staying healthy

Is this normal?