flexflow / FlexFlow

FlexFlow Serve: Low-Latency, High-Performance LLM Serving
https://flexflow.readthedocs.io
Apache License 2.0

How are the benchmarks measured? #995

Open eugenepentland opened 1 year ago

eugenepentland commented 1 year ago

I am attempting to use FlexFlow to compare its inference speed to vLLM's, but FlexFlow appears to be an order of magnitude slower, and I've been running into many errors. I am testing on a Linux server with 2x RTX 3090s.

Is there any documentation on how the benchmarks were measured? I am trying to reproduce your tests and get a clear picture of the performance, but I've been running into lots of issues. I am running the Python example from your docs with LLaMA-7B for inference.

jiazhihao commented 1 year ago

We used the prompts from our prompt dataset (https://github.com/flexflow/FlexFlow#prompt-datasets) and measured the per-token latency (i.e., the end-to-end inference latency of a request divided by the number of generated tokens). The FlexFlow RequestManager prints the latency of each request (note that we exclude the queuing latency of each request).
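
To make the metric explicit, here is a minimal sketch of the per-token latency computation described above (the names are illustrative, not part of the FlexFlow API):

# Per-token latency as defined above: end-to-end latency of one request
# divided by the number of tokens it generated (queuing time excluded).
def per_token_latency(end_to_end_latency_s: float, generated_tokens: int) -> float:
    return end_to_end_latency_s / generated_tokens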

We haven't evaluated FlexFlow's performance on 3090s since we don't have access to them, but we would be happy to help debug FlexFlow's performance and fix any issues. Can you share the performance results you get on 3090s and any issues you ran into?

eugenepentland commented 1 year ago

Here are the two tests that I am running. vLLM completes in 5.8 s, while FlexFlow takes 26 s. I am testing on a single GPU for simplicity's sake right now. Do you have discord? I am part of the Open-Orca group, and we are trying to evaluate whether FlexFlow could be a good inference engine for us.

FlexFlow Code:

import flexflow.serve as ff
import datasets
import time

# Initialize the FlexFlow runtime (memory sizes are in MB)
ff.init(
    num_gpus=1,
    memory_per_gpu=22048,
    zero_copy_memory_per_node=20000,
    tensor_parallelism_degree=1,
    pipeline_parallelism_degree=1,
)

# Specify the LLM
llm = ff.LLM("decapoda-research/llama-7b-hf")

# Specify a list of SSMs (just one in this case)
ssms=[]
ssm = ff.SSM("JackFram/llama-160m")
ssms.append(ssm)

# Create the sampling configs
generation_config = ff.GenerationConfig(
    do_sample=False, temperature=0.9, topp=0.8, topk=1
)

# Compile the SSMs for inference and load the weights into memory
for ssm in ssms:
    ssm.compile(generation_config)

# Compile the LLM for inference and load the weights into memory
compiled = llm.compile(generation_config, ssms=ssms)

prompts = ["What's the best way to cook an egg?\n"] * 10
start_time = time.time()
result = llm.generate(prompts)
print("--- %s seconds ---" % (time.time() - start_time))

vLLM Code:

from vllm import LLM, SamplingParams
import time

prompts = ["What's the best way to cook an egg?\n"] * 10
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=256)

llm = LLM(model="decapoda-research/llama-7b-hf", tokenizer='hf-internal-testing/llama-tokenizer')
start_time = time.time()
outputs = llm.generate(prompts, sampling_params)
print("--- %s seconds ---" % (time.time() - start_time))

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

jiazhihao commented 1 year ago

Okay, an issue I have seen is that FlexFlow uses max_num_requests = 1 by default, so the ten prompts were executed sequentially. You can control the maximum number of ongoing requests by passing max_batch_size as an input argument to llm.compile: https://github.com/flexflow/FlexFlow/blob/inference/python/flexflow/serve/serve.py#L254. Note that we are aware of an issue in the RequestManager that will cause some slowdown when handling more than one ongoing request and are actively working on it in #978.
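
For example, here is a minimal sketch of that change applied to the script above; the max_batch_size keyword is the one referenced in the linked serve.py, the value 10 simply matches the ten prompts in the test, and the exact compile() signature may differ between FlexFlow versions:

# Sketch only: allow up to 10 ongoing requests, as suggested above.
# max_batch_size is the keyword referenced in the linked serve.py; check
# your FlexFlow version for the exact compile() signature.
llm.compile(generation_config, ssms=ssms, max_batch_size=10)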

If you want to check the latency of each request, you should be able to see the following print in the output:

[0 - 7fc9b0859740]  266.048872 {3}{RequestManager}: [Profile] guid(1000000) decoding_steps(118) start(262980704.0) finish(266048864.0) latency(3068160.0) acc_latency(3068160.0)

where the number in the latency(...) field is the end-to-end latency of the request in microseconds.
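
To make the units concrete, here is a small sketch that pulls those numbers out of the profile line above. Dividing latency by decoding_steps gives the per-step latency, which equals the per-token latency for plain incremental decoding (with speculative decoding, steps and generated tokens can differ):

import re

# Parse decoding_steps and latency (in microseconds) from the [Profile] line above.
profile_line = (
    "[0 - 7fc9b0859740]  266.048872 {3}{RequestManager}: [Profile] "
    "guid(1000000) decoding_steps(118) start(262980704.0) "
    "finish(266048864.0) latency(3068160.0) acc_latency(3068160.0)"
)
steps = int(re.search(r"decoding_steps\((\d+)\)", profile_line).group(1))
latency_us = float(re.search(r"\blatency\(([\d.]+)\)", profile_line).group(1))
# 3068160.0 us / 118 steps ~= 26 ms per decoding step
print(f"{latency_us / 1e6:.3f} s end-to-end, {latency_us / steps / 1000:.1f} ms per step")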

Do you have discord?

Sure, I would be happy to get involved and help with the evaluation. My discord username is zhihaojia.

QAZWSX0827 commented 3 months ago

Hello, have you successfully reproduced the results of SpecInfer?