ChuanhongLi opened this issue 1 year ago
Another question: In Section 6.2 (Distributed LLM Inference) of the paper, it says: "To rule out potential effects of our runtime implementation, we also evaluate SpecInfer using incremental decoding, which is achieved by sending an empty token tree to the verifier, so the verifier verifies exactly one token in each decoding step."
```cpp
GenerationResult RequestManager::generate_spec_infer(FFModel *llm,
                                                     std::string const &text,
                                                     int max_seq_length) {
  ....
  // Token Tree Verification
  {
    TreeVerifyBatchConfigFuture tree_bcf =
        prepare_next_batch_verify(beam_bcf_vec);
    // TreeVerifyBatchConfigFuture tree_bcf;
    FutureMap fm = im->inference(llm, 0, tree_bcf); // just an empty tree_bcf?
    assert(fm.get_future_map_domain().get_volume() == 1);
    InferenceResultFuture tree_irf = fm.get_future(0);
    batch_pipeline.push(std::make_pair(tree_bcf, tree_irf));
    last_tree_bcf = tree_bcf;
    last_tree_irf = tree_irf;
  }
  ...
}
```
I am not sure how to construct the empty token tree. I tried replacing `prepare_next_batch_verify(beam_bcf_vec)` with a default-constructed `TreeVerifyBatchConfigFuture` (the commented-out line above, so that `im->inference(llm, 0, tree_bcf)` receives an empty tree), but I get a core dump (Segmentation fault).
Looking forward to your help! Thanks!
Hi, is there an answer to the question?
@ChuanhongLi You can use the `generate_incr_decoding` function to generate tokens when no SSMs are given (https://github.com/flexflow/FlexFlow/blob/inference/src/runtime/request_manager.cc#L1601).
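For context, here is a minimal sketch of what that looks like from the Python side, based on the flexflow.serve examples on the inference branch; the exact `ff.init` arguments, model name, and values below are assumptions and may differ across versions. When no SSM is compiled in, the RequestManager falls back to `generate_incr_decoding`.

```python
# Sketch only, assuming the flexflow.serve API from the inference-branch examples;
# the configuration values and model name are placeholders, not a verified setup.
import flexflow.serve as ff

ff.init(num_gpus=1, memory_per_gpu=20000, zero_copy_memory_per_node=30000,
        tensor_parallelism_degree=1, pipeline_parallelism_degree=1)

llm = ff.LLM("decapoda-research/llama-7b-hf")
generation_config = ff.GenerationConfig(do_sample=False)

# Compiling without ssms=[...] registers no small speculative model, so
# generation runs as plain incremental decoding rather than SpecInfer.
llm.compile(generation_config)

result = llm.generate(["Give three tips for staying healthy."])
```

Passing `ssms=[ssm]` to `llm.compile` (with an `ff.SSM(...)` compiled first) should switch the same script to speculative inference.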
@jiazhihao When ssms = [], it calls the `generate_incr_decoding` function, as shown in the log in without small model.txt ("No small speculative model registered, using incremental decoding"). Then the results are confusing (my first question): 1) LLaMA-7B + one SSM (llama-160M): latency 25.1 s; 2) LLaMA-7B (without SSMs): latency 24.8 s.
@ChuanhongLi Can you also share the log when an SSM is enabled?
Hi, this is the log with an SSM: with ssms.txt
@ChuanhongLi The log prints the latency of each request in the following format:

```
{3}{RequestManager}: [Profile] guid(1000005) decoding_steps(21) start(204168049.0) finish(206250065.0) latency(2082016.0)
```

where `start` and `finish` are the start and completion time of the request, and `latency` shows the end-to-end latency of a single request (all in microseconds).

Meanwhile, for 7B LLAMA models, we observe that using LLAMA-68M as the SSM can give you better performance.
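As a quick arithmetic check on the example line above: latency is simply finish minus start, and dividing by 10^6 converts microseconds to seconds. The values below are copied from that line.

```python
# Values copied from the example [Profile] line above (all in microseconds).
start_us = 204168049.0
finish_us = 206250065.0
latency_us = 2082016.0

assert finish_us - start_us == latency_us      # latency = finish - start
print("latency: %.2f s" % (latency_us / 1e6))  # ~2.08 s for this request
```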
We measure the total time like this:

```python
import time

# prompts: the list of input prompts; llm: the compiled FlexFlow LLM
start_time = time.time()
result = llm.generate(prompts)
print("--- %s seconds ---" % (time.time() - start_time))
```
At the end of the log, we print the total time to process all of the prompts (--- 25.1096248626709 seconds ---). Besides, we also subtract the start time of the first request from the finish time of the last request, which gives a similar result.
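For anyone repeating that cross-check, a small sketch that computes the same end-to-end span from the [Profile] lines is shown below; the log format comes from the reply above, while the file name profile.log and the parsing code itself are only illustrative.

```python
import re

# Collect (start, finish) pairs, in microseconds, from every [Profile] line.
spans = []
with open("profile.log") as f:  # hypothetical file holding the printed log
    for line in f:
        m = re.search(r"start\(([\d.]+)\) finish\(([\d.]+)\)", line)
        if m:
            spans.append((float(m.group(1)), float(m.group(2))))

# End-to-end time = finish of the last request minus start of the first request.
first_start = min(s for s, _ in spans)
last_finish = max(f for _, f in spans)
print("total: %.2f s" % ((last_finish - first_start) / 1e6))
```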
Thanks for your reply. We will give LLAMA-68M a try as the SSM.
@ChuanhongLi Hello, I ran into the same phenomenon as you. Were you able to obtain the expected experimental results later?
Not yet. I have been busy with something else.
All right, thank you.
It is still slower for me even after using LLaMA-68M as the SSM.
I didn't reproduce the result that LLaMA-68M as the SSM gives better performance; can you elaborate on how it was tested? Thanks!
Hello, have you successfully reproduced the results of SpecInfer?
Hi FlexFlow team, I used the methods mentioned in #1099 to test the latency (GPU: RTX 4090), but I get a confusing result: 1) LLaMA-7B + one SSM (llama-160M): latency 25.1 s; 2) LLaMA-7B (without SSMs): latency 24.8 s. Is that right? Does it work better without SSMs?
The code:
And the logs are as follows: with ssms.txt, without small model.txt
By the way, how can I reproduce the results of the paper, SpecInfer: Accelerating Generative Large Language Model Serving with Speculative Inference and Token Tree Verification? Is using the above code OK, or is something else needed?
Thanks!