ChuanhongLi opened this issue 1 year ago
Another question: In Section 6.2 (Distributed LLM Inference) of the paper, it says: "To rule out potential effects of our runtime implementation, we also evaluate SpecInfer using incremental decoding, which is achieved by sending an empty token tree to the verifier, so the verifier verifies exactly one token in each decoding step."
```cpp
GenerationResult RequestManager::generate_spec_infer(FFModel *llm,
                                                     std::string const &text,
                                                     int max_seq_length) {
  ....
  // Token Tree Verification
  {
    TreeVerifyBatchConfigFuture tree_bcf =
        prepare_next_batch_verify(beam_bcf_vec);
    // TreeVerifyBatchConfigFuture tree_bcf;
    FutureMap fm = im->inference(llm, 0, tree_bcf); // just an empty tree_bcf?
    assert(fm.get_future_map_domain().get_volume() == 1);
    InferenceResultFuture tree_irf = fm.get_future(0);
    batch_pipeline.push(std::make_pair(tree_bcf, tree_irf));
    last_tree_bcf = tree_bcf;
    last_tree_irf = tree_irf;
  }
  ...
}
```
I am not sure how to construct the empty token tree. I tried replacing `prepare_next_batch_verify(beam_bcf_vec)` with a default-constructed `TreeVerifyBatchConfigFuture` (the commented-out line above, so that `im->inference(llm, 0, tree_bcf)` receives an empty tree), but I get a core dump (Segmentation fault).
Looking forward to your help! Thanks!
Hi, is there an answer to the question?
@ChuanhongLi You can use the `generate_incr_decoding` function to generate tokens when no SSMs are given (https://github.com/flexflow/FlexFlow/blob/inference/src/runtime/request_manager.cc#L1601).
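For context, here is a minimal sketch of what that looks like from the Python side, based on the flexflow.serve examples on the inference branch; the exact `ff.init` arguments, model name, and values below are assumptions and may differ across versions. When no SSM is compiled in, the RequestManager falls back to `generate_incr_decoding`.

```python
# Sketch only, assuming the flexflow.serve API from the inference-branch examples;
# the configuration values and model name are placeholders, not a verified setup.
import flexflow.serve as ff

ff.init(num_gpus=1, memory_per_gpu=20000, zero_copy_memory_per_node=30000,
        tensor_parallelism_degree=1, pipeline_parallelism_degree=1)

llm = ff.LLM("decapoda-research/llama-7b-hf")
generation_config = ff.GenerationConfig(do_sample=False)

# Compiling without ssms=[...] registers no small speculative model, so
# generation runs as plain incremental decoding rather than SpecInfer.
llm.compile(generation_config)

result = llm.generate(["Give three tips for staying healthy."])
```

Passing `ssms=[ssm]` to `llm.compile` (with an `ff.SSM(...)` compiled first) should switch the same script to speculative inference.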
@jiazhihao When ssms = [], it calls the `generate_incr_decoding` function, as shown in the log in without small model.txt ("No small speculative model registered, using incremental decoding"). Then the results are confusing (my first question): 1) LLaMA-7B + one SSM (llama-160M): latency 25.1 s; 2) LLaMA-7B (without SSMs): latency 24.8 s.
@ChuanhongLi Can you also share the log when an SSM is enabled?
Hi, this is the log with an SSM: with ssms.txt
@ChuanhongLi The log prints the latency of each request in the following format:

```
{3}{RequestManager}: [Profile] guid(1000005) decoding_steps(21) start(204168049.0) finish(206250065.0) latency(2082016.0)
```

where `start` and `finish` are the start and completion time of the request, and `latency` shows the end-to-end latency of a single request (all in microseconds).

Meanwhile, for 7B LLAMA models, we observe that using LLAMA-68M as the SSM can give you better performance.
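As a quick arithmetic check on the example line above: latency is simply finish minus start, and dividing by 10^6 converts microseconds to seconds. The values below are copied from that line.

```python
# Values copied from the example [Profile] line above (all in microseconds).
start_us = 204168049.0
finish_us = 206250065.0
latency_us = 2082016.0

assert finish_us - start_us == latency_us      # latency = finish - start
print("latency: %.2f s" % (latency_us / 1e6))  # ~2.08 s for this request
```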
We measure the total time like this:

```python
import time

# prompts: the list of input prompts; llm: the compiled FlexFlow LLM
start_time = time.time()
result = llm.generate(prompts)
print("--- %s seconds ---" % (time.time() - start_time))
```
At the end of the log, we print the total time to process all of the prompts (--- 25.1096248626709 seconds ---). Besides, we also subtract the start time of the first request from the finish time of the last request, which gives a similar result.
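For anyone repeating that cross-check, a small sketch that computes the same end-to-end span from the [Profile] lines is shown below; the log format comes from the reply above, while the file name profile.log and the parsing code itself are only illustrative.

```python
import re

# Collect (start, finish) pairs, in microseconds, from every [Profile] line.
spans = []
with open("profile.log") as f:  # hypothetical file holding the printed log
    for line in f:
        m = re.search(r"start\(([\d.]+)\) finish\(([\d.]+)\)", line)
        if m:
            spans.append((float(m.group(1)), float(m.group(2))))

# End-to-end time = finish of the last request minus start of the first request.
first_start = min(s for s, _ in spans)
last_finish = max(f for _, f in spans)
print("total: %.2f s" % ((last_finish - first_start) / 1e6))
```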
Thanks for your reply. We will give LLAMA-68M a try as the SSM.
@ChuanhongLi Hello, I ran into the same phenomenon as you. Were you able to obtain the expected experimental results later?
Not yet. I have been busy with something else.
All right, thank you.
It is still slower for me even after using LLaMA-68M as the SSM.
I didn't reproduce the result that LLaMA-68M as the SSM gives better performance; can you elaborate on how it was tested? Thanks!
Hello, have you successfully reproduced the results of SpecInfer?
Hi FlexFlow team, I used the methods mentioned in #1099 to test the latency (GPU: RTX 4090), but I get a confusing result: 1) LLaMA-7B + one SSM (llama-160M): latency 25.1 s; 2) LLaMA-7B (without SSMs): latency 24.8 s. Is that right? Does it work better without SSMs?
The code:
And the logs are as follows: with ssms.txt, without small model.txt
By the way, how can I reproduce the results of the paper, SpecInfer: Accelerating Generative Large Language Model Serving with Speculative Inference and Token Tree Verification? Is using the above code OK, or is something else needed?
Thanks!