Infini-AI-Lab / TriForce

[COLM 2024] TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding
https://infini-ai-lab.github.io/TriForce/

Questions about end2end time cost of the inference request #5

littletomatodonkey closed this issue 5 months ago

littletomatodonkey commented 5 months ago

Hi, thanks for your great work on the LLM decoding process. I tested the code and got the expected decoding speedup for llama2-7B, but the end-to-end time cost does not change much (61 s -> 58 s). I profiled the inference process, and it seems that the prefill stage occupies the vast majority of the inference time. Is this conclusion consistent with your experiments? Thanks!

Method: TriForce
Dataset: gs
Spec Args: {'budget': 4096, 'chunk_size': 8}
Draft: /mnt/bn/multimodel/models/official/llama-68m
Target: /mnt/bn/multimodel/models/official/NousResearch--Yarn-Llama-2-7b-128k/model
Prefill Length: 124928
Generation Length: 256
Gamma: 6
Sampling Method: top_k = -1, top_p = 0.9, temperature = 0.6
Log CSV: None
######################################################################################

[draft run] capturing graph for 0 (probs=True, temp=0.6, top_p=0.9)...
[draft run] capturing graph for 1 (probs=True, temp=0.6, top_p=0.9)...
[draft run] capturing graph for 2 (probs=True, temp=0.6, top_p=0.9)...
[draft run] capturing graph for 3 (probs=True, temp=0.6, top_p=0.9)...
[draft run] capturing graph for 4 (probs=True, temp=0.6, top_p=0.9)...
[draft run] capturing graph for 5 (probs=True, temp=0.6, top_p=0.9)...
[draft run] capturing graph for 6 (probs=True, temp=0.6, top_p=0.9)...
[draft run] capturing graph for 7 (probs=True, temp=0.6, top_p=0.9)...
[draft run] capturing graph for 8 (probs=True, temp=0.6, top_p=0.9)...
[model verify] capturing graph for spec len 6 (probs=True, temp=0.6, top_p=0.9)...
[Full Cache] Cached: 0 | Budget: 125200
[Retrieval Cache] Budget: 4096  | PreFill: 124928  | Chunk Size: 8  | Chunks: 15616  | Select Sets: 512
[StreamingLLM Cache] Start Size: 16 | Recent Size: 234 | Gamma: 6 | Real Budget: 259 | Cached: 0
tokenized_prompts length: 20
Autoregressive Warmup: 100%|████████████████████████████████████████████████| 1/1 [01:01<00:00, 61.31s/it]
Autoregressive Test: 100%|██████████████████████████████████████████████████| 1/1 [01:01<00:00, 61.73s/it]
[Autoregressive] average latency: 51.494828425347805 ms
TriForce Warmup: 100%|██████████████████████████████████████████████████████| 3/3 [02:53<00:00, 57.91s/it]
TriForce Test: 100%|██████████████████████████████████████████████████████| 20/20 [19:33<00:00, 58.66s/it]
average acceptance rate (NOT per token): 0.7204096470358102
[TriForce] average latency: 24.157854936546297 ms
[E2E Speedup]: 2.1315977167925535
preminstrel commented 5 months ago

Thank you for your interest in our research!

To clarify, our algorithm is designed for decoding acceleration, not pre-filling. For the pre-fill phase, we use a straightforward iterative method to avoid out-of-memory errors. Thus, our focus is on optimizing the decoding (generation) stage rather than encoding (prompting).

There are numerous approaches to time-efficient and memory-efficient pre-filling, and they are orthogonal to our work. You may consider combining them with TriForce for a faster pre-fill phase. If you have an efficient pre-filling method, there is no need for the iterative pre-filling we use. Alternatively, you can adjust the iteration settings in our code to manage your GPU's HBM efficiently.
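For readers unfamiliar with the iterative (chunked) pre-fill idea mentioned above, here is a minimal sketch of how it can look. This is not the TriForce implementation; it assumes a Hugging Face-style causal LM that accepts `past_key_values`, and the chunk size is arbitrary. The point is that feeding the long prompt through the model in fixed-size pieces keeps peak activation memory bounded, at the cost of a slower pre-fill.

```python
import torch

@torch.no_grad()
def chunked_prefill(model, input_ids, chunk_size=4096):
    """Pre-fill a very long prompt in chunks to bound peak HBM usage.

    Illustrative sketch only (not the TriForce code): assumes a
    Hugging Face-style causal LM that takes `past_key_values` and
    returns updated ones when `use_cache=True`.
    """
    past_key_values = None
    for start in range(0, input_ids.shape[1], chunk_size):
        chunk = input_ids[:, start:start + chunk_size]
        out = model(input_ids=chunk,
                    past_key_values=past_key_values,
                    use_cache=True)
        past_key_values = out.past_key_values
    # The last chunk's final logits give the distribution for the first
    # generated token; the accumulated KV cache is reused for decoding.
    return out.logits[:, -1, :], past_key_values
```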

And according to your log, it seems you got a 2.13x decoding speedup, which is within expectations.
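For intuition, a rough back-of-envelope estimate (using the per-token latencies from the log above; the prefill time is inferred from the ~61 s autoregressive iterations, so treat it as an assumption) shows why a ~2.1x decoding speedup only shaves a few seconds off the end-to-end time when the prompt is ~125K tokens and only 256 tokens are generated:

```python
# Back-of-envelope: why a ~2.1x decoding speedup barely moves the
# end-to-end time when prefill dominates. Numbers are approximate,
# taken or inferred from the log above.

gen_len = 256                  # generated tokens
ar_latency = 51.49e-3          # autoregressive decoding latency (s/token)
tf_latency = 24.16e-3          # TriForce decoding latency (s/token)
ar_total = 61.7                # measured end-to-end time per request (s)

decode_ar = gen_len * ar_latency        # ~13.2 s spent decoding
prefill = ar_total - decode_ar          # ~48.5 s spent on prefill (estimate)

decode_tf = gen_len * tf_latency        # ~6.2 s decoding with TriForce
tf_total = prefill + decode_tf          # ~54.7 s estimated end-to-end

print(f"decoding speedup  : {ar_latency / tf_latency:.2f}x")
print(f"end-to-end speedup: {ar_total / tf_total:.2f}x (prefill unchanged)")
```

With prefill unchanged, the estimated end-to-end gain is only on the order of 1.1x, which is consistent with the 61 s -> 58 s observation; a longer generation length would let the decoding speedup show up more strongly end-to-end.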

littletomatodonkey commented 5 months ago

Got it, thanks for your reply!