littletomatodonkey closed this issue 5 months ago.
Thank you for your interest in our research!
To clarify, our algorithm is designed to accelerate decoding, not pre-filling. For the pre-fill phase, we use a straightforward iterative method to avoid out-of-memory errors. Our focus is therefore on optimizing the decoding (generation) stage rather than the encoding (prompting) stage.
There are numerous time- and memory-efficient pre-filling approaches that are orthogonal to our work; you may consider combining them with ours for a faster pre-fill phase. If you adopt an efficient pre-filling method, there is no need for the iterative pre-filling we use. Alternatively, you can adjust the iteration settings in our code to stay within your GPU's HBM budget.
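For reference, here is a minimal sketch of what iterative (chunked) pre-filling can look like with Hugging Face `transformers`. The model name, chunk size, and prompt are placeholders, and this is not the exact implementation in our repo, just one way to bound peak HBM usage during the prompt phase:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model and prompt; substitute your own checkpoint and input.
model_name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16).to("cuda")

prompt_ids = tokenizer("a very long prompt ...", return_tensors="pt").input_ids.to("cuda")

chunk_size = 2048  # tune this so each chunk's activations fit in your GPU's HBM
past_key_values = None
with torch.no_grad():
    for start in range(0, prompt_ids.shape[1], chunk_size):
        chunk = prompt_ids[:, start:start + chunk_size]
        # Feed one chunk at a time, carrying the KV cache forward between chunks.
        out = model(chunk, past_key_values=past_key_values, use_cache=True)
        past_key_values = out.past_key_values

# past_key_values now holds the KV cache for the full prompt;
# token-by-token decoding can proceed from here.
```

Peak memory is then governed by `chunk_size` rather than the full prompt length, at the cost of running the pre-fill in several passes.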
Also, according to your log it seems you got a 2.13x decoding speedup. I think that is within expectation.
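For intuition, a rough back-of-the-envelope estimate shows why a 2.13x decoding speedup barely moves the end-to-end number when pre-fill dominates. The prefill/decoding split below is an assumed illustration, not a measurement; only the 2.13x figure and the ~61s/58s totals come from this thread:

```python
# Assumed breakdown of the reported ~61s end-to-end time (hypothetical split).
prefill_time_s = 55.0    # assumed time spent in the pre-fill (prompt) phase
decode_time_s = 6.0      # assumed time spent in the decoding (generation) phase
decode_speedup = 2.13    # decoding-only speedup reported in the log

end_to_end_after = prefill_time_s + decode_time_s / decode_speedup
print(f"estimated end-to-end time after speedup: {end_to_end_after:.1f}s")
# ~57.8s, close to the observed 58s
```

Under such a split, even an infinitely fast decoder could only bring the total down to roughly the pre-fill time, which is why end-to-end gains look small despite a solid decoding speedup.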
Got it, thanks for your reply!
Hi, thanks for your great work on the LLM decoding process. I tested the code and got the expected decoding speedup for llama2-7B, but the end-to-end time does not change much (61s -> 58s). I profiled the inference process, and it seems that the pre-fill stage occupies the vast majority of the inference time. Is this consistent with your experiments? Thanks!