Infini-AI-Lab / MagicDec

Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding
Apache License 2.0

KV Loading Time #1

Open wutong4012 opened 4 weeks ago

wutong4012 commented 4 weeks ago

Thank you very much for your interesting work and contribution to the open source community.

I would like to ask how the KV loading time is calculated in your paper, and how it is strictly separated from the other components.

jianc99 commented 4 weeks ago

Hi! Thank you for your attention to our work.

We estimated the time cost of each component during LLM inference based on LLM-Viewer, with some modifications to improve the accuracy of our estimates. We apologize for not citing this in the paper and will update the reference accordingly.
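
For a rough picture of what such an estimate looks like, here is a minimal roofline-style sketch in the spirit of LLM-Viewer. It is not the exact code or the hardware numbers used in the paper; the bandwidth/FLOPS figures and the Llama-7B-like config below are placeholder assumptions.

```python
"""Roofline-style per-component time estimate for one autoregressive decode step.

Illustrative sketch only: all hardware and model numbers are assumptions,
not values from the MagicDec paper or LLM-Viewer's database.
"""

# --- assumed hardware (placeholders) ---
MEM_BW = 2.0e12          # bytes/s, HBM bandwidth
PEAK_FLOPS = 500e12      # FLOP/s, peak half-precision throughput

# --- assumed model config (Llama-2-7B-like, placeholders) ---
n_layers = 32
d_model = 4096
n_heads = 32
n_kv_heads = 32
d_head = d_model // n_heads
d_ff = 11008
vocab = 32000
bytes_el = 2             # fp16

def decode_step_times(batch, ctx_len):
    """Return per-component times (seconds) for one decode step."""
    # KV-cache load: read K and V for every past token, in every layer.
    kv_bytes = 2 * n_layers * batch * ctx_len * n_kv_heads * d_head * bytes_el
    t_kv_load = kv_bytes / MEM_BW

    # Parameter load: every weight matrix is streamed from HBM once per step.
    params = n_layers * (4 * d_model * d_model + 3 * d_model * d_ff) + vocab * d_model
    t_param_load = params * bytes_el / MEM_BW

    # Activation load/store: rough estimate, proportional to per-layer hidden states.
    act_bytes = 2 * n_layers * batch * d_model * bytes_el * 8   # crude factor of 8
    t_act_io = act_bytes / MEM_BW

    # Compute: GEMM FLOPs for projections/MLP plus attention score/value FLOPs.
    gemm_flops = 2 * batch * params
    attn_flops = 2 * 2 * n_layers * batch * ctx_len * n_heads * d_head
    t_compute = (gemm_flops + attn_flops) / PEAK_FLOPS

    return dict(kv_load=t_kv_load, param_load=t_param_load,
                activation_io=t_act_io, compute=t_compute)

if __name__ == "__main__":
    for ctx in (4096, 32768, 131072):
        print(ctx, decode_step_times(batch=1, ctx_len=ctx))
```

In a model like this, the KV-cache load term grows linearly with context length while the other terms stay roughly constant, which is the long-sequence regime the paper targets.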

wutong4012 commented 3 weeks ago

Thanks for the reference.

Please forgive my ignorance, but I could not find the per-component time calculation in LLM-Viewer.

If it's convenient, can you provide more specific details? For example:

  1. Is the `time` package used to measure the time?
  2. Does the KV cache loading time refer only to the cache update in attention, or does it also include the FlashAttention computation (see the sketch after this list)?
  3. How are the times for parameter load, activation load/store, and compute calculated?
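
To make question 2 concrete, here is a hypothetical timing harness that separates the two operations I am asking about. The shapes, dtype, and the use of PyTorch's `scaled_dot_product_attention` in place of FlashAttention are my own assumptions; it requires a CUDA GPU.

```python
# Hypothetical harness: time the KV-cache update separately from the attention
# kernel that reads the whole cache. Not code from MagicDec; all names/shapes
# below are made up for illustration.
import torch
import torch.nn.functional as F

device = "cuda"
B, H, S, D = 1, 32, 32768, 128   # batch, heads, past context length, head dim
k_cache = torch.zeros(B, H, S + 1, D, dtype=torch.float16, device=device)
v_cache = torch.zeros_like(k_cache)
k_cache[:, :, :S].normal_()
v_cache[:, :, :S].normal_()

q = torch.randn(B, H, 1, D, dtype=torch.float16, device=device)
k_new = torch.randn(B, H, 1, D, dtype=torch.float16, device=device)
v_new = torch.randn(B, H, 1, D, dtype=torch.float16, device=device)

def timed(fn):
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record(); fn(); end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end)   # milliseconds

def update_cache():
    # (a) cache update only: write the newest token's K/V into the preallocated cache
    k_cache[:, :, S:S + 1].copy_(k_new)
    v_cache[:, :, S:S + 1].copy_(v_new)

t_update = timed(update_cache)
# (b) attention kernel only: streams the entire cache, so it is dominated by KV loading
t_attn = timed(lambda: F.scaled_dot_product_attention(q, k_cache, v_cache))

print(f"cache update: {t_update:.3f} ms, attention over cache: {t_attn:.3f} ms")
```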

Thank you again for your open source contribution.