PrincetonUniversity / LLMCompass

BSD 3-Clause "New" or "Revised" License

Do we need to consider the latency of the memory model? #2

Open shirohasuki opened 2 months ago

shirohasuki commented 2 months ago

Hi, very AMAZING work! I like this. Thank you for your contributions.

There is one thing puzzling me. I see that both computation and io_module have latency in the code, but I can't find the latency of the memory model, for example when we fetch data from the local buffer (SRAM) or the global buffer (HBM). I mean the access latency of the memory itself, not the latency on the link.

If this is already covered in your paper, I apologize for missing it.

Thanks!

HenryChang213 commented 2 weeks ago

In most cases the global buffer is actually the L2 cache. For these buffers, LLMCompass assumes that only the latency of the first memory access is exposed. The latency of subsequent accesses is well hidden behind the data already in flight, so what matters is bandwidth.
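
To make that assumption concrete, here is a minimal sketch of the cost model it implies: one exposed first-access latency plus a bandwidth-bound streaming term. This is not LLMCompass code. The function name `transfer_time_s` and all numbers are hypothetical and chosen only for illustration.

```python
def transfer_time_s(num_bytes, bandwidth_bytes_per_s, first_access_latency_s=0.0):
    """Estimate the time to stream num_bytes from a buffer.

    Hypothetical helper, not part of LLMCompass. It assumes only the
    first access exposes its latency and every later access is fully
    pipelined, so the rest of the transfer is bandwidth-bound.
    """
    if num_bytes < 0 or bandwidth_bytes_per_s <= 0 or first_access_latency_s < 0:
        raise ValueError("sizes and latency must be non-negative, bandwidth positive")
    if num_bytes == 0:
        return 0.0  # nothing to fetch, so no latency is paid
    return first_access_latency_s + num_bytes / bandwidth_bytes_per_s


# Illustrative numbers only (assumed, not measured on any device).
BANDWIDTH = 2e12   # 2 TB/s
LATENCY = 500e-9   # 500 ns exposed on the first access

large = transfer_time_s(64e6, BANDWIDTH, LATENCY)  # 64 MB tile
small = transfer_time_s(1e3, BANDWIDTH, LATENCY)   # 1 KB read

# Share of the total time that is the exposed first-access latency.
large_latency_share = LATENCY / large
small_latency_share = LATENCY / small
```

With these assumed numbers the latency term is under 2% of the total for the 64 MB transfer, but almost all of the total for the 1 KB read. Dropping the latency term is therefore a reasonable simplification for the large tiles typical of LLM workloads, and a poor one for many small, dependent accesses.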