In your paper, the inference memory consumption is about O(N) and is directly proportional to w. However, I do not understand where 'w' comes from. In fact, I'm wondering why there is an additional block on the left in your Figure 3. Also, when moving from the blocks being generated to the blocks to be generated, isn't some of the KV-cache abandoned? I don't really understand why there is (w+n)M_2 in your formula.
In the paper, the KV of each block is related only to that block and its three adjacent blocks, so shouldn't the KV-cache of n blocks be n·M_2 or 4n·M_2? Where does the width come from? Could you explain?
I'm more than delighted and thankful for your explanation.
When the white block in the lower-left corner is generated later, the KV cache of the "additional block in Figure 3 on the left" is still needed, so we have to keep it in memory.
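A small simulation may make this concrete. This is my own sketch, not the paper's code: it assumes blocks on a w-wide grid are generated in raster order and that each block attends to its left, upper-left, and upper neighbours (the "three adjacent blocks" above). A block's KV cache can only be freed once no future block reads it, so a tail of the previous row must be kept alive, and the number of live caches stays proportional to w rather than to the total number of blocks:

```python
def peak_live_kv(rows, w):
    """Simulate raster-order block generation; return the peak number of
    live KV caches. Dependency pattern (assumed): left, upper-left, upper."""
    deps = lambda i, j: [(i, j - 1), (i - 1, j - 1), (i - 1, j)]
    order = [(i, j) for i in range(rows) for j in range(w)]

    # last_use[b] = step index of the last block that reads b's KV cache
    last_use = {}
    for t, (i, j) in enumerate(order):
        for d in deps(i, j):
            if d[0] >= 0 and 0 <= d[1] < w:
                last_use[d] = t

    live, peak = set(), 0
    for t, b in enumerate(order):
        live.add(b)                      # cache created when block is generated
        peak = max(peak, len(live))
        # free every cache that no future block will read
        live = {x for x in live if last_use.get(x, -1) > t}
    return peak

print(peak_live_kv(rows=8, w=6))   # -> 8, i.e. w + 2
print(peak_live_kv(rows=20, w=6))  # -> 8, unchanged: independent of rows
```

Under these assumptions the peak settles at roughly one full row plus a constant, i.e. O(w), regardless of how many rows are generated, which matches memory growing with the width w rather than with the total block count.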