SJTU-IPADS / PowerInfer

High-speed Large Language Model Serving on PCs with Consumer-grade GPUs

What does "co-activation" mean in Section 4.3 of the PowerInfer-2 paper? #222

Closed: exhyy closed this issue 1 month ago

exhyy commented 2 months ago


Question Details

This is a question about the PowerInfer-2 paper, not about this codebase.

Based on my understanding, the term "co-activation" describes a phenomenon inherent to the FFN: when the i-th element of the intermediate activations is zero, the corresponding i-th rows of the Gate and Up matrices and the i-th column of the Down matrix contribute nothing and can be discarded entirely. In my opinion, "co-activation" therefore refers to a property of the FFN itself and should NOT be related to the sparsity predictor. However, Section 4.3 of the paper states the following:
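To make my understanding concrete, here is a toy NumPy sketch of the gated-FFN sparsity I am describing (the names `W_gate`, `W_up`, `W_down` and the ReLU activation are my own stand-ins, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 8, 16                                  # toy hidden / intermediate sizes
x = rng.standard_normal(d)
W_gate = rng.standard_normal((m, d))
W_up = rng.standard_normal((m, d))
W_down = rng.standard_normal((d, m))
relu = lambda z: np.maximum(z, 0.0)

# Dense gated FFN: down(act(gate(x)) * up(x)).
h = relu(W_gate @ x) * (W_up @ x)             # intermediate activations
dense_out = W_down @ h

# Wherever h[i] == 0, row i of W_gate/W_up and column i of W_down are
# irrelevant; computing only the active neurons gives the same output.
active = np.flatnonzero(h != 0)
h_active = relu(W_gate[active] @ x) * (W_up[active] @ x)
sparse_out = W_down[:, active] @ h_active

assert np.allclose(dense_out, sparse_out)
```

The point is that skipping inactive neurons is exact and is a property of the FFN itself; no predictor is involved.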

Moreover, considering the co-activation likelihood of 80% within these bundles, there is still nearly 20% probability that these bundled neurons are not co-activated. Thus, combining the two 4KB random reads could potentially lead to bandwidth wastes. To mitigate this, for models using 4-bit quantization, PowerInfer-2 delays the second 4KB read until the outcomes from the Gate neuron multiplications are obtained. Specifically, PowerInfer-2 uses the predictor to determine the activation of neurons within the Gate matrix, initiating the load for the first part of the bundle based on this information. Afterwards, if the output from the Gate neuron (passing through the activation function) is non-zero, PowerInfer-2 proceeds to load the second part of the bundle, thus minimizing unnecessary I/O operations.

This passage seems to suggest that co-activation is indeed related to the sparsity predictor, as the 20% of "not co-activated" instances stem from the mismatch between the results of the sparsity predictor and the actual computation of the gate matrix.
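For concreteness, here is the pipeline in that passage as I read it, as a toy sketch (all names here, `predictor_says_active`, `process_bundle`, and the ReLU activation, are my own stand-ins, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                    # toy hidden size
x = rng.standard_normal(d)

# Toy stand-ins for one neuron "bundle": the Gate row would sit in the
# first 4KB block, the Up row and Down column in the second.
gate_row = rng.standard_normal(d)
up_row = rng.standard_normal(d)
down_col = rng.standard_normal(d)

def relu(z):
    return max(z, 0.0)

def predictor_says_active(x):
    # Stand-in for the learned sparsity predictor (always "active" here).
    return True

def process_bundle(x):
    if not predictor_says_active(x):
        return np.zeros(d)               # predicted inactive: no I/O at all
    # ... first 4KB read happens here (fetch gate_row) ...
    g = relu(gate_row @ x)               # actual Gate neuron output
    if g == 0.0:
        # The predictor fired but the Gate output is zero: the
        # "not co-activated" (~20%) case. The second 4KB read is skipped.
        return np.zeros(d)
    # ... second 4KB read happens here (fetch up_row, down_col) ...
    return g * (up_row @ x) * down_col   # this bundle's output contribution

print(process_bundle(x))
```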

In short, my question is: does "co-activation" describe a phenomenon inherent to the FFN, or does it refer to the match (or mismatch) between the sparsity predictor's output and the actual computation of the Gate matrix?

Additional Context

None.