SJTU-IPADS / PowerInfer

High-speed Large Language Model Serving on PCs with Consumer-grade GPUs

What does "co-activation" mean in Section 4.3 of the PowerInfer-2 paper? #222

Closed: exhyy closed this issue 1 month ago

exhyy commented 2 months ago


Question Details

This is a question about the PowerInfer-2 paper, not about this codebase.

Based on my understanding, the term "co-activation" describes a phenomenon inherent to the FFN: when the i-th element of the intermediate activations is zero, the corresponding i-th rows of the Gate and Up matrices and the i-th column of the Down matrix contribute nothing and can be discarded entirely. In my opinion, "co-activation" therefore refers to a property of the FFN itself and should NOT be related to the sparsity predictor. However, Section 4.3 of the paper states the following:
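To make my understanding concrete, here is a toy NumPy sketch of the gated-FFN sparsity I am describing (the names `W_gate`, `W_up`, `W_down` and the ReLU activation are my own stand-ins, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 8, 16                                  # toy hidden / intermediate sizes
x = rng.standard_normal(d)
W_gate = rng.standard_normal((m, d))
W_up = rng.standard_normal((m, d))
W_down = rng.standard_normal((d, m))
relu = lambda z: np.maximum(z, 0.0)

# Dense gated FFN: down(act(gate(x)) * up(x)).
h = relu(W_gate @ x) * (W_up @ x)             # intermediate activations
dense_out = W_down @ h

# Wherever h[i] == 0, row i of W_gate/W_up and column i of W_down are
# irrelevant; computing only the active neurons gives the same output.
active = np.flatnonzero(h != 0)
h_active = relu(W_gate[active] @ x) * (W_up[active] @ x)
sparse_out = W_down[:, active] @ h_active

assert np.allclose(dense_out, sparse_out)
```

The point is that skipping inactive neurons is exact and is a property of the FFN itself; no predictor is involved.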

Moreover, considering the co-activation likelihood of 80% within these bundles, there is still nearly 20% probability that these bundled neurons are not co-activated. Thus, combining the two 4KB random reads could potentially lead to bandwidth wastes. To mitigate this, for models using 4-bit quantization, PowerInfer-2 delays the second 4KB read until the outcomes from the Gate neuron multiplications are obtained. Specifically, PowerInfer-2 uses the predictor to determine the activation of neurons within the Gate matrix, initiating the load for the first part of the bundle based on this information. Afterwards, if the output from the Gate neuron (passing through the activation function) is non-zero, PowerInfer-2 proceeds to load the second part of the bundle, thus minimizing unnecessary I/O operations.

This passage seems to suggest that co-activation is indeed related to the sparsity predictor, as the 20% of "not co-activated" instances stem from the mismatch between the results of the sparsity predictor and the actual computation of the gate matrix.
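For concreteness, here is the pipeline in that passage as I read it, as a toy sketch (all names here, `predictor_says_active`, `process_bundle`, and the ReLU activation, are my own stand-ins, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                    # toy hidden size
x = rng.standard_normal(d)

# Toy stand-ins for one neuron "bundle": the Gate row would sit in the
# first 4KB block, the Up row and Down column in the second.
gate_row = rng.standard_normal(d)
up_row = rng.standard_normal(d)
down_col = rng.standard_normal(d)

def relu(z):
    return max(z, 0.0)

def predictor_says_active(x):
    # Stand-in for the learned sparsity predictor (always "active" here).
    return True

def process_bundle(x):
    if not predictor_says_active(x):
        return np.zeros(d)               # predicted inactive: no I/O at all
    # ... first 4KB read happens here (fetch gate_row) ...
    g = relu(gate_row @ x)               # actual Gate neuron output
    if g == 0.0:
        # The predictor fired but the Gate output is zero: the
        # "not co-activated" (~20%) case. The second 4KB read is skipped.
        return np.zeros(d)
    # ... second 4KB read happens here (fetch up_row, down_col) ...
    return g * (up_row @ x) * down_col   # this bundle's output contribution

print(process_bundle(x))
```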

In short, my question is: does "co-activation" describe a phenomenon inherent to the FFN, or does it refer to the match (or mismatch) between the sparsity predictor's output and the actual computation of the Gate matrix?

Additional Context

None.