This is a question about the PowerInfer-2 paper, not this codebase.
Based on my understanding, the term "co-activation" describes a phenomenon within the FFN: when the i-th feature of the hidden states is zero, the corresponding rows or columns of the `gate`, `up`, and `down` matrices can also be zeroed out or discarded entirely. In my opinion, "co-activation" therefore refers to a phenomenon inherent to the FFN and should NOT be related to the sparsity predictor.
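To make my mental model concrete, here is a minimal sketch (my own illustration, not code from the paper or this repo), assuming a LLaMA-style gated FFN with a ReLU-like activation so that exact zeros actually occur:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32
x = rng.standard_normal(d_model)
W_gate, W_up = rng.standard_normal((2, d_ff, d_model))
W_down = rng.standard_normal((d_ff, d_model))   # stored row-per-neuron

relu = lambda h: np.maximum(h, 0.0)             # ReLU so zeros occur exactly

def dense_ffn(x):
    h = relu(W_gate @ x)                        # intermediate features, shape (d_ff,)
    return W_down.T @ (h * (W_up @ x))

def sparse_ffn(x):
    h = relu(W_gate @ x)
    i = np.nonzero(h)[0]                        # indices where h[i] != 0
    # Only rows i of W_gate/W_up and of W_down are needed; the rest
    # contribute nothing to the output and can be skipped entirely.
    return W_down[i].T @ (h[i] * (W_up[i] @ x))

assert np.allclose(dense_ffn(x), sparse_ffn(x))
```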
However, in section 4.3 of the paper, the following statement is made:

> Moreover, considering the co-activation likelihood of 80% within these bundles, there is still nearly 20% probability that these bundled neurons are not co-activated. Thus, combining the two 4KB random reads could potentially lead to bandwidth wastes. To mitigate this, for models using 4-bit quantization, PowerInfer-2 delays the second 4KB read until the outcomes from the Gate neuron multiplications are obtained. Specifically, PowerInfer-2 uses the predictor to determine the activation of neurons within the Gate matrix, initiating the load for the first part of the bundle based on this information. Afterwards, if the output from the Gate neuron (passing through the activation function) is non-zero, PowerInfer-2 proceeds to load the second part of the bundle, thus minimizing unnecessary I/O operations.
This passage seems to suggest that co-activation is in fact related to the sparsity predictor, since the 20% of "not co-activated" instances appear to stem from a mismatch between the sparsity predictor's output and the actual result of the `gate` computation.
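For concreteness, this is how I picture the control flow described in that passage. Everything here (`predictor`, `load_4kb`, the two-part bundle layout) is a placeholder I invented for illustration, not PowerInfer-2's actual API:

```python
import numpy as np
from types import SimpleNamespace

rng = np.random.default_rng(0)
d_model = 8
x = rng.standard_normal(d_model)

# --- Hypothetical stand-ins, invented purely for illustration -------------
bundle = SimpleNamespace(neuron_id=0, first_part="blk0", second_part="blk1")
storage = {"blk0": {"gate_row": rng.standard_normal(d_model),   # 1st "4KB" part
                    "up_row":   rng.standard_normal(d_model)},
           "blk1": {"down_row": rng.standard_normal(d_model)}}  # 2nd "4KB" part
predictor = SimpleNamespace(predicts_active=lambda nid: True)
load_4kb = storage.__getitem__            # pretend a dict lookup is a 4KB read

relu = lambda h: np.maximum(h, 0.0)

def process_bundle(x, bundle):
    # Step 1: the sparsity predictor gates the *first* 4KB read.
    if not predictor.predicts_active(bundle.neuron_id):
        return 0.0                        # predicted inactive: no I/O at all

    first = load_4kb(bundle.first_part)   # read the part holding the Gate row
    g = relu(first["gate_row"] @ x)       # the *actual* Gate-neuron output

    # Step 2: the second read is delayed until g is known. If g == 0
    # (the ~20% "not co-activated" case from the quote), it never happens.
    if g == 0.0:
        return 0.0

    second = load_4kb(bundle.second_part) # only now read the Down row
    return g * (first["up_row"] @ x) * second["down_row"]

print(process_bundle(x, bundle))
```

Under this reading, the second read is only ever issued when the Gate output is truly non-zero, which is what made me wonder what "co-activated" is actually measured against.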
In short, my question is: does "co-activation" describe a phenomenon inherent to the FFN, or does it refer to the match (or mismatch) between the sparsity predictor's predictions and the actual `gate` computation?