SJTU-IPADS / PowerInfer

High-speed Large Language Model Serving on PCs with Consumer-grade GPUs
MIT License

How to calculate the number of activated params in TurboSparse paper? #195

Closed ustcwhy closed 3 months ago

ustcwhy commented 3 months ago

Question Details

Thanks for your wonderful work on TurboSparse.

It is impressive that Mistral-7B activates only 2.5B parameters while still achieving good performance.

However, I wonder how this number "2.5B" is calculated. I checked the code released on HF, and it seems that only the input of the down projection in each GLU layer is sparsified. Even if the sparsity ratio reached 100%, the down projection's parameters would account for only about 2/9 of the model.
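For reference, a rough back-of-the-envelope version of that 2/9 estimate, using Mistral-7B-like shapes (the layer sizes and the ~7B total here are illustrative assumptions, not figures from the repo):

```python
# Rough accounting behind the ~2/9 estimate above,
# using assumed Mistral-7B-like shapes.
d_model, d_ff, n_layers = 4096, 14336, 32

ffn_total   = 3 * d_model * d_ff * n_layers  # gate + up + down projections
down_total  = d_model * d_ff * n_layers      # down projection alone
model_total = 7.24e9                         # ~7B total parameters (assumed)

# Down projection alone is ~26% of the model,
# i.e. on the order of the 2/9 mentioned above.
print(f"{down_total / model_total:.2f}")
```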

YixinSong-e commented 3 months ago

Thank you for your interest! Actually, we consider the corresponding rows in the up and gate projections together with the corresponding column in the down projection as a single neuron. When an element of the intermediate result is zero after the activation function, we regard that neuron in the up, gate, and down projections as inactive. For example, if the first element of the intermediate vector is zero after passing through the gate, up, and ReLU activation, we consider the first-row weights of the up and gate projections, as well as the first-column weights of the down projection, to be inactive. Following DejaVu and PowerInfer, we can predict the inactive neurons ahead of time and skip the corresponding rows of the gate and up projection matrices.

If 90% of the intermediate elements are zero, we regard only 10% of the FFN parameters as activated.
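A minimal PyTorch sketch of this neuron-level accounting (the ReLU-gated FFN and the sizes here are illustrative assumptions, not the released TurboSparse code):

```python
import torch

# Illustrative Mistral-7B-like sizes (assumptions, not from the repo).
d_model, d_ff = 4096, 14336

gate = torch.nn.Linear(d_model, d_ff, bias=False)
up   = torch.nn.Linear(d_model, d_ff, bias=False)
down = torch.nn.Linear(d_ff, d_model, bias=False)

x = torch.randn(d_model)

# ReLU-gated GLU: intermediate element i is zero whenever the
# gate output for neuron i is clipped by the ReLU.
inter = torch.relu(gate(x)) * up(x)

# Neuron i owns row i of gate and up plus column i of down,
# i.e. 3 * d_model weights. It is "active" iff inter[i] != 0.
n_active = int((inter != 0).sum())

activated = n_active * 3 * d_model
total = 3 * d_ff * d_model
print(f"active neurons: {n_active}/{d_ff}, "
      f"activated FFN params: {activated / total:.1%}")
```

At inference time, a PowerInfer/DejaVu-style predictor selects the active neurons before the matrix multiplications, so the skipped rows and columns are never loaded or multiplied; the counting above only measures the resulting parameter activation ratio.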

ustcwhy commented 3 months ago


Thanks for your comment! I get it.