Thank you for your interest! Please see https://huggingface.co/SparseLLM/prosparse-llama-2-7b/discussions/2
Thank you for your prompt response. I truly appreciate it. I have updated my question on Hugging Face. Thank you again for your help!
Clarification on Output Neuron Pruning Method in "Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time"
Hello,
I am attempting to replicate the findings of "Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time" and have some questions about the methodology for predicting and applying sparsity in the MLP layers of LLMs, specifically for models like Llama-2-7B.
Both the "Deja Vu" paper and follow-up works such as "ProSparse: Introducing and Enhancing Intrinsic Activation Sparsity within Large Language Models" propose using two small, low-rank linear layers to predict which outputs of the large MLP layers will be zero. To make this applicable to Llama-style models, ProSparse replaces the usual activation function (SiLU or GeLU) with ReLU so that the MLP activations are genuinely sparse, and then applies the Deja Vu-style predictor.
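As I understand it, the predictor is a two-layer low-rank module of roughly the following shape (a minimal PyTorch sketch based on my reading of the papers; the class name, the default rank, and the absence of an intermediate nonlinearity are my own assumptions, not the papers' exact code):

```python
import torch
import torch.nn as nn

class LowRankPredictor(nn.Module):
    """Two small linear layers that score, per token, each of the d_ffn
    intermediate neurons of the big MLP for being active (non-zero)."""
    def __init__(self, d_model: int, d_ffn: int, rank: int = 1024):
        super().__init__()
        self.proj_down = nn.Linear(d_model, rank, bias=False)  # d_model -> rank
        self.proj_up = nn.Linear(rank, d_ffn, bias=False)      # rank -> d_ffn

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # One logit per intermediate neuron; how these logits get
        # binarized (top-k vs. threshold) is exactly my question.
        return self.proj_up(self.proj_down(x))
```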
However, it is unclear how to decide, from the predictor's output, which output neurons should be pruned. Is it more appropriate to take the top-k indices of the predictor output as the active set, or should a threshold-based method be applied, i.e. any predictor output at or below zero means the corresponding output of the real MLP + ReLU should be treated as zero?
Code Snippet from Deja Vu's Training:
This code suggests a threshold-based approach to neuron pruning.
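Concretely, the threshold logic I am referring to amounts to training the predictor as a per-neuron binary classifier whose labels are the true activations thresholded at zero (an illustrative sketch in my own words, not the verbatim Deja Vu training code):

```python
import torch
import torch.nn.functional as F

def predictor_loss(predictor_logits: torch.Tensor,
                   true_activations: torch.Tensor) -> torch.Tensor:
    # Label a neuron "active" iff its true post-ReLU activation is > 0,
    # i.e. a hard threshold at zero on the real MLP output.
    labels = (true_activations > 0).float()
    return F.binary_cross_entropy_with_logits(predictor_logits, labels)
```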
Deja Vu's Python Modeling:
In contrast, this snippet utilizes a top-k function for identifying active neurons.
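In other words, the modeling code keeps a fixed-size active set per token, roughly like this (my own simplified sketch: single token, plain ReLU MLP, with W_up/W_down as assumed names, ignoring Llama's gated MLP structure):

```python
import torch

def sparse_mlp_topk(x, W_up, W_down, predictor_logits, k: int):
    # x: (d_model,), W_up: (d_ffn, d_model), W_down: (d_model, d_ffn)
    _, idx = torch.topk(predictor_logits, k)   # top-k predicted-active neurons
    h = torch.relu(W_up[idx] @ x)              # compute only the selected rows
    return W_down[:, idx] @ h                  # ...and the matching columns
```

This makes the per-token cost fixed and batch-friendly, but it requires choosing k up front, which is exactly where my question comes from.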
PowerInfer's Implementation Snippet:
This implementation appears to adopt a threshold-based method, yet it's unclear how it aligns with the methods described in Deja Vu or ProSparse.
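In Python terms, my reading of that threshold variant is roughly the following (a sketch of the idea only; PowerInfer itself is implemented in C++/CUDA, and I may be misreading its kernels):

```python
import torch

def sparse_mlp_threshold(x, W_up, W_down, predictor_logits,
                         threshold: float = 0.0):
    # Variable-size active set: keep every neuron whose predictor score
    # clears the threshold, however many neurons that turns out to be.
    idx = (predictor_logits > threshold).nonzero(as_tuple=True)[0]
    h = torch.relu(W_up[idx] @ x)
    return W_down[:, idx] @ h
```

The trade-off I see is that the active set size then varies per token, which complicates batching but avoids having to pick k at all.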
So how does the predictor work?
In my own experiments the top-k method proved effective, but ProSparse gives no details on how "k" should be selected. Given these observations, I seek clarification on two fronts:
1. What is the recommended method for determining which neurons should be pruned: top-k or threshold-based?
2. How is the "k" value for the top-k method determined in practice, especially considering the variable sparsity levels across different models and tasks? (My current empirical workaround is sketched below.)
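For context on item 2, the only approach I have found is empirical: run calibration data through the dense ReLU model, measure the average number of active neurons per layer, and size k with some safety margin (a rough sketch under my own assumptions; the margin value and per-layer granularity are choices I made, not something from either paper):

```python
import torch

@torch.no_grad()
def estimate_k(calib_acts: torch.Tensor, margin: float = 1.2) -> int:
    # calib_acts: (num_tokens, d_ffn) post-ReLU activations collected
    # from the dense model on a calibration set, for one MLP layer.
    active_per_token = (calib_acts > 0).sum(dim=-1).float()
    return int(active_per_token.mean().item() * margin)  # mean + margin
```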
Any insights or clarifications on these points would be greatly appreciated, as they would make these promising sparsity techniques much easier to apply and explore in practice.
Thank you.