SJTU-IPADS / PowerInfer

High-speed Large Language Model Serving on PCs with Consumer-grade GPUs

Would you please kindly offer the data, codes, or settings for training the predictor? #124

Closed Raincleared-Song closed 10 months ago

Raincleared-Song commented 10 months ago

Question Details

I'm trying to train the sparsity predictor following DejaVu, but I ran into a strange result. I generated the predictor training data on C4 myself. For ReLULLaMA-7B, the predictor I trained achieves higher recall on C4 than the one you provide in ReLULLaMA-7B-Predictor (e.g., 0.94 vs. 0.90 in layer 0).
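
For clarity, recall here means the fraction of truly activated neurons that the predictor also flags as active. A minimal sketch of how I compute it (the sigmoid threshold of 0.5 is my own choice, not necessarily what the released predictors use):

```python
import torch

def predictor_recall(pred_logits: torch.Tensor, true_active: torch.Tensor,
                     threshold: float = 0.5) -> float:
    """Recall = fraction of truly activated neurons that the predictor also marks active.

    pred_logits: (tokens, n_neurons) raw predictor outputs.
    true_active: (tokens, n_neurons) boolean ground-truth activation labels.
    """
    pred_active = torch.sigmoid(pred_logits) > threshold
    hits = (pred_active & true_active).sum().item()
    total = true_active.sum().item()
    return hits / max(total, 1)
```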

However, when I plug this predictor into PowerInfer, the decoding efficiency is considerably lower than with your ReLULLaMA-7B-Predictor. What might be going wrong? (The upper screenshot was obtained with my own predictor; the lower one with ReLULLaMA-7B-Predictor.)


The discrepancy may also be caused by mistakes on my side, so I am attaching the code for generating the training data (get_llama_data.py and hf_llama_module.py) and for training the predictor (main_mlp.py, run_c4_mlp.sh, trainer_mlp.py).
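
For context, my data generation follows the general idea below: record the hidden state entering each FFN as the predictor input, and mark the neurons with a positive post-ReLU value as activated. This is only a minimal sketch (the Hugging Face model name and LLaMA-style module paths are assumptions; my actual scripts differ in detail):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Model name and module paths assumed for illustration.
model_name = "SparseLLM/ReluLLaMA-7B"
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model.eval()

layer_idx = 0
mlp = model.model.layers[layer_idx].mlp  # LlamaMLP: gate_proj / up_proj / down_proj

records = {"inputs": [], "labels": []}

def collect(module, inputs, output):
    # inputs[0] is the hidden state entering the MLP -> predictor features.
    x = inputs[0].detach()
    # Re-run the gate projection + ReLU to get per-neuron activations;
    # a neuron with a positive post-ReLU value counts as "activated".
    act = torch.relu(module.gate_proj(x)).detach()
    records["inputs"].append(x.flatten(0, -2).float().cpu())
    records["labels"].append((act > 0).flatten(0, -2).cpu())

handle = mlp.register_forward_hook(collect)

with torch.no_grad():
    ids = tokenizer("A short C4 sample would go here.", return_tensors="pt").input_ids
    model(ids.to(model.device))

handle.remove()
```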

Looking forward to your response! Of course, the best solution would be to open-source the data, code, or even just the parameter settings used to train ReLULLaMA-7B-Predictor.

YixinSong-e commented 10 months ago

Thank you for your interest in our work. Open-sourcing the predictor training code and settings is indeed on our roadmap. At the moment we are prioritizing other work, such as optimizing the CUDA operators, refactoring the codebase, and supporting the Mistral and Mixtral models. The code you provided is a great starting point for open-source predictor training. Let me share some details about how we train the predictor:

1. During training, the predictor's hidden layer dimension is adaptive rather than fixed at 1000.
2. The recall of PowerInfer's predictors is very high on the OPT and Falcon models. Because ReLULLaMA itself has relatively low sparsity, when collecting activated neurons we label the top k% (e.g., 15%) of neurons, ranked by the L2-norm of their outputs, as activated. This means the predictor does not need to be very large.
3. We have also been running some interesting experiments recently to push SwiGLU-based LLMs toward higher sparsity when converting them to ReGLU models.

In any case, we are organizing the predictor-related code and hope to provide easy-to-use tools for the open-source community in the future.
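
To make point 2 concrete, here is a minimal sketch of a top-k% labeling rule and a predictor MLP whose hidden width is chosen per layer. The per-token absolute value used as the ranking score and the two-layer architecture are illustrative assumptions, not the exact implementation:

```python
import torch
import torch.nn as nn

def topk_activation_labels(neuron_out: torch.Tensor, k_frac: float = 0.15) -> torch.Tensor:
    """Label the top k% of neurons per token as 'activated'.

    neuron_out: (tokens, n_neurons) intermediate MLP activations.
    The per-token absolute value is an illustrative stand-in for the
    L2-norm criterion described above.
    """
    n_neurons = neuron_out.shape[-1]
    k = max(1, int(k_frac * n_neurons))
    topk_idx = neuron_out.abs().topk(k, dim=-1).indices
    labels = torch.zeros_like(neuron_out)
    labels.scatter_(-1, topk_idx, 1.0)
    return labels.bool()

class SparsityPredictor(nn.Module):
    """Small MLP predictor; the hidden width is picked per layer, not fixed at 1000."""

    def __init__(self, d_model: int, n_neurons: int, hidden: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_neurons),  # one activation logit per FFN neuron
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)
```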