hodlen opened this issue 10 months ago
I believe a statistical method could be employed to set all outputs of non-ReLU activation functions below, for instance, the 30th percentile to zero, obtaining sparsity guarantees similar to those provided by ReLU.
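For example, a minimal sketch of that idea in PyTorch, assuming a per-tensor cutoff (the function and the 30th-percentile choice are purely illustrative, not anything PowerInfer ships):

```python
import torch

def percentile_threshold(acts: torch.Tensor, pct: float = 30.0) -> torch.Tensor:
    """Zero out activations below the given percentile (illustrative only)."""
    # A per-neuron or per-layer cutoff calibrated offline would also be possible.
    cutoff = torch.quantile(acts.float(), pct / 100.0)
    return torch.where(acts < cutoff, torch.zeros_like(acts), acts)

# Roughly pct% of the values become exact zeros, mimicking ReLU-style sparsity.
x = torch.randn(4, 1024)
print((percentile_threshold(x, 30.0) == 0).float().mean())  # ~0.30
```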
It's also important to keep MoE models in mind as you expand the compatibility of PowerInfer. The ceiling for consumer-grade GPUs is around 3_0 for an 8x7B, so if PowerInfer can easily handle 5_K_M or even 6_K for an 8x7B, it will be really good news.
Create a ReLU version of the popular Mixtral Instruct v0.1 (https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1) and the Dolphin fine-tune (https://huggingface.co/cognitivecomputations/dolphin-2.7-mixtral-8x7b), and people will start taking this project seriously.
Thank you for your insight. We are actually training Mixtral now. Please wait for our updates. :)
Hi @YixinSong-e. I noticed that you provide ReLU-LLaMA on HF. I ran the model and found that its sparsity (the fraction of values below zero) is much lower than that of the OPT models, which can reach as high as 99%; ReLU-LLaMA only achieves about 70~80%. This seems to degrade the sparse matmul, given the much lower sparsity observed.
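For reference, this is roughly how such sparsity can be measured with forward hooks; the model name and the `model.model.layers[i].mlp.act_fn` path are assumptions based on the Hugging Face LLaMA implementation, not PowerInfer code:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Rough sketch: measure the share of non-positive outputs after the MLP
# activation function. Loading/placement details are omitted for brevity.
name = "SparseLLM/ReluLLaMA-7B"  # assumed checkpoint name
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

ratios = []
def hook(_module, _inputs, output):
    ratios.append((output <= 0).float().mean().item())

for layer in model.model.layers:
    layer.mlp.act_fn.register_forward_hook(hook)

with torch.no_grad():
    model(**tok("The quick brown fox jumps over the lazy dog.", return_tensors="pt"))

print(f"mean activation sparsity: {sum(ratios) / len(ratios):.1%}")
```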
Hello @llCurious. Yes, for now ReLU-LLaMA has limited sparsity due to the GLU variant, and thus its acceleration ratio is also relatively lower compared to OPT. Interestingly, with the ReGLU activation function we found that even though some activation values are not 0, they can still be ignored.
To push more sparsity in GLU-based models, we are currently running some experiments on Mistral, which we will release soon.
Thanks for your reply. I have a question: in my understanding, ReGLU uses element-wise multiplication, which means the zero values after ReLU remain zero, theoretically yielding the same sparsity level as ReLU?
BTW, I wonder how you calculate the CDF in Figure 5 (power-law activation).
First, it is right that zero values after ReLU remain zero. Further, some of the values that survive the ReLU become very close to zero after multiplication with the GLU output, and those can also be ignored. We will provide a specific explanation of this phenomenon in a paper (in the coming weeks).
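A toy ReGLU sketch of both points (random weights purely for illustration; the near-zero share in a trained model will differ):

```python
import torch
import torch.nn.functional as F

# Toy ReGLU: hidden = relu(x @ W_gate) * (x @ W_up).
# Exact zeros from the ReLU gate survive the element-wise product; on top of
# that, trained models reportedly leave many surviving products near zero.
torch.manual_seed(0)
x = torch.randn(1, 4096)
w_gate = torch.randn(4096, 11008) * 0.02
w_up = torch.randn(4096, 11008) * 0.02

gate = F.relu(x @ w_gate)    # exact zeros where the gate is negative
hidden = gate * (x @ w_up)   # the multiplication preserves those zeros

zeros = (hidden == 0).float().mean().item()
near = (hidden.abs() < 1e-3).float().mean().item()
print(f"exact zeros: {zeros:.1%}, |value| < 1e-3: {near:.1%}")
```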
Second, we do this by collecting the number of activations of all neurons in a given corpus. Then we calculate the CDF of activation counts by sorting the neurons in descending order of activation counts.
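A small NumPy sketch of that procedure, using synthetic power-law counts purely for illustration:

```python
import numpy as np

# Sort neurons by how often they activate, hottest first, then take the
# cumulative share of total activations covered by the top-k neurons.
def activation_cdf(counts: np.ndarray) -> np.ndarray:
    sorted_counts = np.sort(counts)[::-1]
    return np.cumsum(sorted_counts) / sorted_counts.sum()

# Synthetic power-law-ish activation counts for 11008 neurons (illustrative only).
counts = np.random.zipf(1.5, size=11008).astype(float)
cdf = activation_cdf(counts)
print(f"top 20% of neurons cover {cdf[int(0.2 * len(cdf))]:.1%} of activations")
```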
Do you have plans to release the code for the profiler that collects the activation statistics? It would be great to evaluate various models and working points. Thanks!
Hi @hodlen. I notice that you provide ReLUFalcon-40B on HF. Do you have the tuned ReLU-Falcon-7B weights?
We haven't tuned the Falcon 7B model and currently have no plans to do so. After reviewing benchmark performance, we've opted to focus our tuning efforts on Mistral 7B, which has proven to be a more robust foundation model at this scale.
PowerInfer currently optimizes for LLMs (Large Language Models) that utilize the ReLU activation function, leveraging their internal activation locality. However, many of the trending models do not use ReLU activation, creating a significant gap in PowerInfer's applicability.
This ongoing issue tracks our efforts to onboard new LLMs, particularly those in high demand within the community, and to continually enhance our existing ReLU-based LLMs.
Onboarding Progress
We're actively fine-tuning models into ReLU sparse models:
To invite broader participation, we're also:
Onboarding New Models
We recognize that fine-tuning upstream models is computationally intensive, and the requirement for high-quality data often surpasses our current capabilities. As such, we are actively seeking industrial collaborations to unlock more of PowerInfer's potential and bring state-of-the-art models to a wider audience. For direct inquiries and partnership discussions, please contact us at yzmizeyu@sjtu.edu.cn.
We will also focus on models that have garnered significant interest in our community 🌟. Your input and feedback are highly valued and encouraged! 💬👍