Closed (neuer93 closed this issue 1 year ago)
I think this code is used for accuracy evaluation, not for measuring latency.
Thanks for the explanation! Do you know whether they provide code for latency evaluation, i.e., code that uses the sparsity predictor to reduce end-to-end latency for LLM inference?
Thanks
Hi @neuer93, did you find the latency code? I have the same confusion about how the sparsity predictor reduces latency. Thanks.
Hi,
Thanks for sharing the code online; I really enjoyed reading your paper.
While reading the paper, I got confused about the implementation of the sparse MLP in
Decentralized_FM_alpha/modules/hf_opt_sparse_mlp_attention.py,
lines 552 to 558. Based on my understanding of the paper, the predictor should decide which neurons to prune before fc1 is computed. Here, I find that the mask is applied after fc1, which means that the fc1 computation is not sparse. I am not sure whether I am missing something. Is this the right file (or function) for the sparse MLP with the sparsity predictor?
Thanks
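To make the question concrete, here is a minimal sketch of the two orderings being contrasted. It is not the repository's code: the function names, shapes, and the randomly chosen `active` index list (a stand-in for the predictor's output) are all hypothetical, and biases are left out for brevity. It only illustrates that masking after a dense fc1 and selecting rows before fc1 give the same output, while the second variant reads just the active rows of the weights.

```python
import random


def relu(values):
    return [max(0.0, v) for v in values]


def dense_then_mask(x, w1, w2, active):
    """Compute every fc1 neuron, then zero those not in `active`."""
    hidden = relu([sum(w * xi for w, xi in zip(row, x)) for row in w1])
    keep = set(active)
    hidden = [h if i in keep else 0.0 for i, h in enumerate(hidden)]
    # w2[i] holds the output-side weights of hidden neuron i.
    return [sum(hidden[i] * w2[i][j] for i in range(len(w2)))
            for j in range(len(w2[0]))]


def select_then_compute(x, w1, w2, active):
    """Touch only the fc1 and fc2 rows of the neurons in `active`."""
    hidden = relu([sum(w * xi for w, xi in zip(w1[i], x)) for i in active])
    return [sum(h * w2[i][j] for h, i in zip(hidden, active))
            for j in range(len(w2[0]))]


def demo(d_model=8, d_ff=32, n_active=6, seed=0):
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in range(d_model)]
    w1 = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(d_ff)]
    w2 = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(d_ff)]
    # Hypothetical stand-in for the sparsity predictor's selected neurons.
    active = sorted(rng.sample(range(d_ff), n_active))
    return (dense_then_mask(x, w1, w2, active),
            select_then_compute(x, w1, w2, active))
```

Calling `demo()` returns both outputs, which should agree up to floating-point rounding. The sketch says nothing about wall-clock savings; whether the second ordering is actually faster depends on how the row selection is implemented on the target hardware.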