FMInference / DejaVu


Questions about Attention Sparsity Implementation #26

Closed czq693497091 closed 5 months ago

czq693497091 commented 5 months ago

I really enjoy your work on sparsity and have successfully run DejaVu, but I still have some questions.

In hf_opt_sparse_mlp_attention.py (https://github.com/FMInference/DejaVu/blob/master/Decentralized_FM_alpha/modules/hf_opt_sparse_mlp_attention.py), in the function prepare_head_mask: in the paper, DejaVu seems to select only the active heads for the QKV computation. But in prepare_head_mask, it looks like QKV is computed for all heads and the mask is then applied to simulate the sparse computation result, which does not match the description in the paper. How should I understand this part? Thanks!
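To make my question concrete, here is a rough sketch of how I read the two strategies. This is my own illustration, not the repo's code; the tensor names, shapes, and the `head_mask` variable are assumptions based on my reading of prepare_head_mask.

```python
# Minimal sketch (my understanding, not the actual DejaVu code) contrasting
# masked "simulated" sparsity with truly sparse head selection.
import torch
import torch.nn as nn

hidden, num_heads = 768, 12
head_dim = hidden // num_heads
x = torch.randn(1, 16, hidden)                       # [batch, seq, hidden]
qkv = nn.Linear(hidden, 3 * hidden, bias=False)      # fused QKV projection
head_mask = (torch.rand(num_heads) > 0.5).float()    # 0/1 per head from the predictor (hypothetical)

# (a) What prepare_head_mask seems to do: compute QKV for *all* heads,
#     then zero out the inactive heads with the mask.
q, k, v = qkv(x).chunk(3, dim=-1)
q = q.view(1, 16, num_heads, head_dim) * head_mask.view(1, 1, num_heads, 1)

# (b) What I expected from the paper: slice the weights so that QKV is only
#     ever computed for the active heads.
active = head_mask.nonzero().squeeze(-1)              # indices of active heads
w_q = qkv.weight[:hidden].view(num_heads, head_dim, hidden)[active]  # [n_active, head_dim, hidden]
q_sparse = torch.einsum("bsh,ndh->bsnd", x, w_q)      # projections for active heads only
```

If I understand correctly, (a) gives the same numerical result as (b) for the kept heads but still pays the full dense QKV cost, so the mask only simulates sparsity rather than saving computation.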