Closed · wangjg33 closed this issue 1 year ago
Congratulations, this is an excellent job!
I have a question after reading the code and the paper. The paper mentions that the k parameter of top-k can be dynamically learned, but in the code it appears to be a fixed K value. The relevant code should be on lines 140 to 154 of the DRSformer_arch.py file; see here.
Could you please help me understand how the K mentioned in the paper is dynamically learned?
Thanks for your attention! Here, K is effectively obtained as a weighted average over several fixed sparsity ratios (proper fractions), rather than being a single learned value. In other words, it is the sparsity level that is dynamically learnable, not K itself.
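For readers landing here later, a minimal sketch of the mechanism being described (the class name, `ratios`, and `branch_weights` below are illustrative, not the repo's actual identifiers): attention is computed once, masked at several fixed top-k ratios, and the branch outputs are blended with learnable scalar weights, so the effective sparsity level is learned even though each individual k is fixed.

```python
import torch
import torch.nn as nn


class MultiRatioTopKAttention(nn.Module):
    """Sketch: top-k sparse attention blended over several fixed
    sparsity ratios, so the effective sparsity level is learnable."""

    def __init__(self, ratios=(1/2, 2/3, 3/4, 4/5)):
        super().__init__()
        self.ratios = ratios
        # One learnable scalar weight per sparsity branch (illustrative).
        self.branch_weights = nn.Parameter(torch.ones(len(ratios)))

    def forward(self, q, k, v):
        # q, k, v: (batch, heads, channels, tokens), as in channel attention
        attn = q @ k.transpose(-2, -1)  # raw similarity scores (b, h, C, C)
        C = attn.shape[-1]
        out = 0.0
        for w, r in zip(self.branch_weights, self.ratios):
            # Keep only the top int(r * C) scores per row; mask the rest
            # with -inf so they vanish under softmax.
            topk = torch.topk(attn, k=int(C * r), dim=-1, largest=True)[1]
            mask = torch.zeros_like(attn).scatter_(-1, topk, 1.0)
            sparse = torch.where(mask > 0, attn,
                                 torch.full_like(attn, float('-inf')))
            out = out + w * (sparse.softmax(dim=-1) @ v)
        return out
```

With learnable weights (w1, ..., w4) over the four fixed ratios, the effective number of retained entries behaves like a weighted average of those ratios, which is one reading of the reply above.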
Thank you for your reply. I think I understand what you mean. But I also noticed something else in the code here: attention values falling in the (4/5, 1) range are kept by all four branches, so they are added four times and the final "out" value is scaled up accordingly. Values in (3/4, 4/5) and (2/3, 3/4) are similarly counted 3 and 2 times, respectively. Is that right? Additionally, splitting the attention matrix into four separate computations may increase the computational cost by roughly four times?
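To make the over-counting concern concrete, here is a small check (assuming the branches use nested top-k masks as sketched above, which is my reading rather than the authors' statement): an entry that survives several masks contributes to the output once per surviving branch, scaled only by that branch's learned weight.

```python
import torch

torch.manual_seed(0)
C = 20
attn = torch.randn(C)  # one row of an attention map, for illustration

ratios = (1/2, 2/3, 3/4, 4/5)
survivals = torch.zeros(C)
for r in ratios:
    idx = torch.topk(attn, k=int(C * r)).indices
    survivals[idx] += 1  # how many branches keep each entry

# The largest scores survive all four branches, so (before the learned
# weights) they contribute four times; the weights rescale, but do not
# renormalize, these repeated contributions.
print(survivals.sort(descending=True).values)
```

One caveat: each branch applies its own softmax over a different support, so the repeated entries are not numerically identical copies across branches, even though the top-ranked positions do contribute to all four terms.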