andy-yang-1 / DoubleSparse

16-fold memory access reduction with nearly no loss
MIT License

Question about offline calibration #4

Open vnkc1 opened 1 week ago

vnkc1 commented 1 week ago

@andy-yang-1 My team really appreciates your work on Double Sparsity! Thank you for everything you do!

Question - In group_channel_config.py, the scores for identifying outlier channels are decided via q * k, which is the dot product of an individual token's query and key vectors.

However, it is possible that an outlier (or feature detection) occurs via channel C' during the interaction of the query vector of token T and the key vector of, say, token T-3, while this channel C' does not show strong outlier behavior when the query and key vectors of the same token (either T or T-3) are multiplied.

I am curious about how the offline calibration approach works if only interactions between the query and key vectors of the same token are considered when measuring the strength of outlier channels. Thanks!
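To make the concern concrete, here is a minimal NumPy sketch of the two scoring schemes on toy data. It is not the code in group_channel_config.py; the function names, the use of the mean absolute product as the score, and the toy values are all assumptions chosen for illustration.

```python
import numpy as np

def same_token_channel_scores(q, k):
    """Hypothetical score: mean |q[t, c] * k[t, c]| over tokens t (same token only)."""
    return np.abs(q * k).mean(axis=0)

def cross_token_channel_scores(q, k):
    """Hypothetical score: mean |q[t, c] * k[s, c]| over all causal pairs s <= t."""
    num_tokens = q.shape[0]
    # prod[t, s, c] = |q[t, c] * k[s, c]| via broadcasting
    prod = np.abs(q[:, None, :] * k[None, :, :])
    # Lower-triangular mask keeps only pairs where the key token is not in the future
    causal = np.tril(np.ones((num_tokens, num_tokens), dtype=bool))
    return prod[causal].mean(axis=0)

# Toy data (assumed): 4 tokens, 3 channels, small baseline activations.
q = np.full((4, 3), 0.1)
k = np.full((4, 3), 0.1)
# Channel 0 is large in the queries of odd tokens and the keys of even tokens,
# so it never lines up within a single token but does across tokens.
q[1::2, 0] = 5.0
k[0::2, 0] = 5.0
# Channel 1 is moderately large everywhere.
q[:, 1] = 1.0
k[:, 1] = 1.0

same = same_token_channel_scores(q, k)
cross = cross_token_channel_scores(q, k)
print("same-token top channel: ", int(np.argmax(same)))
print("cross-token top channel:", int(np.argmax(cross)))
```

In this toy setup the same-token score ranks channel 1 highest, while the cross-token score ranks channel 0 highest, which is the kind of channel C' described above. Whether this happens in practice on real model activations is exactly the open question.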

andy-yang-1 commented 6 days ago

@vnkc1 Thank you very much for your reply! I think what you said makes sense. In my experiments, I only verified the effect of q * k for the same token, but it’s very likely that also considering q * k across different tokens would yield better results. I will try the method you suggested.