The (hidden) Mamba attention matrices are extracted per S6 channel. We select 4 representative channels for visualization; similarly, for the transformer, we focus on a single representative head. Note that channels are not aggregated in this figure.
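For clarity, below is a minimal sketch of how a per-channel hidden attention matrix can be materialized by unrolling the S6 recurrence. The shapes and variable names (`A_bar`, `B_bar`, `C`) are assumptions for illustration, not our actual implementation.

```python
import numpy as np

def hidden_attention_channel(A_bar, B_bar, C):
    """Hidden attention matrix for one S6 channel.

    Assumed shapes, for a length-L sequence and state size N:
        A_bar: (L, N)  diagonal of the discretized state matrix at each step
        B_bar: (L, N)  discretized input projection at each step
        C:     (L, N)  output projection at each step
    Unrolling h_t = A_bar[t] * h_{t-1} + B_bar[t] * x_t and y_t = C[t] . h_t gives
        y_t = sum_j alpha[t, j] * x_j,  with
        alpha[t, j] = C[t] . (A_bar[j+1] * ... * A_bar[t] * B_bar[j]).
    """
    L, N = A_bar.shape
    alpha = np.zeros((L, L))
    for t in range(L):
        decay = np.ones(N)
        for j in range(t, -1, -1):
            alpha[t, j] = np.dot(C[t], decay * B_bar[j])
            decay = decay * A_bar[j]  # extend the product of A_bar one step back in time
    return alpha  # lower-triangular: token t only attends to tokens j <= t
```

Each channel yields its own such matrix, which is why the figure shows several columns for Mamba rather than a single aggregated map.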
To present a comparative visualization, it is necessary to map both the Mamba and transformer attention matrices to the same domain. A natural approach is to normalize the Mamba scores to the [0, 1] range, either via softmax (as in standard attention) or via min-max normalization. We acknowledge that softmax can be problematic for negative attention scores. Therefore, in the second version of our paper, we apply min-max normalization to the absolute values of the scores, as illustrated in Figure 4 of that version.
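A minimal sketch of this normalization, assuming it is applied independently to each per-channel matrix:

```python
import numpy as np

def normalize_abs_minmax(attn):
    """Min-max normalize the absolute attention scores of one matrix to [0, 1].

    Softmax over rows would be the alternative, but it distorts negative scores,
    which is why absolute values plus min-max normalization are used here.
    """
    a = np.abs(attn)
    lo, hi = a.min(), a.max()
    if hi == lo:  # constant matrix: avoid division by zero
        return np.zeros_like(a)
    return (a - lo) / (hi - lo)
```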
Thank you for your work. I have gained many useful insights about Mamba from it.
I am a bit confused about why Mamba's attention has 4 columns in Figure 3. Also, how is the attention from Mamba's different channels aggregated?