As indicated in the paper, sigma is a negative scalar with a large absolute value. It is added to the attention map via the mask map: wherever sigma is added, the value after softmax becomes (approximately) zero, so those areas (which normally have low SNR) are excluded from the attention computation.
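For illustration, here is a minimal PyTorch sketch of that mechanism (the function name, tensor shapes, and the concrete sigma value are assumptions for illustration, not the repository's exact code):

```python
import torch
import torch.nn.functional as F

def snr_masked_attention(q, k, v, snr_mask, sigma=-1e9):
    """Scaled dot-product attention with low-SNR positions masked out.

    q, k, v:  (batch, heads, n_tokens, d_b)
    snr_mask: broadcastable to the score matrix, 1 at high-SNR
              positions and 0 at low-SNR positions
    sigma:    large-magnitude negative scalar (assumed value)
    """
    d_b = q.size(-1)
    scores = torch.matmul(q, k.transpose(-2, -1)) / d_b ** 0.5
    # Add sigma where the mask is 0: exp(score + sigma) ≈ 0, so softmax
    # assigns ~0 weight to low-SNR tokens and they drop out of the sum.
    scores = scores + (1.0 - snr_mask) * sigma
    attn = F.softmax(scores, dim=-1)
    return torch.matmul(attn, v)
```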
Thank you for your clarification.
Equation 6 in the paper is

$$ \operatorname{Softmax}\left(\frac{\mathbf{Q}_{i, b} \mathbf{K}^T_{i, b}}{\sqrt{d_b}} + \left(1-\mathcal{S}^{\prime}\right) \sigma\right) \mathbf{V}_{i, b} $$
I'm confused by the element-wise addition here. Since the SNR map is used to mask out tokens (i.e., patches) with low SNR, shouldn't the element-wise addition be an element-wise product instead? That is:
$$ \operatorname{Softmax}\left(\frac{\mathbf{Q}_{i, b} \mathbf{K}^T_{i, b}}{\sqrt{d_b}} \circ \left(1-\mathcal{S}^{\prime}\right) \sigma\right) \mathbf{V}_{i, b} $$
https://github.com/dvlab-research/SNR-Aware-Low-Light-Enhance/blob/d55031160422be092ec526c465654c436e12a2b5/models/archs/transformer/Modules.py#L15-L24
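For intuition, a quick numeric check (the scores and sigma value below are made up for illustration) of why the addition in Eq. 6 masks tokens while an element-wise product would not:

```python
import torch
import torch.nn.functional as F

scores = torch.tensor([2.0, 1.0, 0.5])  # made-up attention scores
keep = torch.tensor([1.0, 1.0, 0.0])    # S': 1 = high SNR, 0 = low SNR
sigma = -1e9

# Addition (Eq. 6): only the low-SNR position is pushed toward -inf;
# the kept scores are untouched and still compete in the softmax.
print(F.softmax(scores + (1 - keep) * sigma, dim=-1))
# -> tensor([0.7311, 0.2689, 0.0000])

# Element-wise product: the *kept* scores are multiplied by 0 and collapse
# to a uniform distribution, losing the Q·K similarity; worse, a negative
# score at a masked position would flip to a huge positive value and grab
# all the attention.
print(F.softmax(scores * (1 - keep) * sigma, dim=-1))
# -> tensor([0.5000, 0.5000, 0.0000])
```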