fmu2 / snag_release

Official Implementation of SnAG (CVPR 2024)

Clarification on the "Scale" Fusion Method Implementation #3

Closed Becomebright closed 1 month ago

Becomebright commented 1 month ago

Hi, I appreciate your work and noticed in Table 3 that the "Scale" fusion outperforms "Cat" and "Add." Could you provide more information on how the "Scale" method is implemented?

Thank you.

fmu2 commented 1 month ago

Scale refers to channel-wise multiplication of video features with a scale vector given by the text encoder, similar to 2D-TAN.

Becomebright commented 1 month ago

Thanks for your quick reply. Based on my understanding, the "Scale" fusion is implemented as follows:

  1. Transform $Z \in \mathbb{R}^{N \times D}$ into a 2D temporal feature map $F \in \mathbb{R}^{N \times N \times D}$ via max-pooling.
  2. Multiply the 2D feature map element-wise with the text embedding $E \in \mathbb{R}^{D}$.

Could you confirm if my understanding is accurate?

fmu2 commented 1 month ago

We do not generate 2D temporal feature maps from Z. We simply prepend a [CLS] token to the text input and use its embedding as the scaling weights E. We then multiply every feature in Z with E.
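To make the description above concrete, here is a minimal NumPy sketch of the channel-wise "Scale" fusion as described: each of the N video features in Z is multiplied channel-wise with the [CLS] embedding E. The function name and shapes are illustrative, not from the official code.

```python
import numpy as np

def scale_fusion(Z, E):
    """Channel-wise "Scale" fusion (illustrative sketch).

    Z: (N, D) video features.
    E: (D,) scale vector, i.e. the [CLS] embedding from the text encoder.
    Broadcasting multiplies every row of Z with E channel-wise.
    """
    return Z * E

# Toy example: N = 4 clips, D = 2 channels.
Z = np.arange(8, dtype=float).reshape(4, 2)
E = np.array([2.0, 0.5])  # hypothetical scale vector
fused = scale_fusion(Z, E)
print(fused.shape)  # (4, 2)
```

The output keeps the original (N, D) shape, unlike the 2D-map variant, so no 2D temporal feature map is ever materialized.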

fmu2 commented 1 month ago

Marking as solved.