Closed Becomebright closed 1 month ago
Scale refers to channel-wise multiplication of video features with a scale vector given by the text encoder, similar to 2D-TAN.
Thanks for your quick reply. Based on my understanding, the "Scale" fusion is implemented as follows:
Could you confirm if my understanding is accurate?
We do not generate 2D temporal feature maps from Z. We simply prepend a [CLS] token to the text input and use its embedding as the scaling weights E. We then multiply every feature in Z by E.
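A minimal sketch of what this description suggests (not the authors' code): the shapes, variable names, and random inputs below are illustrative assumptions; the only mechanism taken from the reply is using the [CLS] output embedding E as a channel-wise scale vector applied to every video feature in Z.

```python
import numpy as np

rng = np.random.default_rng(0)
num_clips, num_words, d = 16, 10, 8  # hypothetical sizes

# Z: video clip features, one d-dimensional vector per clip.
Z = rng.standard_normal((num_clips, d))

# Text encoder output, assuming a [CLS] token was prepended to the input,
# so row 0 holds the [CLS] embedding.
text_out = rng.standard_normal((1 + num_words, d))

E = text_out[0]   # [CLS] embedding used as the scaling weights
fused = Z * E     # channel-wise multiplication, broadcast over all clips

print(fused.shape)  # (16, 8): same shape as Z, each channel rescaled by E
```

Broadcasting handles the per-clip application, so no 2D temporal feature map is built from Z; the same E rescales every clip feature.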
Mark as solved.
Hi, I appreciate your work and noticed in Table 3 that the "Scale" fusion outperforms "Cat" and "Add." Could you provide more information on how the "Scale" method is implemented?
Thank you.