JunzheJosephZhu / see_hear_feel


About attention score #4

Open weixiang-smart opened 11 months ago

weixiang-smart commented 11 months ago

Hello! @JunzheJosephZhu This is excellent work on multimodal robot learning.

I'm confused about how the attention scores are normalized across all modalities. I would appreciate it if you could describe how the attention score of each modality in Fig. 5 of the paper is calculated. Thank you so much!

JunzheJosephZhu commented 11 months ago

mha_out, weights = self.mha(mlp_inp, mlp_inp, mlp_inp) # [1, batch, D]

This line computes the attention score.

weixiang-smart commented 11 months ago

Thank you for your patient reply. @JunzheJosephZhu The attention weights come out as a matrix of shape 3x3. I would appreciate it if you could describe how this weight matrix is transformed into the attention score of each modality. Thank you so much!

JunzheJosephZhu commented 11 months ago

So, each modality generates a 768- or 512-dimensional token. We concatenate those tokens, treat them as a length-3 sequence, and perform self-attention over it.
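
To make the 3x3 shape concrete, here is a minimal pure-Python sketch of self-attention weights over a length-3 sequence of modality tokens. It is not the repository's code, and it rests on several assumptions: the tokens are toy 4-dimensional vectors standing in for the 768- or 512-dimensional ones, a single head with identity query/key projections replaces the learned multi-head projections of `self.mha`, and the helper names `attention_weights` and `per_modality_score` are hypothetical. In particular, averaging each column of the weight matrix is only one plausible way to reduce it to a per-modality score; the thread does not confirm that Fig. 5 was produced this way.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(tokens):
    """Scaled dot-product self-attention weights for a list of tokens.

    Assumption: single head, identity Q/K projections (no learned weights).
    Row i holds how much query token i attends to every key token,
    so each row sums to 1.
    """
    scale = 1.0 / math.sqrt(len(tokens[0]))
    weights = []
    for q in tokens:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in tokens]
        weights.append(softmax(scores))
    return weights


def per_modality_score(weights):
    """Assumed reduction: average attention received by each key token.

    This is the mean of each column of the weight matrix. Because every
    row sums to 1, the resulting scores also sum to 1.
    """
    n = len(weights)
    return [sum(weights[i][j] for i in range(n)) / n for j in range(n)]


# Toy tokens, one per modality (made-up values for illustration only).
tokens = {
    "vision": [1.0, 0.0, 0.5, 0.2],
    "audio": [0.1, 0.9, 0.0, 0.3],
    "touch": [0.4, 0.4, 0.4, 0.4],
}
names = list(tokens)

W = attention_weights([tokens[name] for name in names])  # 3x3 matrix
scores = per_modality_score(W)                           # length-3 list

for name, score in zip(names, scores):
    print(f"{name}: {score:.3f}")
```

With PyTorch's `nn.MultiheadAttention`, the returned `weights` are averaged over heads by default and have shape [batch, 3, 3] for this setup, so the same column-mean reduction would be applied per batch element.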


weixiang-smart commented 10 months ago

Thank you for your patient explanation. @JunzheJosephZhu I'm still confused about cross-time attention. For each modality, does the code concatenate the observations from the time steps used and then pass them through the encoder to generate the 768- or 512-dimensional token?