ZcyMonkey / AttT2M

Code of ICCV 2023 paper: "AttT2M: Text-Driven Human Motion Generation with Multi-Perspective Attention Mechanism"
https://arxiv.org/abs/2309.00796
Apache License 2.0

Visualization of Motion-word cross-attention #4

Open jack111331 opened 3 months ago

jack111331 commented 3 months ago

Hi, I'd like to know how you visualized the 2D and 3D heatmaps in "Figure 8: Motion-word cross-attention visualization" in your paper. The attention matrix in the CrossAttention module is masked by a lower-triangular mask, so there should be no attention wherever a word token's index exceeds the motion token's index. The figure in the paper, however, doesn't match this causal attention mechanism: it shows each motion frame receiving attention from every word token. So my question is: did you output the attention computed before the lower-triangular mask was applied?

ZcyMonkey commented 3 months ago

> Did you output the attention computed before lower triangle mask?

Yes, you are right: that is exactly what I did during inference for the attention visualization. It may not be completely rigorous, but it was the only method available to me at the time for visualization.
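For anyone reproducing this, the idea can be sketched in a few lines. This is a toy scaled dot-product cross-attention, not AttT2M's actual code: it returns both the pre-mask weights (the version exported for the paper's Figure 8) and the causally masked weights used by the model. All names here are illustrative.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_attention_with_premask(q, k):
    """Toy motion-to-word cross-attention.

    q: (T_motion, d) motion-token queries
    k: (T_word, d)   word-token keys
    Returns (pre_mask, post_mask) attention maps of shape
    (T_motion, T_word). `pre_mask` is softmax over the raw scores,
    i.e. the quantity visualized before the lower-triangular mask;
    `post_mask` is the causally masked attention the model trains with.
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                      # (T_motion, T_word)
    pre_mask = softmax(scores)                         # before masking
    # Lower-triangular (causal) mask: zero out word indices > motion index.
    mask = np.triu(np.ones(scores.shape, dtype=bool), k=1)
    post_mask = softmax(np.where(mask, -np.inf, scores))
    return pre_mask, post_mask
```

With this split, `pre_mask` shows every motion frame attending to every word (as in the figure), while `post_mask` has zeros strictly above the diagonal, matching the causal mechanism the question describes.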