exitudio / MMM

Official repository for "MMM: Generative Masked Motion Model"
https://exitudio.github.io/MMM-page/

About the word_emb for cross attention #8

Open buptxyb666 opened 4 months ago

buptxyb666 commented 4 months ago

Thanks for your great work! I wonder: the text length is usually less than 77 tokens. Why not mask the padding tokens in word_emb when performing cross attention?

exitudio commented 4 months ago

Hi, we use [MASK] tokens for generation by iterative decoding and [PAD] tokens to pad shorter samples to a fixed length. The [PAD] tokens in the CLIP model can be viewed in a similar manner. Since we only use the text tokens as a condition (not for generation), there is no need for [MASK] tokens on the text side.
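To make the distinction concrete, here is a minimal sketch of iterative masked decoding, assuming a `model(tokens, cond)` that returns per-position logits over the motion codebook. The function name, the linear unmasking schedule, and the shapes are illustrative only, not MMM's exact implementation:

```python
import torch

def iterative_decode(model, cond, seq_len, codebook_size, mask_id, steps=10):
    # Start fully masked: every motion token is [MASK].
    tokens = torch.full((1, seq_len), mask_id, dtype=torch.long)
    for step in range(steps):
        logits = model(tokens, cond)          # (1, seq_len, codebook_size)
        probs = logits.softmax(-1)
        conf, pred = probs.max(-1)            # confidence + argmax prediction
        # Already-decided positions keep maximal confidence.
        conf = torch.where(tokens == mask_id, conf, torch.ones_like(conf))
        # Linear schedule (illustrative): fewer positions stay masked each step.
        ratio = (step + 1) / steps
        n_masked = int(seq_len * (1 - ratio))
        # Commit predictions at the masked positions.
        tokens = torch.where(tokens == mask_id, pred, tokens)
        if n_masked > 0:
            # Re-mask the least-confident positions for the next round.
            idx = conf.topk(n_masked, largest=False).indices
            tokens[0, idx[0]] = mask_id
    return tokens
```

The key point is that [MASK] drives generation on the motion side, whereas [PAD] on the text side is only length padding and is never predicted.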

buptxyb666 commented 4 months ago

I mean that when performing cross attention between the word embeddings (keys and values) and the motion tokens (queries), will the [PAD] tokens from CLIP introduce noise into the motion tokens?

Compared with using only the global text condition, does additionally using the fine-grained word embeddings bring a performance gain?

Looking forward to your reply.

exitudio commented 4 months ago

The model should learn to ignore the [PAD] tokens (following CLIP). For reference, to get the global (sentence) text embedding, CLIP simply applies a linear projection to the local (word) embeddings: https://github.com/openai/CLIP/blob/main/clip/model.py#L343-L356
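The linked CLIP code boils down to the following: the sentence embedding is the word embedding taken at the end-of-text position, passed through a learned linear projection. A paraphrased sketch, with illustrative shapes (the tensors here are random stand-ins, not a real CLIP model):

```python
import torch

batch, seq_len, width, embed_dim = 2, 77, 512, 512
x = torch.randn(batch, seq_len, width)          # word embeddings after the text transformer
text = torch.randint(1, 100, (batch, seq_len))  # token ids; EOT has the highest id in CLIP's vocab
text_projection = torch.randn(width, embed_dim) # learned projection matrix

# Take features at the EOT token (argmax over token ids) and project.
sentence_emb = x[torch.arange(batch), text.argmax(dim=-1)] @ text_projection
```

So the global condition is itself derived from the local embeddings; [PAD] positions contribute only through whatever the transformer has learned to do with them.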

We create a wrapper class here: https://github.com/exitudio/MMM/blob/2f7e3b25234a7fd0de32c7773eb5c39453500d66/train_t2m_trans.py#L76-L80

Using the local text embeddings shows a trade-off between R-precision and FID. Please see Table 9 in the supplementary material.
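For completeness, the alternative the question raises, explicitly excluding [PAD] positions from the keys/values, can be sketched with PyTorch's built-in `key_padding_mask`. Dimensions and the padding boundary are illustrative; this is not MMM's code:

```python
import torch
import torch.nn as nn

d_model, n_heads, n_motion, n_text = 64, 4, 16, 77
attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

motion_tokens = torch.randn(1, n_motion, d_model)   # queries
word_emb = torch.randn(1, n_text, d_model)          # keys/values from the text encoder
pad_mask = torch.zeros(1, n_text, dtype=torch.bool)
pad_mask[:, 10:] = True                             # suppose positions 10..76 are [PAD]

# With the mask, [PAD] positions receive zero attention weight.
out, _ = attn(motion_tokens, word_emb, word_emb, key_padding_mask=pad_mask)
```

The thread's answer is that this explicit mask is not required in practice, since the model learns to down-weight [PAD] keys on its own.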