ZcyMonkey / AttT2M

Code of ICCV 2023 paper: "AttT2M: Text-Driven Human Motion Generation with Multi-Perspective Attention Mechanism"
https://arxiv.org/abs/2309.00796
Apache License 2.0

Question on the VQVAE spatial encoder design #2

Open kingchurch opened 6 months ago

kingchurch commented 6 months ago

Thank you for sharing this great work! I have a question regarding the design choice of the VQVAE spatial encoder. Currently, only the encoder includes the spatial transformer to encode the relationships between joints within the same pose, after which the TCN is applied to encode the temporal dependencies. The VQVAE decoder, however, remains unchanged, with just the reverse TCN. Why not add a similar spatial transformer to the VQVAE decoder to map the encoded body-part poses back to the individual joints' poses? On the surface, it seems that the reverse TCN in the decoder needs to do much more work than the TCN in the encoder, since it must decode not only the temporal encoding but also the spatial encoding. Do you consider the decoding problem easier than the encoding problem, and therefore no transformer is needed for the former?
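
For concreteness, here is a minimal sketch of the asymmetry I am asking about (this is not the actual repository code; the module names, feature sizes, and joint layout are placeholders chosen for illustration): the encoder applies per-frame spatial self-attention over joints before a strided temporal convolution, while the decoder only uses transposed temporal convolutions back to per-joint features.

```python
# Minimal sketch of the asymmetric encoder/decoder design (not the AttT2M code).
# All dimensions and module names are assumptions for illustration only.
import torch
import torch.nn as nn

class SpatialTransformerEncoder(nn.Module):
    """Per-frame self-attention across joints, then temporal downsampling (TCN)."""
    def __init__(self, n_joints=22, joint_dim=12, d_model=128, latent_dim=256):
        super().__init__()
        self.joint_proj = nn.Linear(joint_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.spatial_attn = nn.TransformerEncoder(layer, num_layers=2)
        # Strided temporal convolutions over the per-frame pooled features.
        self.tcn = nn.Sequential(
            nn.Conv1d(d_model, latent_dim, kernel_size=4, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv1d(latent_dim, latent_dim, kernel_size=4, stride=2, padding=1),
        )

    def forward(self, x):                              # x: (B, T, n_joints, joint_dim)
        B, T, J, D = x.shape
        tok = self.joint_proj(x.view(B * T, J, D))     # (B*T, J, d_model)
        tok = self.spatial_attn(tok).mean(dim=1)       # pool joints -> (B*T, d_model)
        tok = tok.view(B, T, -1).transpose(1, 2)       # (B, d_model, T)
        return self.tcn(tok)                           # (B, latent_dim, T/4)

class TemporalDecoder(nn.Module):
    """Reverse TCN only: no spatial attention on the way back to joint features."""
    def __init__(self, n_joints=22, joint_dim=12, latent_dim=256):
        super().__init__()
        out_dim = n_joints * joint_dim
        self.rev_tcn = nn.Sequential(
            nn.ConvTranspose1d(latent_dim, latent_dim, kernel_size=4, stride=2, padding=1),
            nn.ReLU(),
            nn.ConvTranspose1d(latent_dim, out_dim, kernel_size=4, stride=2, padding=1),
        )

    def forward(self, z):                              # z: (B, latent_dim, T/4)
        out = self.rev_tcn(z)                          # (B, n_joints*joint_dim, T)
        return out.transpose(1, 2)                     # (B, T, n_joints*joint_dim)
```

In this sketch the decoder's transposed convolutions must recover both the temporal detail and the per-joint spatial structure from the quantized latents, which is what prompted the question.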

ZcyMonkey commented 6 months ago

We considered this at first, but decided against it for several reasons. First, as you mentioned, we believe that encoding is more important than decoding for modeling human motion. This conclusion is drawn from much previous encoder-decoder based human motion modeling work (such as action recognition, motion prediction, and motion generation), where the special design effort is concentrated in the encoder while the decoder is usually kept relatively simple. Second, the transformer is a data-hungry architecture. Training data for the text-to-motion task is still relatively limited, so we were not sure whether adding additional transformer structures to the decoder would benefit the final results. Of course, we have not rigorously verified this, and for various reasons we did not delve deeper into the issue at the time. It may well be a promising way to improve the model, and it is a question worth exploring.