Closed: Githubhgh closed this issue 3 months ago.

Hi, thanks for this wonderful work. May I ask why the max length in the t2m_trans training stage is only 50, while the VQ training stage uses 64? Does this affect the final result?

Hi, thank you for your interest in our work. We adopted this setting from T2M-GPT; I have asked the same question here.
First, let me clarify the details: the VQ-VAE is trained on 64-frame windows, which are downsampled to 16 tokens. The Transformer works in token space only, where the maximum length is 49 motion tokens + 1 end token = 50 tokens (49 tokens x 4 = 196 frames).
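To make the arithmetic explicit, here is a minimal sketch assuming a temporal downsampling rate of 4; the constant names are illustrative, not taken from the codebase:

```python
DOWNSAMPLE_RATE = 4        # VQ-VAE temporal downsampling: 4 frames -> 1 token
VQ_WINDOW_FRAMES = 64      # stage-1 (VQ-VAE) training window
MAX_MOTION_FRAMES = 196    # longest motion in the dataset

vq_window_tokens = VQ_WINDOW_FRAMES // DOWNSAMPLE_RATE       # 64 / 4 = 16 tokens
max_motion_tokens = MAX_MOTION_FRAMES // DOWNSAMPLE_RATE     # 196 / 4 = 49 tokens
transformer_max_len = max_motion_tokens + 1                  # + 1 end token = 50

print(vq_window_tokens, max_motion_tokens, transformer_max_len)  # 16 49 50
```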
Here is the explanation:
In the 1st stage, the encoder and decoder of the VQ-VAE are convolutional, so the VQ-VAE can handle dynamic lengths thanks to the sliding window. The locality of convolution restricts it to seeing only adjacent frames, so 64-frame windows are large enough to learn from; this avoids training on 196 frames and padding the short motions just to process every motion in parallel. Note that the shortest motion is 40 frames; we ignore some short motions and observe no performance drop.
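For illustration, a minimal sketch of stage-1 windowed cropping under these assumptions (the helper name and the skipping rule below are hypothetical, not the repository's exact code):

```python
import torch

def sample_vq_window(motion: torch.Tensor, window: int = 64) -> torch.Tensor:
    """Randomly crop a fixed 64-frame window from a motion of shape
    (num_frames, feature_dim) for stage-1 VQ-VAE training."""
    num_frames = motion.shape[0]
    if num_frames < window:
        # Motions shorter than the window are simply skipped in stage 1,
        # per the explanation above.
        raise ValueError("motion shorter than the training window; skip it")
    start = torch.randint(0, num_frames - window + 1, (1,)).item()
    return motion[start:start + window]
```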
In the 2nd stage, the Transformer needs to see the full motion, so the maximum of 49 tokens (i.e., 196 frames) is required, and pad tokens are appended for shorter motions.
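A minimal sketch of that stage-2 padding, assuming one id is reserved for the end token and one for padding (the ids, names, and codebook size below are assumptions, not the repository's actual values):

```python
import torch

CODEBOOK_SIZE = 512            # assumed VQ codebook size
END_TOKEN = CODEBOOK_SIZE      # assumed id reserved for the end-of-motion token
PAD_TOKEN = CODEBOOK_SIZE + 1  # assumed id reserved for padding
MAX_TOKEN_LEN = 50             # 49 motion tokens + 1 end token

def pad_token_sequence(tokens: list) -> torch.Tensor:
    """Append the end token, then pad with pad tokens up to MAX_TOKEN_LEN."""
    seq = list(tokens)[:MAX_TOKEN_LEN - 1] + [END_TOKEN]
    seq = seq + [PAD_TOKEN] * (MAX_TOKEN_LEN - len(seq))
    return torch.tensor(seq, dtype=torch.long)

# Example: a 30-token motion becomes 30 motion tokens + 1 end token + 19 pad tokens.
```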
Great, thanks for clarifying.