Hi, great codebase and extensive results. I was wondering about your in-painting experiment: since masked generation operates in the latent space of the VQ-VAE, how do you ensure that in-painting, for example, frames [118-150] of a given sequence corresponds to certain specific tokens in the latent space?

https://github.com/EricGuo5513/momask-codes/blob/500ffe6de4b39d13197a3120267d743f8002784c/edit_t2m.py#L131

Thank you for the clarification.
Hi, thank you for your interest.
Firstly, you may have noticed that the correspondence is actually 4 frames to 1 token. Therefore, the in-painting section (e.g., frames [118-150]) is rounded to multiples of 4.
Secondly, regarding the latent-motion correspondence: we construct our VQ-VAE with a shallow 1D convolutional network. A convolutional network inherently preserves the structure of the data (i.e., the sequence order), and the shallow depth further keeps the receptive field relatively small. So, while we can't assert that one token maps precisely to a specific 4-frame motion clip, each token should carry the dominant information of its temporally corresponding 4-frame clip. Masking that token therefore erases the corresponding 4-frame clip for in-painting.
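To make the rounding and masking concrete, here is a minimal sketch assuming the 4x temporal downsampling described above; the helper name, the boolean-mask convention, and the sequence length are illustrative assumptions, not code taken from momask-codes:

```python
# Minimal sketch (not the repository's code) of frame-to-token rounding
# and token masking for in-painting. Assumptions: the helper name, the
# 4x downsampling factor, and True-means-regenerate are illustrative.
import torch

def frames_to_token_span(start_frame: int, end_frame: int, downsample: int = 4):
    """Map an inclusive frame range to the covering half-open token range.

    The start is rounded down and the end rounded up to token boundaries,
    so the returned token span fully covers the requested frames.
    """
    start_tok = start_frame // downsample
    end_tok = (end_frame + downsample) // downsample  # exclusive bound
    return start_tok, end_tok

# Frames 118-150 -> token span [29, 38): tokens 29..37,
# which temporally cover frames 116-151 (rounded to multiples of 4).
start_tok, end_tok = frames_to_token_span(118, 150)

num_tokens = 49  # e.g., a 196-frame sequence at 4 frames per token
mask = torch.zeros(num_tokens, dtype=torch.bool)
mask[start_tok:end_tok] = True  # masked tokens are re-generated (in-painted)
print(start_tok, end_tok, int(mask.sum()))  # -> 29 38 9
```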
I hope this answers your question.
Okay, that matches my understanding. Much clearer now. Thanks!