Hi, great codebase and extensive results. I was wondering about your in-painting experiment: since masked generation operates in the latent space of the VQ-VAE, how do you ensure that in-painting, for example, frames [118-150] of a given sequence corresponds to certain specific tokens in the latent space?

https://github.com/EricGuo5513/momask-codes/blob/500ffe6de4b39d13197a3120267d743f8002784c/edit_t2m.py#L131

Thank you for the clarification.
Hi, thank you for your interest.
Firstly, you may have noticed that the correspondence is actually 4 frames to 1 token. Therefore, the in-painting section (e.g., frames [118-150]) is rounded to multiples of 4.
Secondly, regarding the latent-motion correspondence: we construct our VQ-VAE with a shallow 1D convolutional network. A convolutional network inherently preserves the structure of the data (i.e., the sequence order), and the shallow depth further keeps the receptive field relatively small. So, while we can't assert that one token maps precisely to a specific 4-frame motion clip, each token should carry the dominant information of its temporally corresponding 4-frame clip. Masking that token therefore erases the corresponding 4-frame clip for in-painting.
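To make the rounding and masking concrete, here is a minimal sketch assuming the 4x temporal downsampling described above; the helper name, the boolean-mask convention, and the sequence length are illustrative assumptions, not code taken from momask-codes:

```python
# Minimal sketch (not the repository's code) of frame-to-token rounding
# and token masking for in-painting. Assumptions: the helper name, the
# 4x downsampling factor, and True-means-regenerate are illustrative.
import torch

def frames_to_token_span(start_frame: int, end_frame: int, downsample: int = 4):
    """Map an inclusive frame range to the covering half-open token range.

    The start is rounded down and the end rounded up to token boundaries,
    so the returned token span fully covers the requested frames.
    """
    start_tok = start_frame // downsample
    end_tok = (end_frame + downsample) // downsample  # exclusive bound
    return start_tok, end_tok

# Frames 118-150 -> token span [29, 38): tokens 29..37,
# which temporally cover frames 116-151 (rounded to multiples of 4).
start_tok, end_tok = frames_to_token_span(118, 150)

num_tokens = 49  # e.g., a 196-frame sequence at 4 frames per token
mask = torch.zeros(num_tokens, dtype=torch.bool)
mask[start_tok:end_tok] = True  # masked tokens are re-generated (in-painted)
print(start_tok, end_tok, int(mask.sum()))  # -> 29 38 9
```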
I hope this answers your question.
Okay, that matches my understanding. Much clearer now. Thanks!