ChenFengYe / motion-latent-diffusion

[CVPR 2023] Executing your Commands via Motion Diffusion in Latent Space, a fast and high-quality motion diffusion model
https://chenxin.tech/mld/
MIT License

How to add my own data to an existing database and train VAE/MLD #7

Open mmdrahmani opened 1 year ago

mmdrahmani commented 1 year ago

Dear Chen, thanks for the amazing work. I have a question regarding the databases used for training the VAE and MLD. I assume that if the training database is more diverse, the MLD-generated actions will become more diverse. Is it possible to combine multiple public databases, such as HumanML3D, NTU, HumanAct12Pose, UESTC, ..., plus my personal data, to train the VAE/MLD? If so, could you explain the current data structure? For example, should we reshape the pose data for each action sample into [n_frames, n_joints, xyz] and then combine the different datasets? Thank you, Mohammad

ChenFengYe commented 1 year ago

Hi Mohammad, sorry for not getting back to you sooner. I agree that multiple diverse databases can lead to a stronger motion generation framework. Combining public databases is possible because many datasets are based on the SMPL representation. I have several suggestions.

https://arxiv.org/pdf/2212.04048.pdf (image attached)
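For illustration only (not from this repo), here is a minimal sketch of what combining datasets into a shared joint-position format could look like: every clip is converted to [n_frames, n_joints, xyz] and the lists are concatenated before normalization. The file paths, loader, and joint count are placeholders/assumptions.

```python
import numpy as np

N_JOINTS = 22  # SMPL-based body skeleton used by HumanML3D

def load_as_joint_positions(path: str) -> np.ndarray:
    """Load one clip and return joints of shape [n_frames, N_JOINTS, 3]."""
    motion = np.load(path)  # assumes the file already stores xyz joint positions
    assert motion.ndim == 3 and motion.shape[1:] == (N_JOINTS, 3), motion.shape
    return motion.astype(np.float32)

# Each dataset becomes a list of [n_frames, 22, 3] clips; the combined list can
# then go through the same preprocessing/normalization step for VAE training.
humanml3d_clips = [load_as_joint_positions(p) for p in ["humanml3d/000001.npy"]]
my_clips = [load_as_joint_positions(p) for p in ["my_data/clip_000.npy"]]
combined = humanml3d_clips + my_clips
```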

mmdrahmani commented 1 year ago

Thanks a lot for the details. I am going to use 3D joint positions (e.g. [n_frames, n_joints, XYZ]) for training the VAE. Based on your description, the VAE should still work with 3D joint positions. For now, I am not concerned with fitting the 3D joint positions to the SMPL model. I might also try 2D joint positions (e.g. [n_frames, n_joints, XY]). I will update the VAE training results later. Best

mmdrahmani commented 1 year ago

Hi again, cc @linjing7. Another question: the HumanML3D data that I have includes only 3D joint positions (obtained following the HumanML3D instructions), so there are no velocities or rotations. I was wondering if I can use 3D joint positions for VAE training, and if so, which parts of the code and config files I have to modify? Thanks

ChenFengYe commented 1 year ago

Actually, you do not need to modify anything specific. The original feature dimension of the HumanML3D format is 263; you should only pay attention to the embedding of this feature dimension. It should work well even if you do not change anything. https://github.com/ChenFengYe/motion-latent-diffusion/blob/17ba9e3ca881cee0c107c4ab02a56f8d751cd717/mld/models/architectures/mld_vae.py#L139

I am not sure about the performance or motion quality with this downgraded feature (263 => 3*22 = 66); it could get a little bit worse.
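To make the dimension point concrete, here is a rough, self-contained sketch (not the repo's actual pipeline): the VAE's first layer embeds a per-frame feature vector, so switching from the 263-dim HumanML3D features to raw joint positions only changes that feature size to 22*3 = 66. The `latent_dim` value below is an assumption; take the real one from your config.

```python
import torch
import torch.nn as nn

n_frames, n_joints = 120, 22
joints = torch.randn(n_frames, n_joints, 3)         # raw xyz joint positions
features = joints.reshape(n_frames, n_joints * 3)   # [n_frames, 66] per-frame features

latent_dim = 256                                     # assumed; use the value from your config
skel_embedding = nn.Linear(features.shape[-1], latent_dim)  # analogous to the linked embedding layer
tokens = skel_embedding(features)                    # [n_frames, 256]
print(tokens.shape)
```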

mmdrahmani commented 1 year ago

Thanks @ChenFengYe. I could actually run the VAE training on the HumanML3D data (of shape [n_frames, n_joints, XYZ]), and checkpoints are saved in the experiments folder. I think I can specify these checkpoints (.ckpt files) as my model in the diffusion step, in the PRETRAINED_VAE section of config_mld_humanml3d.yaml. Is that correct?

I also have another basic question. I would like to understand the latent space of the VAE, i.e. what the model has learned. Essentially, I am assuming that if we could visualize the latent space, different actions would be clustered in different locations. For example, see the attached figure from my analysis of the latent space of a simple VAE trained on MNIST data: the 10 digits are clearly clustered. I hope this kind of analysis is possible with the MLD VAE. (Maybe I should open a new issue?)

I would appreciate it if you could give me some advice on this point. Thanks

(figure attached: vae_mnist_visualized)
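For reference, a minimal sketch of that kind of analysis for motion latents, assuming the trained VAE has already been used to encode each clip into one row of a `latents` array of shape [n_samples, latent_dim], with an integer action label per clip; the file names are placeholders.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

latents = np.load("latents.npy")   # [n_samples, latent_dim], one row per encoded clip
labels = np.load("labels.npy")     # [n_samples], integer action label per clip

coords = PCA(n_components=2).fit_transform(latents)  # project latents to 2-D
scatter = plt.scatter(coords[:, 0], coords[:, 1], c=labels, cmap="tab10", s=5)
plt.legend(*scatter.legend_elements(), title="action")
plt.title("VAE latent space (PCA projection)")
plt.savefig("latent_space.png", dpi=200)
```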

ChenFengYe commented 1 year ago

Yes, that is correct. You can put your VAE path at the line below for the diffusion training. https://github.com/ChenFengYe/motion-latent-diffusion/blob/279e0167c5370671f7ab523ae56310ec43e24439/configs/config_mld_humanml3d.yaml#L33
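As an optional sanity check (not part of the repo), one can load the saved .ckpt and peek at its weights before pointing the diffusion config at it; the path is a placeholder, and the "state_dict" nesting assumes a PyTorch Lightning-style checkpoint.

```python
import torch

ckpt = torch.load("path/to/your_vae.ckpt", map_location="cpu")
state_dict = ckpt.get("state_dict", ckpt)  # Lightning-style checkpoints nest weights here
print(f"{len(state_dict)} tensors")
print(list(state_dict)[:10])               # peek at the first few parameter names
```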

It is quite an interesting direction to discuss the understanding of the latent space. I have started a new issue for this discussion here:

#12 Visualization and understanding of latent space

mmdrahmani commented 1 year ago

Got it. Thank you very much for your continuous support.