
OpenTMA: Open Text-Motion Alignment Project

🕺 Reproduced by Ling-Hao Chen and Shunlin Lu (credit also to TMR and SwanHub).

โ—๏ธ[Highlight]: We provide a demo for the OpenTMA in HumanTOMATO. The demo is supported by the SwanHub engineering team. Hav a try!

✨ Quick Introduction

OpenTMA is a project that aims to provide a simple and efficient way to align text and motion data in a shared latent space. It is designed to be easy to use and flexible.

In the HumanTOMATO (ICML 2024) project, we clarify for the first time how important it is to use text and motion data properly when generating motions. We highlight two methods (a rough sketch follows the list):

  • Replace your CLIP text encoder with the OpenTMA text encoder.
  • Introduce text-motion alignment supervision into your motion generation model during training.
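
The sketch below illustrates how both ideas could plug into an existing text-to-motion training loop. It is only a minimal sketch: the symmetric InfoNCE-style contrastive loss and the `lambda_align` weight are illustrative assumptions, not the exact HumanTOMATO training objective; the encoder classes and their `.loc` outputs follow the Usage example further below.

```python
# Minimal sketch: OpenTMA encoders as a drop-in text encoder plus an
# alignment loss term. The contrastive objective below is an assumption.
import torch
import torch.nn.functional as F

from tma.models.architectures.temos.textencoder.distillbert_actor import DistilbertActorAgnosticEncoder
from tma.models.architectures.temos.motionencoder.actor import ActorAgnosticEncoder

textencoder = DistilbertActorAgnosticEncoder('distilbert-base-uncased', num_layers=4)
motionencoder = ActorAgnosticEncoder(nfeats=126, vae=True, num_layers=4)

def alignment_loss(texts, motions, lengths, temperature=0.1):
    # (1) Text embeddings come from the OpenTMA text encoder instead of CLIP.
    z_text = textencoder(texts).loc                 # (B, D)
    z_motion = motionencoder(motions, lengths).loc  # (B, D)
    # (2) Alignment supervision: pull matched text/motion pairs together with
    #     a symmetric InfoNCE-style loss (illustrative choice).
    z_text = F.normalize(z_text, dim=-1)
    z_motion = F.normalize(z_motion, dim=-1)
    logits = z_text @ z_motion.t() / temperature    # (B, B) similarity matrix
    labels = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))

# During generator training (hypothetical loss weight):
# total_loss = generation_loss + lambda_align * alignment_loss(texts, motions, lengths)
```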

📢 News

โ˜‘๏ธ Todo List

🚀 Quick start

Installation

pip install -r requirements.txt

Downloading Pretrained Checkpoints

We provide some pretrained checkpoints of OpenTMA for evaluation. There are two ways to download them: 1) from Google Drive, or 2) from Baidu Drive (pwd: `evan`).

Usage

# Load the text and motion encoders
import torch
from transformers import AutoTokenizer, AutoModel
from tma.models.architectures.temos.textencoder.distillbert_actor import DistilbertActorAgnosticEncoder
from tma.models.architectures.temos.motionencoder.actor import ActorAgnosticEncoder
from collections import OrderedDict

modelpath = 'distilbert-base-uncased'

textencoder = DistilbertActorAgnosticEncoder(modelpath, num_layers=4)
motionencoder = ActorAgnosticEncoder(nfeats=126, vae=True, num_layers=4)

"""
Load the pretrained checkpoint weights here.
You need to normalize the motion data with mean and std.
For Motion-X, they are stored in './deps/t2m/motionx/vector_623/Comp_v6_KLD01/meta/*.npy'.
"""

motion = torch.randn(1, 64, 126)    # B = 1, T = 64, D = 126; real data needs normalization
lengths = [64]
print(textencoder(["a man is running"]).loc)
print(motionencoder(motion, lengths).loc)
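
For real motion data, the normalization mentioned in the comment above would look roughly like the sketch below. The exact file names under the `meta` folder (`mean.npy`, `std.npy`) and their per-dimension shape are assumptions; please check the folder contents.

```python
import numpy as np
import torch

# Assumed file names inside the meta folder; adjust to what is actually there.
mean = np.load('./deps/t2m/motionx/vector_623/Comp_v6_KLD01/meta/mean.npy')
std = np.load('./deps/t2m/motionx/vector_623/Comp_v6_KLD01/meta/std.npy')

raw_motion = np.random.randn(64, 126).astype(np.float32)  # stand-in for a real clip, shape (T, D)
normed = ((raw_motion - mean) / std).astype(np.float32)
motion = torch.from_numpy(normed)[None]                    # (1, 64, 126), normalized
lengths = [motion.shape[1]]
```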

๐Ÿƒ Model Training

1. Data Preparation

Our OpenTMA project supports three datasets: HumanML3D, Motion-X, and UniMoCap.

HumanML3D Data Preparation

Please follow the instructions in the [HumanML3D](https://github.com/EricGuo5513/HumanML3D?tab=readme-ov-file#how-to-obtain-the-data) repository to download and preprocess the data. The data should be stored in the `./datasets/humanml3d` folder. The path tree should look like this:

```
./OpenTMR/datasets/humanml3d/
├── all.txt
├── Mean.npy
├── new_joints/
├── new_joint_vecs/
├── Std.npy
├── test.txt
├── texts/
├── train.txt
├── train_val.txt
└── val.txt
```
Motion-X Data Preparation

Please follow the instructions in the [Motion-X](https://github.com/IDEA-Research/Motion-X?tab=readme-ov-file#-dataset-download) project, and then follow the [HumanTOMATO](https://github.com/IDEA-Research/HumanTOMATO/tree/main/src/tomato_represenation) repository to preprocess the data into the `tomato` format. The data should be stored in the `./datasets/Motion-X` folder. The path tree should look like this:

```
./OpenTMR/datasets/Motion-X
├── mean_std
│   └── vector_623
│       ├── mean.npy
│       └── std.npy
├── motion_data
│   └── vector_623
│       ├── aist/ (subset_*/*.npy)
│       ├── animation/
│       ├── dance/
│       ├── EgoBody/
│       ├── fitness/
│       ├── game_motion/
│       ├── GRAB/
│       ├── HAA500/
│       ├── humanml/
│       ├── humman/
│       ├── idea400/
│       ├── kungfu/
│       ├── music/
│       └── perform/
├── split
│   ├── all.txt
│   ├── test.txt
│   ├── train.txt
│   └── val.txt
└── texts
    └── semantic_texts
        ├── aist/ (subset_*/*.txt)
        ├── animation/
        ├── dance/
        ├── EgoBody/
        ├── fitness/
        ├── game_motion/
        ├── GRAB/
        ├── HAA500/
        ├── humanml/
        ├── humman/
        ├── idea400/
        ├── kungfu/
        ├── music/
        └── perform/
```
UniMoCap Data Preparation

Please follow the instructions in the [UniMoCap](https://github.com/LinghaoChan/UniMoCap) repository to download and preprocess the data (HumanML3D, BABEL, and KIT-ML). The data should be stored in the `./datasets/UniMocap` folder. The path tree should look like this:

```
./OpenTMR/datasets/UniMocap
├── all.txt
├── Mean.npy
├── new_joints/ (*.npy)
├── new_joint_vecs/ (*.npy)
├── Std.npy
├── test.txt
├── texts/ (*.txt)
├── train.txt
├── train_val.txt
└── val.txt
```
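
As an optional sanity check before training, a short script like the one below can verify that a prepared HumanML3D/UniMoCap-style folder matches the trees above. The `root` path and the list of expected entries are taken from the trees; adjust them for your own setup.

```python
from pathlib import Path

# Point `root` at whichever dataset you prepared; the expected entries follow
# the HumanML3D / UniMoCap path trees shown above.
root = Path('./datasets/humanml3d')
expected = ['all.txt', 'Mean.npy', 'Std.npy', 'train.txt', 'val.txt',
            'test.txt', 'train_val.txt', 'new_joints', 'new_joint_vecs', 'texts']
missing = [name for name in expected if not (root / name).exists()]
print('Layout OK' if not missing else f'Missing entries: {missing}')
```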

2. Pretrained Checkpoints Used in the Evaluation

Here, we provide some pretrained checkpoints for the evaluation. There are two ways to download them:

Google Drive

Download the checkpoints from the [Google Drive](https://drive.google.com/drive/folders/1aWpJH4KTXsWnxG5MciLHXPXGBS7vWXf7?usp=share_link) and put them in the `./deps` folder. Please unzip the checkpoints via the following command:

```
unzip *.zip
```

Finally, the path tree should look like this:

```
./deps
├── distilbert-base-uncased/
├── glove/
├── t2m/
└── transforms/
```
Baidu Drive

Download the checkpoints from the [Baidu Drive](https://pan.baidu.com/s/1SIwGDX2aDWTR4hLhUHrPlw?pwd=evan) (pwd: `evan`) and put them in the `./deps` folder. Please extract the checkpoints via the following command:

```
tar -xvf deps.tar
```

Finally, the path tree should look like this:

```
./deps
├── distilbert-base-uncased/
├── glove/
├── t2m/
└── transforms/
```

3. Training

python -m train --cfg configs/configs_temos/H3D-TMR.yaml --cfg_assets configs/assets.yaml --nodebug
python -m train --cfg configs/configs_temos/MotionX-TMR.yaml --cfg_assets configs/assets.yaml --nodebug
python -m train --cfg configs/configs_temos/UniMoCap-TMR.yaml --cfg_assets configs/assets.yaml --nodebug

The checkpoints will be saved in `./experiments/`. If you would like to use debug mode, please remove the `--nodebug` flag. The best checkpoints often appear between the 100th and 500th epochs.

🧪 Test for Evaluation

Before running the command below, please revise `retreival.sh` (e.g., the `path1` variable) to set the correct data paths. This command should be used after training; it evaluates the model's retrieval performance on the test set with text and motion embeddings.

bash retreival.sh

The results will be printed in Markdown table format.

๐Ÿค๐Ÿผ Citation

If you use this repository in your research, please cite:

@article{humantomato,
  title={HumanTOMATO: Text-aligned Whole-body Motion Generation},
  author={Lu, Shunlin and Chen, Ling-Hao and Zeng, Ailing and Lin, Jing and Zhang, Ruimao and Zhang, Lei and Shum, Heung-Yeung},
  journal={arXiv preprint arXiv:2310.12978},
  year={2023}
}
@article{chen2023unimocap,
  title={UniMocap: Unifier for BABEL, HumanML3D, and KIT},
  author={Chen, Ling-Hao and UniMocap, Contributors},
  journal={https://github.com/LinghaoChan/UniMoCap},
  year={2023}
}
@inproceedings{petrovich23tmr,
    title     = {{TMR}: Text-to-Motion Retrieval Using Contrastive {3D} Human Motion Synthesis},
    author    = {Petrovich, Mathis and Black, Michael J. and Varol, G{\"u}l},
    booktitle = {International Conference on Computer Vision ({ICCV})},
    year      = {2023}
}
@InProceedings{Guo_2022_CVPR,
    author    = {Guo, Chuan and Zou, Shihao and Zuo, Xinxin and Wang, Sen and Ji, Wei and Li, Xingyu and Cheng, Li},
    title     = {Generating Diverse and Natural 3D Human Motions From Text},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2022},
    pages     = {5152-5161}
}
@conference{AMASS2019,
  title = {AMASS: Archive of Motion Capture as Surface Shapes},
  author = {Mahmood, Naureen and Ghorbani, Nima and Troje, Nikolaus F. and Pons-Moll, Gerard and Black, Michael J.},
  booktitle = {International Conference on Computer Vision},
  pages = {5442--5451},
  month = oct,
  year = {2019},
  month_numeric = {10}
}

If you have any questions, please contact Ling-Hao Chen (thu [DOT] lhchen [AT] gmail [DOT] com) or Shunlin Lu (shunilnlu0803 [AT] gmail [DOT] com).