Holistic-Motion2D
This is the official code release of Holistic-Motion2D: Scalable Whole-body Human Motion Generation in 2D Space by Yuan Wang*, Zhao Wang*, Junhao Gong*, Di Huang, Tong He, Wanli Ouyang, Jile Jiao, Xuetao Feng, Qi Dou, Shixiang Tang†, Dan Xu.
📝 Changelog
- [2024.11.04]: 🔥 Release the whole Holistic-Motion2D dataset. Find it on the Data Page.
- [2024.09.10]: 🔥 Release the introduction and demos of our work, Holistic-Motion2D.
- [2024.06.20]: 🔥 Release a sample dataset. Find it on the Data Page. Code is coming soon!
We present, for the first time, a large-scale human motion benchmark, Holistic-Motion2D, comprising over 1M in-the-wild motion sequences, each paired with high-quality whole-body or partial pose annotations and textual descriptions.
- Our Holistic-Motion2D dataset provides not only fine-grained and comprehensive whole-body motion annotations but also high-resolution information on the face and hands. We use multi-source datasets with images of varying resolutions to jointly train a human generative foundation model.
- Holistic-Motion2D covers a rich variety of scenes, ranging from professional sports (playing tennis, skiing) and everyday actions (getting a haircut, brushing teeth) to complex human-scene interactions (lying down, wall push-ups), and captures diverse environments such as indoor and outdoor landscapes and dynamic action scenes.
- Holistic-Motion2D contains $10\times$ more videos than Motion-X, the previously largest 3D motion dataset. Compared with Motion-X, Holistic-Motion2D features more sophisticated actions, longer motion sequences, and greater diversity.
- Holistic-Motion2D is collected from 11 public video datasets and two image datasets annotated with whole-body poses. Across 1M in-the-wild motion sequences, it delivers 1M sequence-level 2D whole-body pose annotations complemented by 1M semantic descriptions.
Dataset Collection and Processing
- The data collection pipeline covers holistic 2D motion annotation and caption generation. It involves several pivotal stages: 1) gathering large-scale videos, 2) annotating 2D whole-body keypoints and confidence scores, 3) filtering for high-quality motion sequences, 4) designing text prompts via a large language model, 5) generating descriptive captions for sequence-level movements, and 6) performing manual inspection.
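As a rough illustration of the filtering stage (step 3), the sketch below keeps only sequences that are long enough and confidently tracked. The array layout, keypoint count, and thresholds are our own assumptions for illustration, not the released pipeline.

```python
import numpy as np

# Hypothetical layout: one sequence is an array of shape (T, J, 3),
# where each keypoint stores (x, y, confidence); J = 133 would match
# COCO-WholeBody-style whole-body annotations (an assumption here).
def is_high_quality(sequence: np.ndarray,
                    min_mean_conf: float = 0.5,
                    min_frames: int = 16) -> bool:
    """Keep a motion sequence only if it is long enough and confidently tracked."""
    if sequence.shape[0] < min_frames:
        return False
    return float(sequence[..., 2].mean()) >= min_mean_conf

# Toy usage: filter a list of candidate sequences.
candidates = [np.random.rand(64, 133, 3), np.random.rand(8, 133, 3)]
kept = [seq for seq in candidates if is_high_quality(seq)]
print(f"kept {len(kept)} of {len(candidates)} sequences")
```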
All data can be downloaded from OpenDataLab:
| Path | Size | Files | Format | Description |
|------|------|-------|--------|-------------|
| Holistic-Motion-2D-dataset | 118.15 GB | 1,464,278 | | Main folder |
| ├ kpfiles | 118.06 GB | 400,790 | | Keypoint sequences for character motion |
| ├ ├ UCF101 | 5.03 GB | 11,391 | Pickle | Whole-body keypoints for UCF101 |
| ├ ├ CAER | 153.81 MB | 3,542 | Pickle | Facial keypoints for CAER |
| ├ ├ K400 | 55.54 GB | 152,798 | Pickle | Whole-body keypoints for Kinetics-400 |
| ├ ├ InternVid | 44.02 GB | 85,665 | Pickle | Whole-body keypoints for InternVid |
| ├ ├ K700 | 0 | 0 | Pickle | Whole-body keypoints for Kinetics-700 |
| ├ ├ IDEA400 | 6.33 GB | 12,025 | Pickle | Whole-body keypoints for IDEA400 |
| ├ ├ sthv2 | 900.19 MB | 106,661 | Pickle | Hand keypoints for Something-Something V2 |
| ├ ├ UBody | 3.21 GB | 5,195 | Pickle | Whole-body keypoints for UBody |
| ├ ├ DFEW | 1.68 GB | 15,524 | Pickle | Facial keypoints for DFEW |
| ├ texts | 101.05 MB | 1,063,488 | | Captions for character motion videos |
| ├ ├ UCF101 | 4.68 MB | 24,711 | TXT | Texts for UCF101 |
| ├ ├ CAER | 32.16 KB | 4,574 | TXT | Texts for CAER |
| ├ ├ K400 | 40.6 MB | 215,479 | TXT | Texts for Kinetics-400 |
| ├ ├ InternVid | 20.52 MB | 421,894 | TXT | Texts for InternVid |
| ├ ├ K700 | 22.75 MB | 141,611 | TXT | Texts for Kinetics-700 |
| ├ ├ IDEA400 | 2.96 MB | 12,025 | TXT | Texts for IDEA400 |
| ├ ├ sthv2 | 8.05 MB | 220,848 | TXT | Texts for Something-Something V2 |
| ├ ├ UBody | 1.01 MB | 5,974 | TXT | Texts for UBody |
| ├ ├ DFEW | 456.1 KB | 16,372 | TXT | Texts for DFEW |
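The snippet below is a minimal loading sketch. The file names, and the assumption that each Pickle stores per-frame 2D whole-body keypoints with confidence scores while each TXT holds a sequence-level caption, are ours; check the released files for the exact structure.

```python
import pickle

# Hypothetical paths; substitute a real sequence ID from the release.
kp_path = "kpfiles/UCF101/example_sequence.pkl"
txt_path = "texts/UCF101/example_sequence.txt"

# Keypoints: assumed to contain per-frame 2D whole-body joints with confidence scores.
with open(kp_path, "rb") as f:
    keypoints = pickle.load(f)

# Caption: assumed to be a plain-text, sequence-level motion description.
with open(txt_path, "r", encoding="utf-8") as f:
    caption = f.read().strip()

print(type(keypoints), caption)
```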
2D Text-driven Whole-body Motion Generation Model
- Our Text-drivEN whole-boDy motion genERation model (Tender) is tailored for 2D whole-body human motion synthesis. It incorporates two novel designs that enhance the quality of generated motion: Part-aware Attention for the Motion Variational Auto-Encoder (PA-VAE) and Confidence-Aware Generation (CAG).
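As one hedged illustration of the general idea behind Confidence-Aware Generation, a reconstruction loss can be weighted by per-keypoint annotation confidence so that uncertain (e.g., occluded) joints contribute less. The tensor shapes and weighting scheme below are our assumptions for a sketch, not Tender's actual implementation.

```python
import torch

def confidence_weighted_l2(pred: torch.Tensor,
                           target: torch.Tensor,
                           conf: torch.Tensor) -> torch.Tensor:
    """L2 reconstruction loss down-weighted by per-keypoint annotation confidence.

    pred, target: (B, T, J, 2) predicted / ground-truth 2D keypoints.
    conf:         (B, T, J)    confidence scores in [0, 1].
    """
    per_joint = ((pred - target) ** 2).sum(dim=-1)       # (B, T, J)
    weighted = conf * per_joint                           # uncertain joints matter less
    return weighted.sum() / conf.sum().clamp(min=1e-6)    # normalize by total confidence

# Toy usage with random tensors.
B, T, J = 2, 16, 133
loss = confidence_weighted_l2(torch.randn(B, T, J, 2),
                              torch.randn(B, T, J, 2),
                              torch.rand(B, T, J))
print(loss.item())
```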
*Qualitative comparison of motions generated by MDM, MLD, T2M-GPT, and Tender (Ours).*
- Notably, Tender consistently outperforms these baselines, generating more vivid and lifelike human motion sequences. Tender not only captures the nuanced dynamics of human movement but also improves the fidelity and temporal consistency of the generated motions.
Downstream Applications
- Using MagicAnimate, we animate a human character with our generated pose sequences, producing exceptionally lifelike and fluid animations that demonstrate the seamless integration of Tender with video generation tools (see the rendering sketch after this list).
- We employ MotionBERT to lift these 2D human motions into 3D space, showcasing our model's ability to support complex 3D pose estimation. The lifted 3D motions maintain a high degree of smoothness and fidelity, making them suitable for virtual reality (VR) and augmented reality (AR) applications, where immersive and accurate 3D representations are essential.
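As a rough sketch of how a generated 2D keypoint sequence might be turned into per-frame pose images for an animation tool such as MagicAnimate (which conditions on pose renderings), the code below draws each frame's keypoints as dots with OpenCV. The image size, keypoint layout, and dot-only rendering are our simplifying assumptions; a real conditioning input would draw the full skeleton in whatever format the tool expects.

```python
import numpy as np
import cv2

def render_pose_frames(sequence: np.ndarray, size=(512, 512)):
    """Render a (T, J, 3) keypoint sequence as one pose image per frame.

    Assumes (x, y) are pixel coordinates within `size` and the third channel
    is a confidence score; low-confidence points are skipped.
    """
    frames = []
    for kps in sequence:
        canvas = np.zeros((size[1], size[0], 3), dtype=np.uint8)
        for x, y, conf in kps:
            if conf < 0.3:
                continue
            cv2.circle(canvas, (int(x), int(y)), 3, (0, 255, 0), -1)
        frames.append(canvas)
    return frames

# Toy usage: 8 frames of 133 random keypoints, saved as PNG pose maps.
demo = np.concatenate([np.random.rand(8, 133, 2) * 512,
                       np.random.rand(8, 133, 1)], axis=-1)
for i, frame in enumerate(render_pose_frames(demo)):
    cv2.imwrite(f"pose_{i:03d}.png", frame)
```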
License
- Data License Confirmation and Author Responsibility.
- All of Holistic-Motion2D (including the JSON metadata, download script, and documentation) is distributed under the CC BY-NC-SA (Attribution-NonCommercial-ShareAlike) license to ensure its legitimate and widespread use.
- For the sub-datasets of Holistic-Motion2D, we ask users to read the original license of each source dataset; we provide our annotations only to users who have obtained approval from the original institutions. We confirm that Holistic-Motion2D does not contain any personally identifiable information or offensive content.
- You can use, redistribute, and adapt it for non-commercial purposes, as long as you (a) give appropriate credit by citing our paper, (b) indicate any changes that you've made, and (c) distribute any derivative works under the same license.
- Code License. The code for pre-processing and for training our Tender model is released under the MIT License. Please refer to the GitHub repository for license details.