Ling-Hao Chen*1, 3, Shunlin Lu*2, 3, Ailing Zeng3, Hao Zhang3, 4, Benyou Wang2, Ruimao Zhang2, Lei Zhang†3
*Co-first author. Listing order is random. †Corresponding author.
1Tsinghua University, 2School of Data Science, The Chinese University of Hong Kong, Shenzhen (CUHK-SZ), 3International Digital Economy Academy (IDEA), 4The Hong Kong University of Science and Technology
This study delves into multi-modality (i.e., video and motion modalities) human behavior understanding by leveraging the powerful capabilities of Large Language Models (LLMs). Diverging from recent LLMs designed for video-only or motion-only understanding, we argue that understanding human behavior requires joint modeling of both videos and motion sequences (e.g., SMPL sequences) to capture nuanced body part dynamics and semantics effectively. In light of this, we present MotionLLM, a straightforward yet effective framework for human motion understanding, captioning, and reasoning. Specifically, MotionLLM adopts a unified video-motion training strategy that leverages the complementary advantages of existing coarse video-text data and fine-grained motion-text data to glean rich spatial-temporal insights. Furthermore, we collect a substantial dataset, MoVid, comprising diverse videos, motions, captions, and instructions. Additionally, we propose MoVid-Bench, with careful manual annotations, for better evaluation of human behavior understanding on video and motion. Extensive experiments show the superiority of MotionLLM in captioning, spatial-temporal comprehension, and reasoning.
We provide a simple online demo for you to try MotionLLM. The following guidance shows how to deploy the demo on your local machine. First, install the dependencies:
```bash
pip install -r requirements.txt
```
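For a fuller picture of local deployment, below is a minimal sketch. It assumes a fresh conda environment with Python 3.10 and a Gradio-style entry script named `app.py`; both the Python version and the script name are assumptions for illustration, not confirmed by this README, so substitute the actual entry point shipped with the repository.

```bash
# Create and activate an isolated environment (assumes conda is installed;
# the Python version is an assumption, not pinned by this README).
conda create -n motionllm python=3.10 -y
conda activate motionllm

# Install the pinned dependencies shipped with the repository.
pip install -r requirements.txt

# Launch the local demo. `app.py` is a hypothetical entry-point name;
# replace it with the actual demo script provided in the repository.
python app.py
```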
The author team would like to thank many people. Qing Jiang helped a lot with parts of the manual annotation on MoVid-Bench and resolved some ethics issues of MotionLLM. Jingcheng Hu provided technical suggestions for efficient training. Shilong Liu and Bojia Zi provided significant technical suggestions on LLM tuning. Jiale Liu, Wenhao Yang, and Chenlai Qian provided significant suggestions for polishing the paper. Hongyang Li helped a lot with the figure design. Yiren Pang provided GPT API keys when our keys were temporarily out of quota. The code is built on the basis of Video-LLaVA, HumanTOMATO, MotionGPT, lit-gpt, and HumanML3D. Thanks to all contributors!
This code is distributed under an IDEA LICENSE. Note that our code depends on other libraries and datasets which each have their own respective licenses that must also be followed.
If you have any questions, please contact us at: thu [DOT] lhchen [AT] gmail [DOT] com and shunlinlu0803 [AT] gmail [DOT] com.
## Citation

```bibtex
@article{chen2024motionllm,
  title={MotionLLM: Understanding Human Behaviors from Human Motions and Videos},
  author={Chen, Ling-Hao and Lu, Shunlin and Zeng, Ailing and Zhang, Hao and Wang, Benyou and Zhang, Ruimao and Zhang, Lei},
  journal={arXiv preprint arXiv:2405.20340},
  year={2024}
}
```