Mael-zys / T2M-GPT

(CVPR 2023) PyTorch implementation of “T2M-GPT: Generating Human Motion from Textual Descriptions with Discrete Representations”
https://mael-zys.github.io/T2M-GPT/
Apache License 2.0

Motion sequences' relevance to input texts #11

Closed · Crane-YU closed this issue 1 year ago

Crane-YU commented 1 year ago

Dear @Mael-zys ,

Thanks for sharing this work. While playing around with a random input, "a man rises from the ground, walks in a circle and dances.", I got a rather odd result in which the rising action is ignored entirely (shown below with the first frame captured). However, when I changed my prompt to your given text, "a man rises from the ground, walks in a circle and sits back down on the ground.", the rising behavior is clearly shown in the first frame. Is this expected behavior?

[attachment: end_with_dance]
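For reference, this is roughly how I drive the pipeline (a minimal sketch of the CLIP-encode, token-sample, VQ-VAE-decode flow from the paper; `load_vqvae` and `load_t2m_transformer` are hypothetical stand-ins for your actual model setup, not the repo's API):

```python
# Minimal sketch of the two-stage pipeline described in the paper:
# CLIP text features -> transformer samples discrete motion tokens -> VQ-VAE decodes poses.
# load_vqvae / load_t2m_transformer are hypothetical stand-ins for the repo's model setup.
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"

clip_model, _ = clip.load("ViT-B/32", device=device)
clip_model.eval()
vqvae = load_vqvae("pretrained/vqvae.pth").to(device).eval()            # hypothetical loader
trans = load_t2m_transformer("pretrained/trans.pth").to(device).eval()  # hypothetical loader

text = "a man rises from the ground, walks in a circle and dances."
with torch.no_grad():
    tokens = clip.tokenize([text]).to(device)
    text_feat = clip_model.encode_text(tokens).float()  # (1, 512) conditioning vector
    motion_idx = trans.sample(text_feat)   # hypothetical: autoregressive token sampling
    motion = vqvae.decode(motion_idx)      # hypothetical: tokens -> (1, T, pose_dim) poses
print(motion.shape)
```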

Mael-zys commented 1 year ago

Hello, this is a common problem in this domain. For long texts, or texts that contain multiple actions, the generated motion may miss some details of the textual description. (We also present one such failure case on our project page.) Note that this type of failure exists for all competitive approaches.

XiSHEN0220 commented 1 year ago

Thanks for your interest in our work! Another reason would be that the training set is not very large (14,616 motion sequences), so the approach is not robust to varied text prompts. In the paper (Sec. 4.3), we also include a brief analysis of the dataset.

Crane-YU commented 1 year ago

Hi @XiSHEN0220 @Mael-zys, thanks for the reply. Following what @Mael-zys said ("For long texts or texts that contain multiple actions, the generated motion might miss some details of the textual description"), I was aware of this problem, so I chose to include only three actions in my input, matching the demo you present on the project page. What strikes me as odd is that "dance" is inside the distribution of the training dataset (HumanML3D), whereas "sits back down" is an out-of-distribution action phrase, yet the model performs better in the second case, on the unseen action.
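For what it's worth, I estimated word frequencies in the HumanML3D captions with a quick script (a rough sketch; the `HumanML3D/texts/*.txt` layout and the `caption#tokens#start#end` line format are assumptions about the standard dataset release):

```python
# Rough frequency check over HumanML3D captions: is "dance" really in-distribution
# while "sit" is rarer? Assumes the standard release layout (one texts/*.txt per motion,
# each line formatted as caption#tokens#start#end).
import glob
import re
from collections import Counter

counts = Counter()
for path in glob.glob("HumanML3D/texts/*.txt"):
    with open(path) as f:
        for line in f:
            caption = line.split("#")[0]  # keep only the raw caption
            counts.update(re.findall(r"[a-z]+", caption.lower()))

for word in ["dance", "dances", "dancing", "sit", "sits", "rise", "rises"]:
    print(f"{word}: {counts[word]}")
```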