IDEA-Research / Motion-X

[NeurIPS 2023] Official implementation of the paper "Motion-X: A Large-scale 3D Expressive Whole-body Human Motion Dataset"
https://motion-x-dataset.github.io

Question about 15.6M frame-level whole-body pose description #35

Open simon3dv opened 8 months ago

simon3dv commented 8 months ago

Hi authors, I am amazed by your work. I notice you use face recognition, PoseScript, and HandScript to generate the 15.6M frame-level descriptions, and I have some questions about them.

  1. How do you generate frame-level descriptions for an RGB video where only part of the body is shown, for example, (a) only the upper body is visible, (b) only the face and shoulders are visible, (c) part of the body is self-occluded or occluded by loose clothing or obstacles?
  2. How do you generate frame-level descriptions for an RGB video with multiple persons? Or do you just delete all videos with multiple persons?
  3. Are these descriptions used in one of your experiments to validate that they are correct, or are they meant for new applications? Tab. 4 (text-driven motion generation) seems not to support frame-level descriptions, since those methods always require at least 24 frames as input. Do you aggregate all frame-level descriptions into one video-level description (and if so, how)? Is Tab. 6 the only experiment related to frame-level descriptions? (Besides, how do you compute the FID in Tab. 6?)
ailingzengzzz commented 8 months ago

Hi @simon3dv,

  1. We first estimate SMPL-X from videos and then translate the SMPL-X parameters into text via PoseScript. To guarantee motion annotation quality, we first process and filter the online videos so that, as much as possible, the whole body is visible. For the invisible parts, we ignore the textual description.
  2. We only estimate and track one person for a motion sequence.
  3. How to utilize frame-level descriptions for motion generation or other applications is an open question. You can refer to PoseScript and PoseGPT for single-frame pose applications. Besides, for Table 6, we simply sample one sentence of a frame-level pose description to control a partial pose in that frame, instead of conditioning on all frames (see the sketch after this list).
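
A minimal, hypothetical sketch of how the per-frame filtering (item 1) and the Table 6 sentence sampling (item 3) could fit together. The names `filter_caption` and `sample_control_sentence` are illustrative, and the part-to-sentence dictionary stands in for the output of a PoseScript-style captioner; neither the SMPL-X estimator nor the captioner itself is shown here:

```python
import random

# Hypothetical illustration only: `part_sentences` stands in for the
# output of a PoseScript-style captioner (one sentence per body part).

def filter_caption(part_sentences: dict[str, str],
                   visible_parts: set[str]) -> list[str]:
    """Keep only sentences about visible parts; descriptions of
    invisible parts are ignored."""
    return [s for part, s in part_sentences.items() if part in visible_parts]

def sample_control_sentence(sentences: list[str]) -> str | None:
    """Sample one sentence of the frame-level description to control
    a partial pose in that frame (cf. Table 6)."""
    return random.choice(sentences) if sentences else None

# Toy frame where only the right hand and torso are visible:
caption = {
    "right hand": "the right hand is raised above the shoulder",
    "left hand":  "the left hand rests on the hip",
    "torso":      "the torso leans slightly forward",
}
kept = filter_caption(caption, visible_parts={"right hand", "torso"})
print(sample_control_sentence(kept))
```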

Besides, we have updated the frame-level textual descriptions for each whole-body pose. Please download them here and refer to the usage guidance in PoseTEXT_README.

simon3dv commented 7 months ago

Thanks! By the way, I am using PoseScript to generate captions for some open datasets (3DPW, DNA-Rendering, ...). The PoseScript auto-captioning pipeline normalizes SMPL to orient_y=0 (facing the camera) before captioning, which causes problems when extending it to caption images with invisible body parts. I tried using keypoint confidence from RTMPose to detect which parts are invisible; the confidence does help, but it is not robust enough to represent real visibility. I also tried treating inconsistent results between RTMPose and SSLPose as invisibility, but the results are still bad. Here are some examples:

(a) [image] The left hand is occluded and the right hand is visible, but the right elbow is incorrectly detected as occluded.

(b) [image] The left hand is not occluded, but it is detected as occluded.

Do you have a solution to fix the captions in these two cases? Did you also use confidence from pose detection to judge which parts are invisible?
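
For reference, here is a minimal sketch of the two signals described above (per-keypoint confidence thresholding plus RTMPose-vs-SSLPose disagreement). The threshold values are made up for illustration, and, as noted, neither signal is robust on its own:

```python
import numpy as np

CONF_THRESH = 0.3   # hypothetical per-keypoint confidence cutoff
DIST_THRESH = 25.0  # hypothetical pixel-disagreement cutoff

def visibility_mask(kpts_a: np.ndarray, conf_a: np.ndarray,
                    kpts_b: np.ndarray, conf_b: np.ndarray) -> np.ndarray:
    """Boolean mask (True = visible) per keypoint.

    kpts_*: (K, 2) pixel coordinates from two detectors,
            e.g. RTMPose and SSLPose on the same image.
    conf_*: (K,) per-keypoint confidences.
    """
    # Signal 1: both detectors must be reasonably confident.
    confident = (conf_a > CONF_THRESH) & (conf_b > CONF_THRESH)
    # Signal 2: the detectors must roughly agree on the location;
    # large disagreement is treated as a sign of occlusion.
    consistent = np.linalg.norm(kpts_a - kpts_b, axis=-1) < DIST_THRESH
    return confident & consistent
```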