smandava98 opened 1 year ago
@smandava98 (and also to the authors) I saw you had previously raised an issue about the motion tokens/vectors. I am a bit confused about what the motion outputs represent. Specifically, when I run the gradio demo with a text prompt, it generates a video and a motion array of size (N, 263), where N is the number of frames in the video. I assume these 263-dimensional vectors encode the skeletal pose for each frame. Is this vector the output of a specific pose estimator model? In other words, if I want a raw motion sequence that matches these dimensions and can be passed to the model as input, how do I obtain that vector? Is there anything related to running SMPL/MoCap that would produce the same input format the model outputs when generating a video?
An explanation would be highly helpful. Thanks in advance :)
So, can I understand that the VQ-GAN's input is a 263-dimensional numerical vector and its output is a motion token from 1 to 512 (since the motion codebook has 512 entries)? Additionally, can any valid 263-dimensional numerical vector determine a 3D rendering? I am new to 3D motion, and I appreciate any guidance in advance.
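(In case it helps frame the question: below is a minimal sketch of the quantization step being asked about, assuming a T2M-GPT-style motion VQ-VAE with a 512-entry codebook. All tensors and shapes here are illustrative, not this repo's actual API.)

```python
import torch

# Illustrative shapes only: a 512-entry codebook of 512-dim codes, as in
# T2M-GPT-style motion VQ-VAEs (an assumption, not this repo's actual API).
codebook = torch.randn(512, 512)   # (num_codes, code_dim)

# Pretend encoder output: the (N, 263) motion is temporally downsampled
# into latent frames of dimension code_dim.
latents = torch.randn(16, 512)     # (num_latent_frames, code_dim)

# Quantization = nearest-neighbour lookup in the codebook.
dists = torch.cdist(latents, codebook)   # (16, 512) pairwise L2 distances
tokens = dists.argmin(dim=1)             # integer motion tokens in [0, 512)
print(tokens.shape)                      # torch.Size([16])
```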
The human motion is represented as a 263-dimensional vector, as in the HumanML3D dataset, and the model is, as far as I know, trained on HumanML3D. The 263-dimensional vector packs together a lot of information; you can refer to the HumanML3D dataset for details. Broadly speaking, HumanML3D transforms the human motion in the AMASS dataset from the SMPL-H representation into this new 263-dimensional vector, and the transformation is not invertible. Starting from a 263-dimensional vector, the 22 joint locations [Time, 22, 3] can be computed, and those joint locations can then be fitted to a human mesh in the SMPL representation (without explicit hand pose). This fitting step is what fit.py does.
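To make that concrete, here is a minimal sketch of the 263-dimensional layout and the joint-recovery step, assuming the `recover_from_ric` helper from the HumanML3D / text-to-motion codebase (the import path may differ per repo):

```python
import numpy as np
import torch
# Path as in the text-to-motion / HumanML3D code; adjust for this repo.
from scripts.motion_process import recover_from_ric

motion = np.load("generated_motion.npy")  # (N, 263), N = number of frames

# Layout of each 263-dim frame (22-joint skeleton, 21 non-root joints):
#   [0:1]     root angular velocity about the Y axis
#   [1:3]     root linear velocity on the XZ plane
#   [3:4]     root height (Y)
#   [4:67]    root-relative joint positions, 21 x 3   (63)
#   [67:193]  joint rotations in 6D form, 21 x 6      (126)
#   [193:259] local joint velocities, 22 x 3          (66)
#   [259:263] binary foot-contact labels              (4)

# Recover global joint positions: (N, 263) -> (N, 22, 3)
joints = recover_from_ric(torch.from_numpy(motion).float(), 22)
print(joints.shape)  # torch.Size([N, 22, 3])
```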
@wzyabcas 22 joints? I thought SMPL had 23 joints. Is this missing a joint then?
Hi, the HumanML3D dataset processes the SMPL-H format data from the AMASS dataset into its 263-dimensional representation. The 22 joints are the first 22 joints of SMPL-H; SMPL-H has another 15 joints for the left hand and 15 joints for the right hand. SMPL has 24 joints. SMPL and SMPL-H share the same vertex indexing for the 6890 body vertices, and the first 22 SMPL joints correspond to the first 22 SMPL-H joints. SMPL doesn't have the precise MANO hand model, so its last 2 joints are rough left- and right-hand joints.
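For reference, a sketch of the shared first 22 body joints (names follow the common SMPL convention; the list below is for illustration, not copied from this repo):

```python
# The first 22 joints of SMPL, identical to the first 22 of SMPL-H.
# Slicing a (T, 24, 3) SMPL or (T, 52, 3) SMPL-H joint array to
# joints[:, :22] keeps exactly these.
SMPL_BODY_JOINTS = [
    "pelvis", "left_hip", "right_hip", "spine1",
    "left_knee", "right_knee", "spine2",
    "left_ankle", "right_ankle", "spine3",
    "left_foot", "right_foot", "neck",
    "left_collar", "right_collar", "head",
    "left_shoulder", "right_shoulder",
    "left_elbow", "right_elbow",
    "left_wrist", "right_wrist",
]  # 22 joints; SMPL adds 2 coarse hand joints, SMPL-H adds 15 per hand
```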
Hi. I have a couple of questions about how motion tokens are fed in during inference and training. I have an array of SMPL parameters (pose, betas, etc.).
Do I have to convert it into a .ply file or a video like in the demo, and is that the only format the model takes? Can it take raw arrays or other file formats? I don't have access to Blender, so I can't use Blender to generate these files.
Do the motion tokens have to fit in a shared environment space? That is, if I have two different motion files for "a person running", do they have to be the exact same motion tokens, or can they be translated a bit (i.e., different x/y coordinates)?