Closed yumianhuli1 closed 6 months ago
@yumianhuli1 No, it uses multiple frames as input. Through its transformer module, it uses information from both previous and future frames to obtain a better 3D estimate for the current frame. Based on equation (2) in the paper, we also use a loss that forces the prediction to be smooth rather than suddenly jumping from one area to another.
The table in README.md also shows the number of input frames used for each variant. E.g., MotionAGFormer-B uses 243 input frames when trained on Human3.6M but 81 for MPI-INF-3DHP.
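As a rough illustration (not the exact loss from the paper — see equation (2) there for the real definition), a velocity-based smoothness penalty of this kind can be sketched as:

```python
import numpy as np

def smoothness_loss(pred_3d):
    """Penalize frame-to-frame jumps in a predicted 3D pose sequence.

    pred_3d: array of shape (T, J, 3) -- T frames, J joints.
    Returns the mean L2 norm of per-joint temporal differences,
    so a perfectly still sequence scores 0.
    """
    velocity = np.diff(pred_3d, axis=0)  # (T-1, J, 3) frame-to-frame deltas
    return float(np.mean(np.linalg.norm(velocity, axis=-1)))
```

Adding a term like this to the training objective discourages jitter between consecutive frames, which is exactly what a single-frame model cannot do.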
@SoroushMehraban thanks! And I would like to ask you a question: if the keypoints of gestures and facial expressions in a video need to be accurately estimated in 3D, does that not fall within the scope of 3D pose estimation? What should I look into? Could you please give me some suggestions?
@yumianhuli1 In case you're interested to model facial expressions in addition to gestures, you have the following options:
In 3D space:
- You can use models that output the SMPL-X representation. Motion-X is a new benchmark that covers facial expressions, hand gestures, and the main body.
- If you're only interested in keypoints and not the shape, you can check H3WB, a new benchmark built from Human3.6M videos that covers facial and hand keypoints. The authors used models that receive only one frame as input and estimate the output. I once tested MotionBERT on it and got significantly better results than what they reported, so you can also do that.
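If you run that comparison yourself, the standard metric on these benchmarks is MPJPE (mean per-joint position error). A minimal version, assuming predictions and ground truth are aligned arrays of 3D joints:

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per-joint position error between predicted and ground-truth
    3D keypoints, both of shape (..., J, 3); result is in the same
    units as the inputs (typically millimeters for Human3.6M/H3WB)."""
    return float(np.mean(np.linalg.norm(pred - gt, axis=-1)))
```

Lower is better; papers usually report it in millimeters after root-relative alignment.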
In 2D Space:
There are lots of models that cover whole-body 2D keypoints. Here are some examples:
Thank you for your help
Hello, I want to ask how MotionBERT fared when trained on H3WB, especially for the hand keypoints. I'm currently trying to implement motion-tracking character animation in Unity using the 2D-to-3D keypoint estimation models currently available. One of my goals is to make finger and hand orientation as accurate as possible, but I have doubts about how precisely MotionBERT can predict that while estimating the whole body.
Thanks for the great work btw.
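For what it's worth, H3WB follows the COCO-WholeBody layout (133 keypoints), so once you have whole-body predictions you can slice out just the hands for a Unity rig. A sketch, assuming that 133-keypoint ordering (verify the index ranges against the dataset you actually use):

```python
import numpy as np

# Assumed COCO-WholeBody ordering -- check against your dataset's docs:
BODY = slice(0, 17)
FEET = slice(17, 23)
FACE = slice(23, 91)
LEFT_HAND = slice(91, 112)    # 21 keypoints
RIGHT_HAND = slice(112, 133)  # 21 keypoints

def split_hands(whole_body_3d):
    """whole_body_3d: (T, 133, 3) sequence of whole-body 3D keypoints.
    Returns (left_hand, right_hand), each of shape (T, 21, 3)."""
    return whole_body_3d[:, LEFT_HAND], whole_body_3d[:, RIGHT_HAND]
```

From the 21 hand keypoints per hand you can then derive per-finger bone rotations on the Unity side.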
Does MotionAGFormer use only the 2D pose information of the current frame to predict 3D information? Would that perform worse than using multiple frames of information before the current frame? Can the use of temporal information improve the smoothness and robustness of the predictions?