FORTH-ModelBasedTracker / MocapNET

We present MocapNET, a real-time method that estimates the 3D human pose directly in the popular Biovision Hierarchy (BVH) format, given estimations of the 2D body joints originating from monocular color images. Our contributions include: (a) a novel and compact 2D pose NSRM representation; (b) a human body orientation classifier and an ensemble of orientation-tuned neural networks that regress the 3D human pose while also allowing for the decomposition of the body into an upper and a lower kinematic hierarchy, which permits the recovery of the human pose even in the case of significant occlusions; (c) an efficient Inverse Kinematics solver that refines the neural-network-based solution, providing 3D human pose estimations that are consistent with the limb sizes of a target person (if known). All of the above yield a 33% accuracy improvement on the Human 3.6 Million (H3.6M) dataset compared to the baseline method (MocapNET) while maintaining real-time performance.
https://www.youtube.com/watch?v=Jgz1MRq-I-k
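
As a quick orientation for readers, the flow the abstract describes (orientation classification, an orientation-tuned ensemble split into upper and lower body, and IK refinement) can be sketched in a few lines. This is a minimal sketch; every type and function name is a hypothetical stand-in, not MocapNET's actual API:

```cpp
// Hypothetical sketch of the pipeline described above; names are
// illustrative stand-ins, not the repository's actual API.
#include <array>
#include <cstdio>
#include <vector>

using Joints2D = std::vector<float>; // flattened 2D joint estimations
using Pose3D   = std::vector<float>; // BVH-style joint rotation values

enum Orientation { FRONT = 0, BACK, LEFT, RIGHT, ORIENTATIONS };

// Each orientation gets its own pair of networks, one per kinematic
// hierarchy (upper/lower body), so an occluded half fails in isolation.
struct HalfBodyNet  { Pose3D predict(const Joints2D &) const { return Pose3D(10, 0.f); } };
struct OrientedNets { HalfBodyNet upper, lower; };

Orientation classifyOrientation(const Joints2D &) { return FRONT; } // stub classifier
Pose3D refineWithIK(Pose3D pose, const Joints2D &) { return pose; } // stub IK refinement

Pose3D estimate(const Joints2D &joints2D,
                const std::array<OrientedNets, ORIENTATIONS> &ensemble)
{
    Orientation o = classifyOrientation(joints2D);          // pick tuned networks
    Pose3D upper  = ensemble[o].upper.predict(joints2D);    // upper hierarchy
    Pose3D lower  = ensemble[o].lower.predict(joints2D);    // lower hierarchy
    upper.insert(upper.end(), lower.begin(), lower.end());  // merge hierarchies
    return refineWithIK(upper, joints2D);                   // consistency with limb sizes
}

int main()
{
    std::array<OrientedNets, ORIENTATIONS> ensemble{};
    Pose3D pose = estimate(Joints2D(2 * 17, 0.f), ensemble);
    std::printf("regressed %zu joint rotations\n", pose.size());
    return 0;
}
```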

Inverse Kinematics seems not to be working #81

Closed VisImage closed 2 years ago

VisImage commented 2 years ago

Thank you for implementing the Inverse Kinematics algorithm in MocapNET. IK is a great feature. However, based on tests on our own videos, I am confused about how the Inverse Kinematics algorithm is implemented in your repo. The following clarifications are needed:

1) The README says "As described in the paper, the Hierarchical Coordinate Descent Inverse Kinematics algorithm has various hyper-parameters .....", but I am not able to find such a description in your paper "MocapNET: Ensemble of SNN Encoders for 3D Human Pose Estimation in RGB Images". If it is a paper from your references, it would be great if you could specify the paper and let me know how it relates to your algorithm.

2) The out.bvh files generated using the pre-trained model on different videos (people) contain the same OFFSET values for each JOINT (please refer to the attached files). Does this mean your implementation assumes the same body structure for different people? The same proportion of leg to torso? The out.bvh files were generated using the following commands:

`./MocapNET2LiveWebcamDemo --from shuffle.webm`
`./MocapNET2LiveWebcamDemo --from golf2.mp4`
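
For context, the OFFSET values in question live in the HIERARCHY section of a BVH file. Below is a minimal illustrative excerpt; the joint names and numbers are made up for the example, not taken from the attached files:

```
HIERARCHY
ROOT hip
{
  OFFSET 0.00 0.00 0.00
  CHANNELS 6 Xposition Yposition Zposition Zrotation Xrotation Yrotation
  JOINT abdomen
  {
    OFFSET 0.00 20.60 0.00
    CHANNELS 3 Zrotation Xrotation Yrotation
    End Site
    {
      OFFSET 0.00 24.90 0.00
    }
  }
}
MOTION
Frames: 2
Frame Time: 0.040000
0.0 90.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 90.0 0.0 0.0 5.0 0.0 0.0 2.5 0.0
```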

mocapNET_issue71.zip

Thank you

AmmarkoV commented 2 years ago

1 ) The MocapNET neural networks actually perform the IK as part of the neural network regression. However, in our ICPR20 paper (the paper you are asking about), the HCD IK algorithm is a secondary generative IK module designed to fine-tune the neural network output, correct noise, provide some degree of personalization, and reduce the "black-box" nature of the NN, which sometimes might not work correctly. For the actual C implementation of the HCD IK you can look here.
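
For readers who don't want to dig through the C sources, here is a minimal, self-contained sketch of per-joint coordinate descent over a kinematic hierarchy, in the spirit of the HCD module. The function names, the toy error term, and the step schedule are illustrative assumptions, not the repository's implementation:

```cpp
// Minimal coordinate-descent IK sketch; illustrative only, not the HCD sources.
#include <cstdio>
#include <functional>
#include <vector>

using Pose  = std::vector<float>; // one value per degree of freedom
using Chain = std::vector<int>;   // DoF indices, ordered root-first

// Greedily walk each DoF up/down while the error keeps dropping, visiting
// chains root-first so parent joints settle before their children.
float hierarchicalCoordinateDescent(Pose &pose,
                                    const std::vector<Chain> &hierarchy,
                                    const std::function<float(const Pose &)> &error,
                                    int iterations, float step)
{
    float best = error(pose);
    for (int it = 0; it < iterations; ++it)
    {
        for (const Chain &chain : hierarchy)
            for (int dof : chain)
                for (float delta : {+step, -step})
                    for (;;)
                    {
                        pose[dof] += delta;
                        float candidate = error(pose);
                        if (candidate < best) { best = candidate; }
                        else { pose[dof] -= delta; break; } // revert and stop
                    }
        step *= 0.5f; // refine with smaller moves as the solution settles
    }
    return best;
}

int main()
{
    // Toy stand-in for the 2D reprojection error the real module minimizes:
    // drive two "joint angles" toward 30 and -10 degrees.
    Pose pose = {0.f, 0.f};
    std::vector<Chain> hierarchy = {{0}, {1}};
    auto error = [](const Pose &p)
    { return (p[0] - 30.f) * (p[0] - 30.f) + (p[1] + 10.f) * (p[1] + 10.f); };

    float e = hierarchicalCoordinateDescent(pose, hierarchy, error, 20, 8.f);
    std::printf("pose = (%.3f, %.3f), error = %.6f\n", pose[0], pose[1], e);
    return 0;
}
```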

2 ) MocapNET just tries to derive the relative angles and body motions depicted by the 2D joint input; it currently does not directly regress or deal with the problem of the shape of a person. To do that, the method would need to work with the RGB data directly, so this is out of the scope of this work. However, given the HCD module, if a user wants to deal with a specific body size they can supply an alternate joint dimension configuration using this function, for example. Although this will get acknowledged at runtime by the HCD IK module, I am afraid the BVH dumping (a.k.a. the writeBVHFile call) just copy-pastes the default OFFSET values, as seen here. I understand this is an important omission from an application standpoint. Unfortunately, since this repository was and is more of a tech demo accompanying the research papers rather than an actual complete retail application, this is just one of the cut corners :) since setting the correct offsets doesn't have a lot of research impact (most people want the 3D points anyway, to compare the method, etc.). In any case, I am noting this and will keep this issue open until I find the time to implement a BVH dumper that acknowledges the different body sizes given as command-line parameters.
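
Until such a dumper exists, one possible workaround (a sketch, not part of the repository) is to post-process the dumped out.bvh and rescale its OFFSET lines. A single global scale preserves the default proportions; true per-person proportions would need a per-joint table:

```cpp
// Sketch of a BVH OFFSET rescaler; a workaround, not the repo's tooling.
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>

int main(int argc, char **argv)
{
    if (argc != 4)
    {
        std::cerr << "usage: " << argv[0] << " in.bvh out.bvh scale\n";
        return 1;
    }
    const float scale = std::stof(argv[3]);
    std::ifstream in(argv[1]);
    std::ofstream out(argv[2]);
    std::string line;
    while (std::getline(in, line))
    {
        std::istringstream tokens(line);
        std::string keyword;
        tokens >> keyword;
        if (keyword == "OFFSET") // rewrite the three offset components
        {
            float x, y, z;
            tokens >> x >> y >> z;
            std::string indent = line.substr(0, line.find("OFFSET"));
            out << indent << "OFFSET " << x * scale << " "
                << y * scale << " " << z * scale << "\n";
        }
        else out << line << "\n"; // HIERARCHY structure and MOTION pass through
    }
    return 0;
}
```

For example, `./scale_offsets out.bvh scaled.bvh 1.08` would enlarge every limb by 8% (the tool name is hypothetical).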

VisImage commented 2 years ago

Thank you for the response. However, as a tech demo accompanying your research paper, the OFFSET values are important. Preset default OFFSET values may ONLY provide "good" key-point detection results when using a specific dataset where people have similar body (skeleton) structures. Have you tested your implementation using different datasets?

AmmarkoV commented 2 years ago

The 2020 version of the method ( https://github.com/FORTH-ModelBasedTracker/MocapNET/tree/mnet3 ), which is the most recent, achieves an average 3D error of 99 mm on the Human3.6M dataset using this generic skeleton:

(screenshot: the generic skeleton used for evaluation)

It also achieves 25.37 mm error on the RHD hand tracking task and 9.93 mm on the STB hand tracking task using the skeleton's generic hands:

(screenshot: the skeleton's generic hands)

Yes, the current OFFSET values resemble an "average" human, and the neural network has been trained to react to these average "dimensions" as well as to a camera closely resembling the GoPro Hero4 intrinsics. The more the input departs from these settings, the "worse" the neural network solutions become, although due to the very high dimensionality of the 2D->3D problem, even different limb dimensions that follow similar ratios can be handled, since they can resemble scale changes, for example. The Hierarchical Coordinate Descent IK module described can dynamically try to adapt to different OFFSET skeletons and different camera intrinsics, and this is one of the reasons for its existence (to enable some degree of personalization). However, absolute 3D accuracy is not the method's strong suit; its target and concept is returning the relative angle configurations of the joints in real time to enable interactive applications.

So the method works best not only with specific body structures but also with specific camera systems with specific aspect ratios: http://ammar.gr/mocapnet/mnet3/ayeon.webm http://ammar.gr/mocapnet/mnet3/2020_12_14_sven_mode1hands_2dJoints_v1.4.csv_lastRun3DHiRes.mp4

You can train networks to match a specific camera, different skeletons for male and female persons, etc., and load the correct one as part of a bigger application; however, this is out of the scope of this demo repository, and the demo nature of this repository is also the reason why the training code is not included.

VisImage commented 2 years ago

Thank you for the clarifications. However, your approach will not be useful if a model is needed for each different skeleton. Each individual human being is unique and has a different skeleton (structure), so your approach would need a model for each person. Furthermore, human beings are dynamic and they grow, so your approach may also need a model for each period of time of the same person. BTW, there may be no need to train a new model for a different camera intrinsic setting: each video frame can be converted using a different intrinsic setting, as long as the corresponding quality degradation of the image is taken into consideration.
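
For what it's worth, the per-frame conversion described above can also be applied directly to the 2D joints rather than to whole frames: under a pinhole model (distortion ignored), a joint observed by one camera can be re-expressed under another camera's intrinsics. A minimal sketch with illustrative values; none of this is the repository's API:

```cpp
// Hypothetical sketch: remap a 2D joint observed under one pinhole camera
// so it appears as seen by a camera with different intrinsics (same pose,
// lens distortion ignored). Not the repository's API.
#include <cstdio>

struct Intrinsics { float fx, fy, cx, cy; };

// Pixel (u,v) under `src` -> pixel under `dst`:
// normalize with src, then re-project with dst.
void remap(const Intrinsics &src, const Intrinsics &dst,
           float u, float v, float &uOut, float &vOut)
{
    float xn = (u - src.cx) / src.fx; // normalized image coordinates
    float yn = (v - src.cy) / src.fy;
    uOut = dst.fx * xn + dst.cx;
    vOut = dst.fy * yn + dst.cy;
}

int main()
{
    Intrinsics webcam = {600.f, 600.f, 320.f, 240.f}; // illustrative values
    Intrinsics gopro  = {820.f, 820.f, 960.f, 540.f}; // illustrative values
    float u, v;
    remap(webcam, gopro, 400.f, 300.f, u, v);
    std::printf("remapped joint: (%.1f, %.1f)\n", u, v);
    return 0;
}
```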

AmmarkoV commented 2 years ago

There are works that can handle RGB -> BMI ( https://ieeexplore.ieee.org/document/8844872 ), and given enough training samples you could actually create a module that emits a skeleton/OFFSET collection and selects a good NN "close" to a specific skeleton profile. However, this is a huge amount of work, and the important part, and the reason for MocapNET, is how to learn to regress 3D angles from 2D joints (therefore learning to do IK and how high-dimensional structures appear in a lower-dimensional projection).

(images: anthropometric reference figures illustrating variation in body proportions)
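
The model-selection idea mentioned above (picking a NN "close" to a specific skeleton profile) could look like the following; a minimal sketch with made-up profile names and limb lengths:

```cpp
// Hypothetical sketch: pick the pre-trained network whose skeleton profile
// is closest (squared L2) to a measured one; names/values are made up.
#include <cstdio>
#include <vector>

struct Profile { const char *name; std::vector<float> limbLengthsCm; };

std::size_t closestProfile(const std::vector<Profile> &models,
                           const std::vector<float> &measured)
{
    std::size_t bestIdx = 0;
    float bestDist = 1e30f;
    for (std::size_t i = 0; i < models.size(); ++i)
    {
        float d = 0.f;
        for (std::size_t j = 0; j < measured.size(); ++j)
        {
            float diff = models[i].limbLengthsCm[j] - measured[j];
            d += diff * diff;
        }
        if (d < bestDist) { bestDist = d; bestIdx = i; }
    }
    return bestIdx;
}

int main()
{
    // Illustrative profiles: {torso, upper leg, lower leg} lengths in cm.
    std::vector<Profile> models = {
        {"kid",   {35.f, 30.f, 28.f}},
        {"adult", {55.f, 45.f, 42.f}},
        {"tall",  {62.f, 52.f, 48.f}},
    };
    std::vector<float> measured = {57.f, 47.f, 43.f};
    std::printf("closest model: %s\n",
                models[closestProfile(models, measured)].name);
    return 0;
}
```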

VisImage commented 2 years ago

I am impressed by your description of anthropometry. However, there is another way to put pose detection and anthropometry into perspective. It is an important task in engineering anthropometry to study the human skeleton, where the body landmarks (key-points) are used as input for such a study. Your figure above is a result of such a study. Ideally, a key-point detection algorithm (like yours here) could be used in an anthropometric study. In that case, I do not think it is proper to make anthropometric assumptions in the key-point detection algorithm, such as "ranging from kid/teen/adult/tall/short.."