FORTH-ModelBasedTracker / MocapNET

We present MocapNET, a real-time method that estimates the 3D human pose directly in the popular Biovision Hierarchy (BVH) format, given estimations of the 2D body joints originating from monocular color images. Our contributions include: (a) a novel and compact 2D pose NSRM representation; (b) a human body orientation classifier and an ensemble of orientation-tuned neural networks that regress the 3D human pose while also allowing for the decomposition of the body into an upper and a lower kinematic hierarchy, which permits the recovery of the human pose even in the case of significant occlusions; (c) an efficient Inverse Kinematics solver that refines the neural-network-based solution, providing 3D human pose estimations that are consistent with the limb sizes of a target person (if known). All of the above yield a 33% accuracy improvement on the Human 3.6 Million (H3.6M) dataset compared to the baseline method (MocapNET) while maintaining real-time performance.
https://www.youtube.com/watch?v=Jgz1MRq-I-k
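
As a quick orientation for readers, the flow the abstract describes (orientation classification, an orientation-tuned ensemble split into upper and lower body, and IK refinement) can be sketched in a few lines. This is a minimal sketch; every type and function name is a hypothetical stand-in, not MocapNET's actual API:

```cpp
// Hypothetical sketch of the pipeline described above; names are
// illustrative stand-ins, not the repository's actual API.
#include <array>
#include <cstdio>
#include <vector>

using Joints2D = std::vector<float>; // flattened 2D joint estimations
using Pose3D   = std::vector<float>; // BVH-style joint rotation values

enum Orientation { FRONT = 0, BACK, LEFT, RIGHT, ORIENTATIONS };

// Each orientation gets its own pair of networks, one per kinematic
// hierarchy (upper/lower body), so an occluded half fails in isolation.
struct HalfBodyNet  { Pose3D predict(const Joints2D &) const { return Pose3D(10, 0.f); } };
struct OrientedNets { HalfBodyNet upper, lower; };

Orientation classifyOrientation(const Joints2D &) { return FRONT; } // stub classifier
Pose3D refineWithIK(Pose3D pose, const Joints2D &) { return pose; } // stub IK refinement

Pose3D estimate(const Joints2D &joints2D,
                const std::array<OrientedNets, ORIENTATIONS> &ensemble)
{
    Orientation o = classifyOrientation(joints2D);          // pick tuned networks
    Pose3D upper  = ensemble[o].upper.predict(joints2D);    // upper hierarchy
    Pose3D lower  = ensemble[o].lower.predict(joints2D);    // lower hierarchy
    upper.insert(upper.end(), lower.begin(), lower.end());  // merge hierarchies
    return refineWithIK(upper, joints2D);                   // consistency with limb sizes
}

int main()
{
    std::array<OrientedNets, ORIENTATIONS> ensemble{};
    Pose3D pose = estimate(Joints2D(2 * 17, 0.f), ensemble);
    std::printf("regressed %zu joint rotations\n", pose.size());
    return 0;
}
```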

Inverse Kinematics seems not to be working #81

Closed VisImage closed 2 years ago

VisImage commented 2 years ago

Thank you for implementing the Inverse Kinematics algorithm in MocapNET. IK is a great feature. However, based on tests on our own videos, I am confused about how the Inverse Kinematics algorithm is implemented in your repo. The following clarifications are needed:

1) The README says "As described in the paper, the Hierarchical Coordinate Descent Inverse Kinematics algorithm has various hyper-parameters .....", but I am not able to find such a description in your paper "MocapNET: Ensemble of SNN Encoders for 3D Human Pose Estimation in RGB Images". If it is a paper from your references, it would be great if you could specify the paper and let me know how it relates to your algorithm.

2) The out.bvh files generated using the pre-trained model on different videos (people) contain the same OFFSET values for each JOINT (please refer to the attached files). Does this mean your implementation assumes the same body structure for different people? The same proportion of leg to torso? The out.bvh files were generated using the following commands:

`./MocapNET2LiveWebcamDemo --from shuffle.webm`
`./MocapNET2LiveWebcamDemo --from golf2.mp4`
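
For context, the OFFSET values in question live in the HIERARCHY section of a BVH file. Below is a minimal illustrative excerpt; the joint names and numbers are made up for the example, not taken from the attached files:

```
HIERARCHY
ROOT hip
{
  OFFSET 0.00 0.00 0.00
  CHANNELS 6 Xposition Yposition Zposition Zrotation Xrotation Yrotation
  JOINT abdomen
  {
    OFFSET 0.00 20.60 0.00
    CHANNELS 3 Zrotation Xrotation Yrotation
    End Site
    {
      OFFSET 0.00 24.90 0.00
    }
  }
}
MOTION
Frames: 2
Frame Time: 0.040000
0.0 90.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 90.0 0.0 0.0 5.0 0.0 0.0 2.5 0.0
```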

mocapNET_issue71.zip

Thank you

AmmarkoV commented 2 years ago

1 ) The MocapNET neural networks actually perform the IK as part of the neural network regression. However, in our ICPR20 paper (the paper you are asking about), the HCD IK algorithm is a secondary generative IK module designed to fine-tune the neural network output, correct noise, provide some degree of personalization, and reduce the "black-box" nature of the NN, which sometimes might not work correctly. For the actual C implementation of the HCD IK you can look here.
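
For readers who don't want to dig through the C sources, here is a minimal, self-contained sketch of per-joint coordinate descent over a kinematic hierarchy, in the spirit of the HCD module. The function names, the toy error term, and the step schedule are illustrative assumptions, not the repository's implementation:

```cpp
// Minimal coordinate-descent IK sketch; illustrative only, not the HCD sources.
#include <cstdio>
#include <functional>
#include <vector>

using Pose  = std::vector<float>; // one value per degree of freedom
using Chain = std::vector<int>;   // DoF indices, ordered root-first

// Greedily walk each DoF up/down while the error keeps dropping, visiting
// chains root-first so parent joints settle before their children.
float hierarchicalCoordinateDescent(Pose &pose,
                                    const std::vector<Chain> &hierarchy,
                                    const std::function<float(const Pose &)> &error,
                                    int iterations, float step)
{
    float best = error(pose);
    for (int it = 0; it < iterations; ++it)
    {
        for (const Chain &chain : hierarchy)
            for (int dof : chain)
                for (float delta : {+step, -step})
                    for (;;)
                    {
                        pose[dof] += delta;
                        float candidate = error(pose);
                        if (candidate < best) { best = candidate; }
                        else { pose[dof] -= delta; break; } // revert and stop
                    }
        step *= 0.5f; // refine with smaller moves as the solution settles
    }
    return best;
}

int main()
{
    // Toy stand-in for the 2D reprojection error the real module minimizes:
    // drive two "joint angles" toward 30 and -10 degrees.
    Pose pose = {0.f, 0.f};
    std::vector<Chain> hierarchy = {{0}, {1}};
    auto error = [](const Pose &p)
    { return (p[0] - 30.f) * (p[0] - 30.f) + (p[1] + 10.f) * (p[1] + 10.f); };

    float e = hierarchicalCoordinateDescent(pose, hierarchy, error, 20, 8.f);
    std::printf("pose = (%.3f, %.3f), error = %.6f\n", pose[0], pose[1], e);
    return 0;
}
```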

2 ) MocapNET just tries to derive the relative angles and body motions depicted by the 2D joint input; it currently does not directly regress or deal with the problem of the shape of a person. To do that, the method would need to work with the RGB data directly, so this is out of the scope of this work. However, given the HCD module, if a user wants to deal with a specific body size they can supply an alternate joint dimension configuration using this function, for example. Although this will get acknowledged at runtime by the HCD IK module, I am afraid the BVH dumping (a.k.a. the writeBVHFile call) just copy-pastes the default OFFSET values, as seen here. I understand this is an important omission from an application standpoint. Unfortunately, since this repository was and is more of a tech demo accompanying the research papers rather than an actual complete retail application, this is just one of the cut corners :) since setting the correct offsets doesn't have a lot of research impact (most people want the 3D points anyway, to compare the method, etc.). In any case, I am noting this and will keep this issue open until I find the time to implement a BVH dumper that acknowledges the different body sizes given as command-line parameters.
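
Until such a dumper exists, one possible workaround (a sketch, not part of the repository) is to post-process the dumped out.bvh and rescale its OFFSET lines. A single global scale preserves the default proportions; true per-person proportions would need a per-joint table:

```cpp
// Sketch of a BVH OFFSET rescaler; a workaround, not the repo's tooling.
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>

int main(int argc, char **argv)
{
    if (argc != 4)
    {
        std::cerr << "usage: " << argv[0] << " in.bvh out.bvh scale\n";
        return 1;
    }
    const float scale = std::stof(argv[3]);
    std::ifstream in(argv[1]);
    std::ofstream out(argv[2]);
    std::string line;
    while (std::getline(in, line))
    {
        std::istringstream tokens(line);
        std::string keyword;
        tokens >> keyword;
        if (keyword == "OFFSET") // rewrite the three offset components
        {
            float x, y, z;
            tokens >> x >> y >> z;
            std::string indent = line.substr(0, line.find("OFFSET"));
            out << indent << "OFFSET " << x * scale << " "
                << y * scale << " " << z * scale << "\n";
        }
        else out << line << "\n"; // HIERARCHY structure and MOTION pass through
    }
    return 0;
}
```

For example, `./scale_offsets out.bvh scaled.bvh 1.08` would enlarge every limb by 8% (the tool name is hypothetical).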

VisImage commented 2 years ago

Thank you for the response. However, as a tech demo accompanying your research paper, the OFFSET values are important. Preset default OFFSET values may ONLY provide "good" key-point detection results when using a specific dataset where people have similar body (skeleton) structures. Have you tested your implementation using different datasets?

AmmarkoV commented 2 years ago

The 2020 version of the method ( https://github.com/FORTH-ModelBasedTracker/MocapNET/tree/mnet3 ), which is the most recent, achieves an average 3D error of 99 mm on the Human3.6M dataset using this generic skeleton:

(screenshot: the generic skeleton used for evaluation)

It also achieves 25.37 mm error on the RHD hand tracking task and 9.93 mm on the STB hand tracking task using the skeleton's generic hands:

(screenshot: the skeleton's generic hands)

Yes, the current OFFSET values resemble an "average" human, and the neural network has been trained to react to these average "dimensions" as well as to a camera closely resembling the GoPro Hero4 intrinsics. The more the input departs from these settings, the "worse" the neural network solutions become, although due to the very high dimensionality of the 2D->3D problem, even different limb dimensions that follow similar ratios can be handled, since they can resemble scale changes, for example. The Hierarchical Coordinate Descent IK module described can dynamically try to adapt to different OFFSET skeletons and different camera intrinsics, and this is one of the reasons for its existence (to enable some degree of personalization). However, absolute 3D accuracy is not the method's strong suit; its target and concept is returning the relative angle configurations of the joints in real time to enable interactive applications.

So the method works best not only with specific body structures but also with specific camera systems with specific aspect ratios: http://ammar.gr/mocapnet/mnet3/ayeon.webm http://ammar.gr/mocapnet/mnet3/2020_12_14_sven_mode1hands_2dJoints_v1.4.csv_lastRun3DHiRes.mp4

You can train networks to match a specific camera, different skeletons for male and female persons, etc., and load the correct one as part of a bigger application; however, this is out of the scope of this demo repository, and the demo nature of this repository is also the reason why the training code is not included.

VisImage commented 2 years ago

Thank you for the clarifications. However, your approach will not be useful if a model is needed for each different skeleton. Each individual human being is unique and has a different skeleton (structure), so your approach would need a model for each person. Furthermore, human beings are dynamic and they grow, so your approach may also need a model for each period of time of the same person. BTW, there may be no need to train a new model for a different camera intrinsic setting: each video frame can be converted using a different intrinsic setting, as long as the corresponding quality degradation of the image is taken into consideration.
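
For what it's worth, the per-frame conversion described above can also be applied directly to the 2D joints rather than to whole frames: under a pinhole model (distortion ignored), a joint observed by one camera can be re-expressed under another camera's intrinsics. A minimal sketch with illustrative values; none of this is the repository's API:

```cpp
// Hypothetical sketch: remap a 2D joint observed under one pinhole camera
// so it appears as seen by a camera with different intrinsics (same pose,
// lens distortion ignored). Not the repository's API.
#include <cstdio>

struct Intrinsics { float fx, fy, cx, cy; };

// Pixel (u,v) under `src` -> pixel under `dst`:
// normalize with src, then re-project with dst.
void remap(const Intrinsics &src, const Intrinsics &dst,
           float u, float v, float &uOut, float &vOut)
{
    float xn = (u - src.cx) / src.fx; // normalized image coordinates
    float yn = (v - src.cy) / src.fy;
    uOut = dst.fx * xn + dst.cx;
    vOut = dst.fy * yn + dst.cy;
}

int main()
{
    Intrinsics webcam = {600.f, 600.f, 320.f, 240.f}; // illustrative values
    Intrinsics gopro  = {820.f, 820.f, 960.f, 540.f}; // illustrative values
    float u, v;
    remap(webcam, gopro, 400.f, 300.f, u, v);
    std::printf("remapped joint: (%.1f, %.1f)\n", u, v);
    return 0;
}
```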

AmmarkoV commented 2 years ago

There are works that can handle RGB -> BMI ( https://ieeexplore.ieee.org/document/8844872 ), and given enough training samples you could actually create a module that emits a skeleton/OFFSET collection and selects a good NN "close" to a specific skeleton profile. However, this is a huge amount of work, and the important part, and the reason for MocapNET, is how to learn to regress 3D angles from 2D joints (therefore learning to do IK and how high-dimensional structures appear in a lower-dimensional projection).

(images: anthropometric reference figures illustrating variation in body proportions)
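
The model-selection idea mentioned above (picking a NN "close" to a specific skeleton profile) could look like the following; a minimal sketch with made-up profile names and limb lengths:

```cpp
// Hypothetical sketch: pick the pre-trained network whose skeleton profile
// is closest (squared L2) to a measured one; names/values are made up.
#include <cstdio>
#include <vector>

struct Profile { const char *name; std::vector<float> limbLengthsCm; };

std::size_t closestProfile(const std::vector<Profile> &models,
                           const std::vector<float> &measured)
{
    std::size_t bestIdx = 0;
    float bestDist = 1e30f;
    for (std::size_t i = 0; i < models.size(); ++i)
    {
        float d = 0.f;
        for (std::size_t j = 0; j < measured.size(); ++j)
        {
            float diff = models[i].limbLengthsCm[j] - measured[j];
            d += diff * diff;
        }
        if (d < bestDist) { bestDist = d; bestIdx = i; }
    }
    return bestIdx;
}

int main()
{
    // Illustrative profiles: {torso, upper leg, lower leg} lengths in cm.
    std::vector<Profile> models = {
        {"kid",   {35.f, 30.f, 28.f}},
        {"adult", {55.f, 45.f, 42.f}},
        {"tall",  {62.f, 52.f, 48.f}},
    };
    std::vector<float> measured = {57.f, 47.f, 43.f};
    std::printf("closest model: %s\n",
                models[closestProfile(models, measured)].name);
    return 0;
}
```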

VisImage commented 2 years ago

I am impressed by your description of anthropometry. However, there is another way to put pose detection and anthropometry into perspective. It is an important task in engineering anthropometry to study the human skeleton, where the body landmarks (key-points) are used as input for such a study. Your figure above is a result of such a study. Ideally, a key-point detection algorithm (like yours here) could be used in an anthropometric study. In that case, I do not think it is proper to make anthropometric assumptions in the key-point detection algorithm, such as "ranging from kid/teen/adult/tall/short.."