FORTH-ModelBasedTracker / MocapNET

We present MocapNET, a real-time method that estimates the 3D human pose directly in the popular Bio Vision Hierarchy (BVH) format, given estimations of the 2D body joints originating from monocular color images. Our contributions include: (a) A novel and compact 2D pose NSRM representation. (b) A human body orientation classifier and an ensemble of orientation-tuned neural networks that regress the 3D human pose by also allowing for the decomposition of the body to an upper and lower kinematic hierarchy. This permits the recovery of the human pose even in the case of significant occlusions. (c) An efficient Inverse Kinematics solver that refines the neural-network-based solution, providing 3D human pose estimations that are consistent with the limb sizes of a target person (if known). All the above yield a 33% accuracy improvement on the Human 3.6 Million (H3.6M) dataset compared to the baseline method (MocapNET) while maintaining real-time performance.
https://www.youtube.com/watch?v=Jgz1MRq-I-k

Can more skeleton bones be supported by MocapNet? #16

Closed · igormisha closed this issue 4 years ago

igormisha commented 4 years ago

I've watched the promo videos showing results obtained with the library here: https://www.youtube.com/watch?v=fH5e-KMBvM0 And I've looked through all the 2D points in your example .csv file here: https://github.com/FORTH-ModelBasedTracker/MocapNET/blob/master/dataset/sample.csv

From what I've seen, the results are inaccurate in at least two body areas: the spine, which is not divided into segments and so doesn't reflect natural body movement, and the feet, which are turned in a different direction compared to the human models.

So the question is: can the result be improved? Can the spine at least be split into, for example, 3 segments, and can the feet movements be tracked more precisely?

I'm an absolute newbie in this area and am just sharing my observations; maybe it has nothing to do with the library, in which case I'd be grateful to hear where the problem actually is.

AmmarkoV commented 4 years ago

Hello! The BVH file used encodes the spine using the chain Hip->Abdomen->Chest->Neck.
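For reference, this is roughly what that chain looks like in BVH hierarchy syntax (a minimal sketch; the OFFSET values below are placeholders, not the actual values from the file MocapNET uses):

```
HIERARCHY
ROOT Hip
{
    OFFSET 0.0 0.0 0.0
    CHANNELS 6 Xposition Yposition Zposition Zrotation Xrotation Yrotation
    JOINT Abdomen
    {
        OFFSET 0.0 10.0 0.0
        CHANNELS 3 Zrotation Xrotation Yrotation
        JOINT Chest
        {
            OFFSET 0.0 10.0 0.0
            CHANNELS 3 Zrotation Xrotation Yrotation
            JOINT Neck
            {
                OFFSET 0.0 12.0 0.0
                CHANNELS 3 Zrotation Xrotation Yrotation
                End Site
                {
                    OFFSET 0.0 5.0 0.0
                }
            }
        }
    }
}
```

Adding more spine segments would mean inserting extra JOINT blocks into this chain, each contributing three more rotation channels that the network would have to regress.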

As you correctly stated, by adding more joints to the spine, as is the case in the Blender/MakeHuman skinned model, you can achieve more natural body movement, since the human spine is actually a dozen or more concatenated joints, and the neural network ensemble formulation we use could easily handle the extra degrees of freedom with a minor performance penalty.

[Screenshot: Blender/MakeHuman skinned model with a multi-joint spine (screen46)]

However, there are two limitations that led to the design decision of reducing the number of joints:

  1. The BVH version of the CMU motion capture data used for training the neural network has this number of joints. If the mocap suit used to record the dataset had recorded extra joints, no matter how many, the method could indeed use them and generalize in this direction. But since the training dataset does not include these joints, there is no ground truth to train the neural network against.

  2. The BODY25 output format of modern 2D neural networks actually has NO 2D joint estimations for the spine! This is because there are no unique landmarks that can be robustly identified in the belly (in contrast to the arms, legs, face, etc.). So a secondary problem is that even if the network had learned to derive 3D rotations for all the spine joints from 2D observations, you would still need the spine to be detected by a 2D detector, which is currently not the case..!

[Figure: OpenPose BODY25 keypoint layout (keypoints_pose_25)]
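To make limitation 2 concrete, here is the standard OpenPose BODY25 index mapping, written out as a C++ enum for illustration: the whole torso is covered by just two keypoints, Neck (1) and MidHip (8), with nothing along the spine in between.

```c++
// Standard OpenPose BODY25 keypoint indices.
// Note: the torso is represented only by Neck (1) and MidHip (8);
// there are no keypoints along the spine itself.
enum Body25Joint {
    Nose      = 0,  Neck      = 1,  RShoulder = 2,  RElbow    = 3,
    RWrist    = 4,  LShoulder = 5,  LElbow    = 6,  LWrist    = 7,
    MidHip    = 8,  RHip      = 9,  RKnee     = 10, RAnkle    = 11,
    LHip      = 12, LKnee     = 13, LAnkle    = 14, REye      = 15,
    LEye      = 16, REar      = 17, LEar      = 18, LBigToe   = 19,
    LSmallToe = 20, LHeel     = 21, RBigToe   = 22, RSmallToe = 23,
    RHeel     = 24
};
```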

So these are the reasons in a nutshell..!

igormisha commented 4 years ago

Ammar, first of all, big thanks for such a thorough explanation of the problem; it has all become clear to me now. Still, I have to say that this inaccuracy in the results is very limiting in terms of possible applications. Maybe it can be improved; I could imagine some additional biomechanical considerations taking into account the distance (from point 1 to point 8 in your picture, for example) and other data.

AmmarkoV commented 4 years ago

The goal of this work is affordable markerless real-time motion capture from RGB sources, in a format highly compatible with existing software. Given that goal, any extra sources of data for the spine could of course improve the quality, but they would ultimately go against the core idea of the project.

Just having a $20 latex suit with red markers, for example, could be enough to get a much better result; however, this automatically excludes a very big range of applications (99.9% of videos are plain RGB without markers) and is not what we are trying to do.

The next version of MocapNET, which will be released in this repository after being peer reviewed and published in a computer vision conference, is better in terms of accuracy, and I believe it does a slightly better job on the spine, mainly because it allows the user to personalize the skeleton dimensions. This means that by also having the constraint of knowing the distance between points 1 and 8, which we observe as 2D joints, we can at least do a better job of fitting points 1 and 8 correctly. The current version of MocapNET published here treats this distance as an unknown, which further complicates things. However, the improved version still relies on the BODY25 2D input and the same BVH file, so it is again subject to the limitations I described.
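To illustrate why knowing that 1->8 distance helps (a hypothetical sketch, not code from this repository): once the neck-to-hip span is measured in 2D and the real torso length of the personalized skeleton is known, their ratio gives a scale cue the fitting stage can exploit, instead of treating scale as yet another unknown.

```c++
#include <cmath>

// Hypothetical sketch: derive a torso-scale cue from the BODY25
// Neck (1) and MidHip (8) keypoints. Not actual MocapNET code.
struct Point2D { float x, y; };

// 2D pixel distance between the observed neck and mid-hip joints.
float torsoPixelLength(const Point2D &neck, const Point2D &midHip)
{
    float dx = neck.x - midHip.x;
    float dy = neck.y - midHip.y;
    return std::sqrt(dx * dx + dy * dy);
}

// With personalized skeleton dimensions the real torso length is known,
// so the ratio constrains the scale of the fitted skeleton.
float scaleCue(float observedPixels, float knownTorsoLengthCm)
{
    return observedPixels / knownTorsoLengthCm; // rough pixels-per-cm cue
}
```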

Moving forward, in order to improve this method I believe it could be fused with something like DensePose to get a richer source of "2D" information, allowing for higher fidelity; however, this would probably have a big performance impact and possibly other side effects.

[Figure: DensePose dense coordinate output (coords)]

In any case, it is always nice to chat with people and hear their perspectives and use cases. I have noted your request for a better spine and will hopefully gradually achieve it! Personally, I would be very happy if someday I were able to get this kind of output from this pose estimator, but you have to walk before you can run :)

igormisha commented 4 years ago

This is very enlightening information for me; I really appreciate it, Ammar! Your goal is really great, and I was looking for exactly what you are trying to achieve! What puzzles me is this: if there is such a simple and cheap method of providing more accurate 2D pose assessments with more points to track, by wearing marked latex, why are there no open databases with such points, so that people like you could train their models on them and not limit yourselves to BODY25 2D with only 2 points for the whole spine? Anyway, I wish you good luck and happy coding!

AmmarkoV commented 4 years ago

To be perfectly honest, I am pretty happy that there are at least some datasets available to the public, like the CMU Mocap dataset (whose BVH conversion I use), that allowed me to train a network in the first place. There are also some new ones that contain hands and faces, although once again the spine is not modeled :) The reason is that motion capture hardware is very expensive to operate: a Vicon system, for example, can cost over $100,000 and requires a warehouse to be set up in, while cheaper solutions like the Nansense IMU-based mocap suit cost more than $8,000. So if you are poor like me, you have to make do with whatever you can find :D Good luck to you as well!