FORTH-ModelBasedTracker / MocapNET

We present MocapNET, a real-time method that estimates the 3D human pose directly in the popular Bio Vision Hierarchy (BVH) format, given estimations of the 2D body joints originating from monocular color images. Our contributions include: (a) A novel and compact 2D pose NSRM representation. (b) A human body orientation classifier and an ensemble of orientation-tuned neural networks that regress the 3D human pose by also allowing for the decomposition of the body to an upper and lower kinematic hierarchy. This permits the recovery of the human pose even in the case of significant occlusions. (c) An efficient Inverse Kinematics solver that refines the neural-network-based solution providing 3D human pose estimations that are consistent with the limb sizes of a target person (if known). All the above yield a 33% accuracy improvement on the Human 3.6 Million (H3.6M) dataset compared to the baseline method (MocapNET) while maintaining real-time performance.
https://www.youtube.com/watch?v=Jgz1MRq-I-k

Extending Pose Estimation Model for 3D Objects: Customization and Challenges #111

Open manojs8473 opened 12 months ago

manojs8473 commented 12 months ago

Hello!

First of all, thank you for delivering this incredible work! I'm interested in customizing the current model to estimate the 3D pose of objects like a baseball bat or tennis racket in the hands of the actor in addition to the 3D pose of the human body, which the model already does successfully. I have a few questions and doubts regarding this task:

Customizing Skeleton Hierarchy: Is it possible to customize the current skeleton hierarchy and add new bones or edges to represent the bat or racket? I assume this would be necessary to include these objects in the pose estimation.

Architectural Changes: What sort of changes will be required in the architecture of the model to accommodate the estimation of 3D pose for objects? Are there any specific layers or components that need to be modified or added?

Training Data Volume: Could you provide insights into the volume of data that the model would require for training to achieve good accuracy in estimating the 3D pose of both the human body and objects like baseball bats and tennis rackets?

Your comments and suggestions on how to approach this customization would be immensely appreciated. Thank you!

AmmarkoV commented 10 months ago

Hello! Thank you for your kind words!

And sorry about the delay in responding; I am currently writing my PhD thesis, and over the last months I have been abroad for almost two months for project meetings, conferences and a secondment in Italy, so I was not logged in to GitHub and did not see the issues. I received the 2FA warning, logged in after some time, and only saw the issue today! :(

A lot of excellent questions! First of all, for object 3D pose you will first need to train an RGB -> 2D heatmap estimator that produces 2D "joint" data for the objects of your choice.

For a tennis racket, for example, 5 points: the handle, the top of the racket, the two sides and its center. For a baseball bat, 3 points: the handle, the top of the bat and its middle, etc. Although there now exist foundation models such as SAM, Mask R-CNNs etc. that would automatically segment the racket, baseball bat etc., you will still need some landmarks to incorporate the objects in the 3D pose solution.
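For example, something as simple as this would pin the landmark layout down for the 2D estimator (the names and edges below are just an example, not an existing MocapNET convention):

```python
# Hypothetical 2D landmark layouts for the extra objects; the names, counts
# and edges are purely illustrative and only mirror the suggestion above.
OBJECT_KEYPOINTS = {
    "tennis_racket": ["handle", "head_top", "head_left", "head_right", "head_center"],
    "baseball_bat":  ["handle", "middle", "top"],
}

# Edges between landmarks, useful for drawing and for sanity-checking detections.
OBJECT_SKELETON = {
    "tennis_racket": [("handle", "head_center"), ("head_top", "head_center"),
                      ("head_left", "head_center"), ("head_right", "head_center")],
    "baseball_bat":  [("handle", "middle"), ("middle", "top")],
}
```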

You can easily extend the BVH file to accommodate extra geometry: if you look at https://github.com/FORTH-ModelBasedTracker/MocapNET/blob/master/dataset/headerWithHeadAndOneMotion.bvh and at the BVH format description in http://www.dcs.shef.ac.uk/intranet/research/public/resmes/CS0111.pdf, I think you can easily extend the BVH armature with such a shape.
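To make this concrete, the extension could look roughly like the fragment below, assuming your header names the hand joint rHand (check headerWithHeadAndOneMotion.bvh for the exact naming); the racketHandle/racketHead joints and the offsets are just placeholders. Remember that every joint that declares CHANNELS also contributes that many values, in hierarchy order, to every line of the MOTION section.

```
    JOINT rHand
    {
      OFFSET 8.0 0.0 0.0
      CHANNELS 3 Zrotation Xrotation Yrotation
      JOINT racketHandle
      {
        OFFSET 0.0 -5.0 0.0
        CHANNELS 3 Zrotation Xrotation Yrotation
        JOINT racketHead
        {
          OFFSET 0.0 -40.0 0.0
          CHANNELS 3 Zrotation Xrotation Yrotation
          End Site
          {
            OFFSET 0.0 -15.0 0.0
          }
        }
      }
    }
```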

In terms of the MocapNET model, you will need to include the new "joints" of the racket/bat in the NSRM matrices. The description of how to build the descriptor is here: http://users.ics.forth.gr/~argyros/mypapers/2021_11_BMVC_Qammaz.pdf . The architecture could remain the same; in my opinion it should scale to one more joint with no problems.
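Just to give an intuition of how the descriptor scales, here is a toy pairwise matrix over 2D joints. This is not the exact NSRM construction (see the paper above for that); it only shows that each extra object joint adds one row and one column:

```python
import numpy as np

def pairwise_angle_matrix(joints_2d):
    """Toy pairwise 2D descriptor (not the real NSRM), for size intuition only."""
    n = len(joints_2d)
    m = np.zeros((n, n), dtype=np.float32)
    for i in range(n):
        for j in range(n):
            dx = joints_2d[j][0] - joints_2d[i][0]
            dy = joints_2d[j][1] - joints_2d[i][1]
            m[i, j] = np.arctan2(dy, dx)  # signed orientation of the i -> j segment
    return m

# Hypothetical normalized 2D coordinates: three body joints + two racket landmarks.
body   = [(0.51, 0.20), (0.50, 0.35), (0.48, 0.55)]
racket = [(0.62, 0.30), (0.70, 0.18)]

print(pairwise_angle_matrix(body).shape)           # (3, 3)
print(pairwise_angle_matrix(body + racket).shape)  # (5, 5)
```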

MocapNET is typically trained on 3M pose samples. Starting from a BVH source like the one I use ( https://drive.google.com/file/d/1Zt-MycqhMylfBUqgmW9sLBclNNxoNGqV/view?usp=drive_link ), you will need to write a program that goes into each BVH file and adds the extra joints for your "tool", be that a racket, hammer, baseball bat etc. You will then have a dataset with enough samples.
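A rough sketch of such a program could look like this; everything here is an assumption to adapt: the parent joint name (rHand), the single toolTip joint with 3 rotation channels, and the 0.0 placeholder values that in practice you would compute from your object annotations.

```python
# Splices an extra "tool" joint into a BVH file: adds a JOINT block under the
# chosen parent in the HIERARCHY and pads every MOTION frame accordingly.
TOOL_FRAGMENT = [
    "      JOINT toolTip\n",
    "      {\n",
    "        OFFSET 0.0 -40.0 0.0\n",
    "        CHANNELS 3 Zrotation Xrotation Yrotation\n",
    "        End Site\n",
    "        {\n",
    "          OFFSET 0.0 -10.0 0.0\n",
    "        }\n",
    "      }\n",
]
EXTRA_CHANNELS = 3


def add_tool_joint(bvh_in, bvh_out, parent="rHand"):
    with open(bvh_in) as f:
        lines = f.readlines()

    out, in_parent, in_motion = [], False, False
    channels_so_far = 0   # running count of motion channels declared so far
    insert_at = None      # motion-value index where the new channels belong

    for line in lines:
        s = line.strip()
        out.append(line)
        if s == "MOTION":
            if insert_at is None:
                raise ValueError("parent joint '%s' not found in %s" % (parent, bvh_in))
            in_motion = True
        elif not in_motion:
            if s.startswith("JOINT") and s.split()[1] == parent:
                in_parent = True
            elif s.startswith("CHANNELS"):
                channels_so_far += int(s.split()[1])
                if in_parent:
                    insert_at = channels_so_far   # new joint's channels follow the parent's
                    out.extend(TOOL_FRAGMENT)     # splice the JOINT block into the hierarchy
                    in_parent = False
        elif s and not s.startswith("Frames:") and not s.startswith("Frame Time:"):
            # A motion frame: splice the new values in at the correct slot.
            values = s.split()
            values[insert_at:insert_at] = ["0.0"] * EXTRA_CHANNELS
            out[-1] = " ".join(values) + "\n"

    with open(bvh_out, "w") as f:
        f.writelines(out)


# add_tool_joint("input.bvh", "input_with_tool.bvh")
```

The reason the new values are spliced into the middle of each frame line instead of appended at the end is that BVH motion values follow the order in which the CHANNELS declarations appear in the HIERARCHY.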

Unfortunately FORTH, which is the license holder for this work, prevents me from sharing the training code for the network; however, I think that with the Python code shared here: https://github.com/FORTH-ModelBasedTracker/MocapNET/tree/mnet4/src/python/mnet4 you should be able to put together a training pipeline of your own.
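To give you a starting point, a bare-bones regressor from a flattened 2D/descriptor input to BVH motion channels could look roughly like this; to be clear, this is not the official MocapNET network or training setup, and every size and hyper-parameter below is a placeholder:

```python
import tensorflow as tf

# Bare-bones sketch only: NOT the official MocapNET architecture or training
# code. Sizes, loss and hyper-parameters are placeholders to adapt.
DESCRIPTOR_SIZE = 512    # flattened 2D joints + descriptor matrices, adjust to your input
OUTPUT_CHANNELS = 132    # number of BVH motion channels you want to regress, adjust as needed

model = tf.keras.Sequential([
    tf.keras.Input(shape=(DESCRIPTOR_SIZE,)),
    tf.keras.layers.Dense(1024, activation="relu"),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(1024, activation="relu"),
    tf.keras.layers.Dense(OUTPUT_CHANNELS, activation="linear"),
])
model.compile(optimizer="adam", loss="mse")

# x_train: (N, DESCRIPTOR_SIZE) descriptors, y_train: (N, OUTPUT_CHANNELS) BVH channels
# model.fit(x_train, y_train, batch_size=256, epochs=50, validation_split=0.1)
```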