FORTH-ModelBasedTracker / MocapNET

We present MocapNET, a real-time method that estimates the 3D human pose directly in the popular Bio Vision Hierarchy (BVH) format, given estimations of the 2D body joints originating from monocular color images. Our contributions include: (a) A novel and compact 2D pose NSRM representation. (b) A human body orientation classifier and an ensemble of orientation-tuned neural networks that regress the 3D human pose by also allowing for the decomposition of the body into an upper and a lower kinematic hierarchy. This permits the recovery of the human pose even in the case of significant occlusions. (c) An efficient Inverse Kinematics solver that refines the neural-network-based solution, providing 3D human pose estimations that are consistent with the limb sizes of a target person (if known). All the above yield a 33% accuracy improvement on the Human 3.6 Million (H3.6M) dataset compared to the baseline method (MocapNET) while maintaining real-time performance.
https://www.youtube.com/watch?v=Jgz1MRq-I-k

A possibility to track multiple people in the same scene #54

Open · iPsych opened this issue 3 years ago

iPsych commented 3 years ago

Hello, the code works amazingly for shuffle.webm and other single-person stimuli, but behaves very strangely when I feed it a multi-person video. Is there any way to extend MocapNET to multi-person tracking, like https://paperswithcode.com/task/multi-person-pose-estimation?

AmmarkoV commented 3 years ago

The weird behavior you are referring to arises from the 2D joint heatmap detection (https://github.com/FORTH-ModelBasedTracker/MocapNET/blob/master/src/JointEstimator2D/jointEstimator2D.cpp#L288), where the code tries to "retrieve" the joints with the strongest heatmap signatures.

If there are multiple people in a scene, the algorithm will try to "connect" body parts belonging to different people (the parts with the highest scores), resulting in incorrect output.
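To make the failure mode concrete, here is a minimal sketch of such a per-heatmap global argmax (my own illustration, not the code at the link above): with two people in frame, every heatmap has two strong peaks, and the winning peak can come from a different body for each joint, so the assembled skeleton mixes people.

```cpp
// Minimal sketch of per-joint heatmap peak picking (illustrative, not
// MocapNET's actual code). A global argmax has no notion of person identity:
// whichever person's peak is stronger "wins" this particular joint.
#include <opencv2/opencv.hpp>

cv::Point strongestJointLocation(const cv::Mat &heatmap /* CV_32FC1, one joint */)
{
    double maxVal = 0.0;
    cv::Point maxLoc;
    cv::minMaxLoc(heatmap, nullptr, &maxVal, nullptr, &maxLoc);
    return maxLoc; // strongest response anywhere in the frame
}
```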

In the older version of MocapNET (MNET1) there used to be a mode (https://github.com/FORTH-ModelBasedTracker/MocapNET/blob/mnet1/src/MocapNET1/MocapNETLiveWebcamDemo/mocapNETLiveDemo.cpp#L838) where, by running ./MocapNETLiveWebcamDemo --rectangle X Y WIDTH HEIGHT, you could erase a part of the image so that it would get ignored. This, however, was a crude workaround, and it got removed in the next version. The relevant snippet was:

```cpp
//Some datasets have persons that appear in parts of the image, we might want to cover them using a rectangle
//We do this before adding any borders or otherwise change of the ROI of the image, however we do this
//after possible frame skips for the obviously increased performance..
if (coveringRectangle)
{
    cv::Point pt1(coveringRectangleX, coveringRectangleY);
    cv::Point pt2(coveringRectangleX + coveringRectangleWidth, coveringRectangleY + coveringRectangleHeight);
    cv::rectangle(frame, pt1, pt2, cv::Scalar(0,0,0), -1, 8, 0);
}
```

If you think you would find this useful, I could reinstate it.

That being said, the second thing one can do is use OpenPose with the --number_people_max 1 flag; this way OpenPose will just pick one skeleton and sidestep the issue. OpenPose uses Part Affinity Fields (PAFs) that constrain joints to be connected on the same person, and it has provisions to correctly separate people in a scene: https://github.com/FORTH-ModelBasedTracker/MocapNET/blob/master/scripts/processDatasetWithOpenpose.sh#L23
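For intuition, here is a rough sketch of the association idea behind PAFs (my own illustration, with hypothetical pafX/pafY inputs, not OpenPose's actual implementation): a candidate limb between two joint detections is scored by how well the field vectors sampled along the segment align with the limb direction, so a "limb" spanning two different people scores poorly.

```cpp
// Illustrative PAF limb scoring. pafX/pafY are assumed to be the two field
// channels (CV_32FC1) for one limb type. Limbs on the same person align with
// the field and score high; cross-person connections do not.
#include <opencv2/opencv.hpp>
#include <cmath>

float limbScore(const cv::Mat &pafX, const cv::Mat &pafY,
                cv::Point2f jointA, cv::Point2f jointB, int samples = 10)
{
    cv::Point2f d = jointB - jointA;
    float len = std::sqrt(d.x * d.x + d.y * d.y);
    if (len < 1e-6f || samples < 2) return 0.0f;
    d *= 1.0f / len; // unit direction of the candidate limb

    float score = 0.0f;
    for (int s = 0; s < samples; ++s)
    {
        cv::Point2f p = jointA + (jointB - jointA) * ((float)s / (float)(samples - 1));
        int x = cv::borderInterpolate((int)p.x, pafX.cols, cv::BORDER_REPLICATE);
        int y = cv::borderInterpolate((int)p.y, pafX.rows, cv::BORDER_REPLICATE);
        // Dot product of the sampled field vector with the limb direction
        score += pafX.at<float>(y, x) * d.x + pafY.at<float>(y, x) * d.y;
    }
    return score / (float)samples;
}
```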

A proper solution for the live webcam demo would be to incorporate a detector like Darknet/YOLO (https://github.com/AlexeyAB/darknet), run it first on the incoming OpenCV frame, retrieve the bounding boxes of the people in the image (as seen here: https://www.youtube.com/watch?v=saDipJR14Lc#t=23m), and then run the MocapNET pipeline on each of the retrieved rectangles.
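A minimal sketch of that plumbing, assuming OpenCV's DNN module with a Darknet YOLO model (forEachDetectedPerson and estimateJointsAndRunMocapNET are hypothetical names, not part of MocapNET):

```cpp
// Illustrative sketch: detect people with a YOLO model via cv::dnn, then run
// the existing single-person pipeline once per person crop. NMS is omitted
// for brevity; a real version would use cv::dnn::NMSBoxes on the boxes first.
#include <opencv2/opencv.hpp>
#include <opencv2/dnn.hpp>
#include <functional>
#include <vector>

void forEachDetectedPerson(cv::Mat &frame, cv::dnn::Net &yolo,
                           const std::function<void(const cv::Mat &)> &estimateJointsAndRunMocapNET)
{
    // YOLO expects a square, 0..1 normalized, RGB blob
    cv::Mat blob = cv::dnn::blobFromImage(frame, 1.0 / 255.0, cv::Size(416, 416),
                                          cv::Scalar(), /*swapRB=*/true, /*crop=*/false);
    yolo.setInput(blob);

    std::vector<cv::Mat> outs;
    yolo.forward(outs, yolo.getUnconnectedOutLayersNames());

    for (const cv::Mat &out : outs)
    {
        // Each row: [cx, cy, w, h, objectness, class scores...] in relative coords
        for (int i = 0; i < out.rows; ++i)
        {
            const float *row = out.ptr<float>(i);
            float confidence = row[4] * row[5]; // objectness * "person" score (COCO class 0)
            if (confidence < 0.5f) continue;

            int w = (int)(row[2] * frame.cols), h = (int)(row[3] * frame.rows);
            int x = (int)(row[0] * frame.cols) - w / 2;
            int y = (int)(row[1] * frame.rows) - h / 2;
            cv::Rect box = cv::Rect(x, y, w, h) & cv::Rect(0, 0, frame.cols, frame.rows);
            if (box.area() == 0) continue;

            // One full 2D-joints + MocapNET run per detected person
            estimateJointsAndRunMocapNET(frame(box).clone());
        }
    }
}
```

The network would be loaded once, e.g. with cv::dnn::readNetFromDarknet("yolov4.cfg", "yolov4.weights") (file names are placeholders), and the callback would wrap the current single-person 2D joint estimator plus MocapNET.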

This will work, but it will also degrade the framerate linearly with the number of people present in the scene (since the neural network will have to be executed once for each of them). You would then also face the additional problem of person re-identification: with multiple BVH file outputs, you need to keep track of which skeleton belongs to which BVH file and update each of them correctly.
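The re-identification bookkeeping could start as simple greedy IoU matching between consecutive frames; here is a minimal sketch (my own illustration, the Track struct is hypothetical) in which each stable ID would own one BVH output file:

```cpp
// Illustrative greedy IoU tracker: keep a stable integer ID per person across
// frames so the caller can append each person's pose to their own BVH file.
// Real re-identification (appearance cues, occlusion handling) is much harder.
#include <opencv2/core.hpp>
#include <vector>

struct Track { int id; cv::Rect box; };

static float iou(const cv::Rect &a, const cv::Rect &b)
{
    float inter = (float)(a & b).area();
    float uni   = (float)a.area() + (float)b.area() - inter;
    return (uni > 0.0f) ? inter / uni : 0.0f;
}

// Match each new detection to the best unclaimed previous track; unmatched
// detections start a new track (and hence a new BVH output stream).
std::vector<Track> updateTracks(const std::vector<Track> &previous,
                                const std::vector<cv::Rect> &detections,
                                int &nextID)
{
    std::vector<Track> updated;
    std::vector<bool> used(previous.size(), false);
    for (const cv::Rect &det : detections)
    {
        int best = -1;
        float bestIoU = 0.3f; // minimum overlap required to keep an identity
        for (size_t t = 0; t < previous.size(); ++t)
        {
            float overlap = iou(previous[t].box, det);
            if (!used[t] && overlap > bestIoU) { bestIoU = overlap; best = (int)t; }
        }
        if (best >= 0) { used[best] = true; updated.push_back({previous[best].id, det}); }
        else           { updated.push_back({nextID++, det}); }
    }
    return updated;
}
```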

So, that being said, adding all this complexity to the project is overkill, and it doesn't have much novelty or research interest, which is why it has been skipped!

I think at this point the best thing to do is to mask out the parts of the scene you don't want as a workaround (or just use OpenPose as the 2D engine).

I hope I did a good job explaining the issue; looking forward to your input.

Ammar