dynamic-X-LAB / PhotoPoster

To support and advance research in portrait animation, we are excited to launch PhotoPoster, an open project for pose-driven image generation.

https://photo-poster.github.io/

224 stars, 20 forks

Why use three models to detect body, hands and feet separately? #4

Status: Closed (loliq closed this issue 2 months ago)

loliq commented 3 months ago

Hello, thank you for your work. I am curious why three separate models are used to detect the body, hands, and feet. Why not just use the DWPose whole-body model?

Looking forward to your reply, thanks.

dynamic-X-LAB commented 2 months ago

We used several models to make the pose points sufficiently precise and robust, primarily to guarantee the accuracy of the training data. For the body, we used RTMPose-x, which has the highest Whole AP (Average Precision) for body poses but does not include hand points. For the hands, we used RTMW-x, which provides fine-grained hand keypoints, although its foot points are often significantly off. For the feet, we used the more robust DWPose. These more precise pose points led to a noticeable improvement in model training. If you only need prediction (inference), you can simply use DWPose on its own.
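As an illustration only (this is not the project's actual code), the per-part selection described above could be sketched as follows. The sketch assumes that all three models were run on the same person in the same image, that each keypoint is an `(x, y, score)` tuple in shared pixel coordinates, and that the whole-body outputs follow the 133-point COCO-WholeBody ordering (body 0-16, feet 17-22, face 23-90, hands 91-132). The function name `merge_keypoints` is hypothetical, and taking the face points from RTMW-x is an assumption, since the thread does not say which model supplies them.

```python
# Index ranges in the 133-point COCO-WholeBody ordering.
BODY = slice(0, 17)     # 17 COCO body keypoints
FEET = slice(17, 23)    # 6 foot keypoints
FACE = slice(23, 91)    # 68 face keypoints
HANDS = slice(91, 133)  # 21 left-hand + 21 right-hand keypoints


def merge_keypoints(body_kpts, rtmw_kpts, dwpose_kpts):
    """Combine per-part predictions into one 133-point skeleton.

    body_kpts:   17 (x, y, score) tuples from the body-only model
    rtmw_kpts:   133 tuples from the whole-body model used for hands
    dwpose_kpts: 133 tuples from the whole-body model used for feet
    """
    if len(body_kpts) != 17:
        raise ValueError("expected 17 body keypoints")
    if len(rtmw_kpts) != 133 or len(dwpose_kpts) != 133:
        raise ValueError("expected 133 whole-body keypoints")

    merged = [None] * 133
    merged[BODY] = list(body_kpts)     # body from the body-only model
    merged[FEET] = dwpose_kpts[FEET]   # feet from DWPose
    merged[FACE] = rtmw_kpts[FACE]     # assumption: face from RTMW-x
    merged[HANDS] = rtmw_kpts[HANDS]   # hands from RTMW-x
    return merged


# Dummy inputs that make the origin of each point easy to see.
body = [(1.0, 1.0, 0.9)] * 17
rtmw = [(2.0, 2.0, 0.8)] * 133
dwpose = [(3.0, 3.0, 0.7)] * 133
merged = merge_keypoints(body, rtmw, dwpose)
```

For inference only, where a single model is enough, the merge step would be skipped and the DWPose output used directly.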

Bin-sam commented 2 months ago

Our training video data is complex, and the DWPose whole-body model performed poorly on it. For the body, we chose RTMPose-x, which had the highest AP on the 17 COCO-defined keypoints; these do not include hand or foot points. For hands and feet, we used the whole-body RTMW-x model, but its foot keypoint detection was weak. Because foot points have only a minimal impact, we switched to the lightweight DWPose model for foot detection. In short, we selected each model for its specific strengths.

loliq commented 2 months ago

Got it, thanks for replying.

loliq commented 2 months ago

I see, thanks for replying.