caizhongang / SMPLer-X

Official Code for "SMPLer-X: Scaling Up Expressive Human Pose and Shape Estimation"
https://caizhongang.github.io/projects/SMPLer-X/

[Question] Camera Translation #64

Closed jameskuma closed 2 months ago

jameskuma commented 3 months ago

Thank you for sharing this work!

I see that SMPLer-X estimates the camera translation with this function:

# 2. Body Regressor
body_joint_hm, body_joint_img = self.body_position_net(img_feat)
root_pose, body_pose, shape, cam_param, = self.body_regressor(body_pose_token, shape_token, cam_token, body_joint_img.detach())
root_pose = rot6d_to_axis_angle(root_pose)
body_pose = rot6d_to_axis_angle(body_pose.reshape(-1, 6)).reshape(body_pose.shape[0], -1)  # (N, J_R*3)
cam_trans = self.get_camera_trans(cam_param)

self.get_camera_trans takes cam_param as input, and I am confused about the meaning of cam_param. It seems like:

cam_param[:, 2]: a scale factor
cam_param[:, :2]: [tx, ty]

In function self.get_camera_trans:

def get_camera_trans(self, cam_param):
    # camera translation
    t_xy = cam_param[:, :2]
    gamma = torch.sigmoid(cam_param[:, 2])  # apply sigmoid to make it positive
    k_value = torch.FloatTensor([
        math.sqrt(cfg.focal[0] * cfg.focal[1] * cfg.camera_3d_size * cfg.camera_3d_size / (cfg.input_body_shape[0] * cfg.input_body_shape[1]))
    ]).cuda().view(-1)
    t_z = k_value * gamma
    cam_trans = torch.cat((t_xy, t_z[:, None]), 1)
    return cam_trans

What is the meaning of gamma and k_value? Moreover, as far as I know, the camera translation should be further adjusted, since the input image is cropped before inference. If I want the camera translation in the full image, which part should I change, or should I recompute cam_param?
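For context, my mental model is that cam_trans is added to the root-relative mesh and then projected with the virtual intrinsics. A rough sketch of that projection, with placeholder focal/princpt values (not the actual config values):

```python
def project_point(p, cam_trans, focal=(5000.0, 5000.0), princpt=(96.0, 128.0)):
    # p: root-relative 3D point; cam_trans: (tx, ty, tz) from get_camera_trans.
    # focal/princpt are placeholders standing in for the virtual intrinsics.
    X, Y, Z = (p[i] + cam_trans[i] for i in range(3))
    return (focal[0] * X / Z + princpt[0], focal[1] * Y / Z + princpt[1])
```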

wqyin commented 2 months ago

Hello, thanks for your interest in our work.

You may interpret gamma as a predicted scale factor and k_value as the combined virtual camera intrinsics defined in the config file.
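To make the intuition concrete, here is a minimal sketch with hypothetical config numbers (not the actual SMPLer-X values). Under the pinhole model, an object of metric size s at depth z spans roughly focal * s / z pixels, so k_value is the reference depth at which an object of size camera_3d_size would fill the whole crop:

```python
import math

# Hypothetical values for illustration only:
focal = (5000.0, 5000.0)        # virtual focal lengths (fx, fy)
camera_3d_size = 2.5            # assumed metric size of the 3D space around the body
input_body_shape = (256, 192)   # network input resolution (height, width)

# Reference depth: solve pixel area = H * W for z in the pinhole model.
k_value = math.sqrt(
    focal[0] * focal[1] * camera_3d_size ** 2
    / (input_body_shape[0] * input_body_shape[1])
)

def predicted_depth(depth_logit):
    # gamma in (0, 1) via sigmoid; t_z = k_value * gamma rescales the
    # reference depth per sample, as in get_camera_trans above.
    gamma = 1.0 / (1.0 + math.exp(-depth_logit))
    return k_value * gamma
```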

If you need to transform the projection from bbox space to full-image space, you may refer to this code in the inference pipeline, where we transform the camera intrinsics using the bbox information from the full image.
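A minimal sketch of what that transform amounts to, with illustrative names (not the exact repo API): resizing the crop from bbox size (w, h) to input_body_shape rescales the focal lengths, and pasting the crop back at (x, y) shifts the principal point. The predicted cam_trans can then be reused unchanged with the rescaled intrinsics:

```python
def crop_to_full_intrinsics(bbox, virtual_focal, virtual_princpt, input_body_shape):
    # bbox = (x, y, w, h) in full-image pixels; input_body_shape = (H, W).
    x, y, w, h = bbox
    H, W = input_body_shape
    # Resizing the crop by (w / W, h / H) scales the focal lengths the same way;
    # translating the crop back to (x, y) shifts the principal point.
    focal = (virtual_focal[0] * w / W, virtual_focal[1] * h / H)
    princpt = (virtual_princpt[0] * w / W + x, virtual_princpt[1] * h / H + y)
    return focal, princpt
```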

jameskuma commented 2 months ago

Sure! Thank you for the reply! That cleared up my confusion, so I'll close this issue.