Daniil-Osokin / lightweight-human-pose-estimation-3d-demo.pytorch

Real-time 3D multi-person pose estimation demo in PyTorch. OpenVINO backend can be used for fast inference on CPU.
Apache License 2.0
653 stars 137 forks

Some questions about the code #99

Closed sk-zhang closed 11 months ago

sk-zhang commented 11 months ago

@Daniil-Osokin First of all, thanks a lot for this nice work. I have some questions about the code in the "translate poses" section. I want to convert the predicted relative 3D coordinates to the world coordinate system, so I looked at the code below, but didn't quite get the idea behind it. So I would like to ask the following questions: first, is this procedure specific to this project, or can it be applied to convert any relative 3D coordinates to world 3D coordinates? Second, if so, can you explain the idea behind the operation?

```python
# translate poses
for pose_id in range(len(poses_3d)):
    pose_3d = poses_3d[pose_id].reshape((-1, 4)).transpose()
    pose_2d = poses_2d[pose_id][:-1].reshape((-1, 3)).transpose()
    num_valid = np.count_nonzero(pose_2d[2] != -1)
    pose_3d_valid = np.zeros((3, num_valid), dtype=np.float32)
    pose_2d_valid = np.zeros((2, num_valid), dtype=np.float32)
    valid_id = 0
    for kpt_id in range(pose_3d.shape[1]):
        if pose_2d[2, kpt_id] == -1:
            continue
        pose_3d_valid[:, valid_id] = pose_3d[0:3, kpt_id]
        pose_2d_valid[:, valid_id] = pose_2d[0:2, kpt_id]
        valid_id += 1

    pose_2d_valid[0] = pose_2d_valid[0] - features_shape[2] / 2
    pose_2d_valid[1] = pose_2d_valid[1] - features_shape[1] / 2
    mean_3d = np.expand_dims(pose_3d_valid.mean(axis=1), axis=1)
    mean_2d = np.expand_dims(pose_2d_valid.mean(axis=1), axis=1)
    numerator = np.trace(np.dot((pose_3d_valid[:2, :] - mean_3d[:2, :]).transpose(),
                                pose_3d_valid[:2, :] - mean_3d[:2, :])).sum()
    numerator = np.sqrt(numerator)
    denominator = np.sqrt(np.trace(np.dot((pose_2d_valid[:2, :] - mean_2d[:2, :]).transpose(),
                                          pose_2d_valid[:2, :] - mean_2d[:2, :])).sum())
    mean_2d = np.array([mean_2d[0, 0], mean_2d[1, 0], fx * input_scale / stride])
    mean_3d = np.array([mean_3d[0, 0], mean_3d[1, 0], 0])
    translation = numerator / denominator * mean_2d - mean_3d

    if is_video:
        translation = current_poses_2d[pose_id].filter(translation)
    for kpt_id in range(19):
        pose_3d[0, kpt_id] = pose_3d[0, kpt_id] + translation[0]
        pose_3d[1, kpt_id] = pose_3d[1, kpt_id] + translation[1]
        pose_3d[2, kpt_id] = pose_3d[2, kpt_id] + translation[2]
    translated_poses_3d.append(pose_3d.transpose().reshape(-1))
```

Daniil-Osokin commented 11 months ago

Hi! These 3D coordinates are relative to the camera coordinate system. If you have the camera pose with respect to the world coordinate system, you can apply it to the coordinates to get them in the world coordinate system.
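A minimal sketch of what "apply the camera pose" means, assuming the extrinsics are given as a camera-to-world rotation `R` and the camera origin `t` expressed in world coordinates (the names and toy values are hypothetical, not from the repo):

```python
import numpy as np

def camera_to_world(points_cam, R, t):
    """Map (N, 3) keypoints from camera coordinates to world coordinates.

    Per point: X_world = R @ X_cam + t, written with the points as rows.
    """
    return points_cam @ R.T + t

# Toy pose: camera rotated 90 degrees about the world Y axis,
# placed 2 units along world Z.
R = np.array([[0.0, 0.0, 1.0],
              [0.0, 1.0, 0.0],
              [-1.0, 0.0, 0.0]])
t = np.array([0.0, 0.0, 2.0])

pose_cam = np.array([[1.0, 0.5, 3.0]])  # one keypoint in camera coordinates
pose_world = camera_to_world(pose_cam, R, t)
```

If the extrinsics are instead given world-to-camera (as OpenCV calibration returns them), invert first: `R_c2w = R_w2c.T`, `t_c2w = -R_w2c.T @ t_w2c`.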

sk-zhang commented 11 months ago

Thank you for your reply. The part I mostly didn't understand is this:

```python
mean_3d = np.expand_dims(pose_3d_valid.mean(axis=1), axis=1)
mean_2d = np.expand_dims(pose_2d_valid.mean(axis=1), axis=1)
numerator = np.trace(np.dot((pose_3d_valid[:2, :] - mean_3d[:2, :]).transpose(),
                            pose_3d_valid[:2, :] - mean_3d[:2, :])).sum()
numerator = np.sqrt(numerator)
denominator = np.sqrt(np.trace(np.dot((pose_2d_valid[:2, :] - mean_2d[:2, :]).transpose(),
                                      pose_2d_valid[:2, :] - mean_2d[:2, :])).sum())
mean_2d = np.array([mean_2d[0, 0], mean_2d[1, 0], fx * input_scale / stride])
mean_3d = np.array([mean_3d[0, 0], mean_3d[1, 0], 0])
translation = numerator / denominator * mean_2d - mean_3d
```

My understanding is that the means of the 3D and 2D pose coordinates are computed, then the spread of the 2D and 3D coordinates around their respective means, and from these a translation vector for shifting the 3D pose. But I don't understand the rationale behind this.

sk-zhang commented 11 months ago

I mainly want to use this method for multi-person 3D pose display. I currently have the pixel coordinates of the 2D pose, the 3D pose coordinates relative to its own origin, and the intrinsic and extrinsic parameters of the camera. I would like to use this part of the code to transfer the 3D pose into the world coordinate system; how should I apply this method?

Daniil-Osokin commented 11 months ago

> But I don't understand the rationale of this.

It is the closed-form solution for minimum projection error; you can check the 'Monocular 3D Human Pose Estimation In The Wild Using Improved CNN Supervision' paper for the details. The 3D keypoint coordinates are predicted relative to the root keypoint's coordinate system. If the root keypoint is translated to the proper place in 3D space, then the whole pose is in correct 3D coordinates (in the camera coordinate system). So we find the translation vector which minimizes the 3D-to-2D keypoint projection error, given the camera intrinsics. Given the camera pose in the world coordinate system, just apply it to the keypoint coordinates to move them from the camera coordinate system to the world coordinate system.
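The translation snippet quoted above can be condensed into one small function to make the rationale visible: under a weak-perspective assumption, the ratio of the 3D keypoint spread to the 2D keypoint spread gives the depth-per-pixel scale, and scaling the 2D centroid (plus the focal length for the z component) places the root correctly. This is a paraphrase of the repo code, not a drop-in replacement; `pose_2d` is assumed already centered on the principal point and `f` is the focal length in the same pixel units:

```python
import numpy as np

def estimate_translation(pose_3d, pose_2d, f):
    """pose_3d: (3, N) root-relative keypoints; pose_2d: (2, N) centered pixels."""
    mean_3d = pose_3d.mean(axis=1)
    mean_2d = pose_2d.mean(axis=1)
    # Frobenius norms of the centered x/y coordinates
    # (same as sqrt(trace(A.T @ A)) in the original code).
    spread_3d = np.linalg.norm(pose_3d[:2] - mean_3d[:2, None])
    spread_2d = np.linalg.norm(pose_2d - mean_2d[:, None])
    scale = spread_3d / spread_2d  # 3D units per pixel
    # x/y come from the scaled 2D centroid, z from the scaled focal length.
    return scale * np.array([mean_2d[0], mean_2d[1], f]) - np.array([mean_3d[0], mean_3d[1], 0.0])

# Sanity check: synthesize weak-perspective 2D points from a known translation,
# then recover it. Recovery is exact under this construction.
f = 100.0
true_t = np.array([2.0, -1.0, 10.0])
pose_3d = np.array([[0.0, 1.0, -1.0, 0.5],
                    [0.0, 0.5, 1.0, -1.0],
                    [0.0, 0.2, -0.3, 0.1]])
pose_2d = f / true_t[2] * (pose_3d[:2] + true_t[:2, None])
t_est = estimate_translation(pose_3d, pose_2d, f)
```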

sk-zhang commented 11 months ago

Thank you, I understand. I would like to ask one last question, regarding the following line: `mean_2d = np.array([mean_2d[0, 0], mean_2d[1, 0], fx * input_scale / stride])`. If the `mean_2d` I compute is already in original-image coordinates, does `fx` still need to be scaled, or can I use `fx` directly?

Daniil-Osokin commented 11 months ago

You do not need to scale the focal length if `mean_2d` is in the original image space.
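A toy illustration (hypothetical numbers) of the unit bookkeeping: the repo's `mean_2d` lives in network-output (feature-map) pixels, so `fx` must be mapped into the same units; coordinates already in the original image need no rescaling.

```python
fx = 1000.0        # focal length, in original-image pixels
input_scale = 0.5  # original image -> network input resize factor (assumed)
stride = 8         # network input -> output feature-map stride

# Use with mean_2d in feature-map coordinates (as in the repo code):
fx_feature_space = fx * input_scale / stride
# Use with mean_2d in original-image coordinates:
fx_image_space = fx
```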

sk-zhang commented 11 months ago

Thank you so much for the explanation. I'll close the issue.

Daniil-Osokin commented 11 months ago

Great that it helped!