anibali / margipose

3D monocular human pose estimation
Apache License 2.0

Marginal heatmap explanation #15

Closed · gianscarpe closed this 4 years ago

gianscarpe commented 4 years ago

Hi, I have a question regarding the 2D marginal heatmaps. I don't fully understand: do you use camera parameters to project the xyz coordinates into camera coordinates? If so, how do you manage to project onto the xz and yz planes? Thanks in advance, Gianluca

anibali commented 4 years ago

See the section of code here: https://github.com/anibali/margipose/blob/944da16de0af5e5860807612530d1b4c981e92f0/src/margipose/models/margipose_model.py#L228-L230

Here target_xyz contains the (normalised) coordinates in 3D space, and target_xy, target_zy, and target_xz are the coordinates in different 2D planes derived from target_xyz. Essentially it's just a matter of dropping one of the coordinates (so basically an orthogonal projection), a process which does not require camera parameters.
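
For illustration, here is a minimal sketch of that slicing step. The tensor names mirror the linked code, but the exact indexing below is an assumption rather than a copy of the repository's implementation:

```python
import torch

# target_xyz: (batch, joints, 3) normalised coordinates, columns (x, y, z).
target_xyz = torch.randn(2, 17, 3)

# Each marginal plane keeps two of the three coordinate columns and drops
# the third (an orthogonal projection), so no camera parameters are needed.
target_xy = target_xyz[..., [0, 1]]  # front view: drop z
target_zy = target_xyz[..., [2, 1]]  # side view: drop x
target_xz = target_xyz[..., [0, 2]]  # top view: drop y
```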

gianscarpe commented 4 years ago

Hi, thank you for your prompt reply. I don't fully follow. How can you overlay the xy heatmaps predicted by your model on the input image? If your model predicts the xy plane of a "world" coordinate system, this means that the "world" is perfectly aligned with your camera. Am I right?

anibali commented 4 years ago

First of all, it's worth being clear about the fact that the xyz coordinates are camera-relative (i.e. in "camera space"), not in some world space with an arbitrary origin. So the camera position and orientation are effectively not factors in solving the task. ~However, this still means that there's a small difference between perspective projection using the camera intrinsics (i.e. how the pixels are placed in the input image) and the orthographic projection of just dropping the z coordinate (i.e. how pixels are placed in the xy heatmaps). I think that this discrepancy might be what you're asking about?~

~In practice the discrepancy just does not seem to be an issue---the model is able to use information from the perspective-projected input image to predict the orthographic-projected xy heatmaps quite accurately. I suppose you can think of this as the model learning a little bit about the camera intrinsic parameters themselves from the input image. I'll admit that it gets a bit strange when you also consider training on data with 2D image-space annotations since then the targets are perspective-projected (not orthographic), but the extra data does seem to help with generalisation regardless.~

EDIT: I made a mistake in my description above. Since the normalisation procedure puts things into NDC space, dropping the z coordinate is actually equivalent to perspective projection when you take a holistic view, so there isn't a discrepancy. Put another way, the normalised coordinates do not have orthogonal basis vectors, they are coordinates in a trapezoidal viewing frustum constructed using camera intrinsics.
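
As a quick numerical illustration of that point (the intrinsics and point below are made up, and this is not the repository's normalisation code):

```python
import torch

# Made-up pinhole intrinsics, principal point at the origin.
fx, fy = 1500.0, 1500.0
xyz_cam = torch.tensor([[300.0, -150.0, 3000.0]])  # camera-space point (mm)
x, y, z = xyz_cam.unbind(-1)

# Perspective projection onto the image plane.
u = fx * x / z
v = fy * y / z

# NDC-style coordinates: x and y have already been divided by depth, so the
# basis follows the camera's viewing frustum rather than orthogonal axes.
ndc = torch.stack([fx * x / z, fy * y / z, z], -1)

# Dropping the z column of the NDC coordinates reproduces (u, v) exactly,
# i.e. a perspective projection, not an orthographic one.
assert torch.allclose(ndc[..., :2], torch.stack([u, v], -1))
```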

gianscarpe commented 4 years ago

I got it! Really interesting. I have another question for you, related to the 3D coordinates. In particular, I'm using Kornia's normalize_pixel_coordinates3d to normalise the joint coords (now referenced to the camera coordinate system), because I noticed you contributed to that too. However, I think normalize_pixel_coordinates3d expects the coordinates to be positive along all the axes, which is not my case. Do you have any suggestions? I noticed you developed pose3d_utils to normalise/denormalise 3D coords. Is it still valid? Your help is really appreciated, thank you a lot!

anibali commented 4 years ago

The normalisation code in pose3d_utils is quite different to what Kornia's normalize_pixel_coordinates3d does.
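
Roughly speaking, the Kornia function is a plain linear rescale of non-negative grid indices into [-1, 1], with no camera model involved. Here is a minimal sketch of that mapping (not Kornia's actual source):

```python
import torch

def normalize_voxel_coords(coords, depth, height, width):
    """Linearly rescale (x, y, z) indices from [0, size - 1] to [-1, 1].

    This mirrors the mapping that Kornia's normalize_pixel_coordinates3d
    performs on a pixel/voxel grid, which is why it does not suit signed,
    camera-relative metric coordinates.
    """
    sizes = torch.tensor([width, height, depth], dtype=coords.dtype)
    return 2.0 * coords / (sizes - 1.0) - 1.0

coords = torch.tensor([[0.0, 0.0, 0.0], [63.0, 63.0, 63.0]])
print(normalize_voxel_coords(coords, depth=64, height=64, width=64))
# tensor([[-1., -1., -1.],
#         [ 1.,  1.,  1.]])
```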

mshooter commented 1 year ago

I know this is an old question, but just to be clear: do the xy heatmaps not correspond to image space?

anibali commented 1 year ago

@mshooter It's been a long time since I've thought about this, but if my memory serves correctly they do correspond.

mshooter commented 1 year ago

@anibali It seems that the xyz coordinates are in camera space and then projected. However, how do you project the 3D coordinates into image space if you do not have the intrinsics (for example, the MPII dataset)?

anibali commented 1 year ago

The model works with coordinates in normalised device coordinates (NDC space). This means that e.g. points in the XY heatmap correspond to points in the input image. Now if you want to go from NDC space to a metric space after prediction (e.g. you want things expressed in millimetres), then yes, you need a little bit more information. For this you could assume camera intrinsics and a known depth/person height. Please refer to https://github.com/anibali/margipose/issues/15#issuecomment-643532400 and https://github.com/anibali/margipose/issues/15#issuecomment-643845855
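
For anyone needing that last step, here is a rough sketch of one possible lifting from NDC-style predictions back to millimetres. The focal lengths and root-joint depth are assumed values, the treatment of the z component is a simplification, and this is not the denormalisation routine from pose3d_utils:

```python
import torch

def ndc_to_camera_mm(ndc, fx, fy, z_root):
    """Lift NDC-style coordinates back to metric camera space (mm).

    Assumes a pinhole camera with the principal point at the origin, and
    (hypothetically) treats the normalised z component as a metric offset
    from an assumed root-joint depth z_root.
    """
    x_n, y_n, z_n = ndc.unbind(-1)
    z = z_root + z_n       # assumed depth model
    x = x_n * z / fx       # invert the perspective divide
    y = y_n * z / fy
    return torch.stack([x, y, z], -1)

# Example: an assumed 1500 px focal length and a person roughly 3 m away.
pred_ndc = torch.tensor([[0.1, -0.2, 50.0]])
print(ndc_to_camera_mm(pred_ndc, fx=1500.0, fy=1500.0, z_root=3000.0))
```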

mshooter commented 1 year ago

Thank you!