Daniil-Osokin / lightweight-human-pose-estimation-3d-demo.pytorch

Real-time 3D multi-person pose estimation demo in PyTorch. OpenVINO backend can be used for fast inference on CPU.
Apache License 2.0

Displaying poses_3D question #89

Closed by adammpolak 2 years ago

adammpolak commented 2 years ago

@Daniil-Osokin thank you again for this great work!

Question regarding demo.py, line 101:

        poses_3d, poses_2d = parse_poses(inference_result, input_scale, stride, fx, is_video)
        edges = []
        if len(poses_3d):
            poses_3d = rotate_poses(poses_3d, R, t)
            poses_3d_copy = poses_3d.copy()
            x = poses_3d_copy[:, 0::4]
            y = poses_3d_copy[:, 1::4]
            z = poses_3d_copy[:, 2::4]
            poses_3d[:, 0::4], poses_3d[:, 1::4], poses_3d[:, 2::4] = -z, x, -y

            poses_3d = poses_3d.reshape(poses_3d.shape[0], 19, -1)[:, :, 0:3]
            edges = (Plotter3d.SKELETON_EDGES + 19 * np.arange(poses_3d.shape[0]).reshape((-1, 1, 1))).reshape((-1, 2))
        plotter.plot(canvas_3d, poses_3d, edges)

I understand that poses_3d = rotate_poses(poses_3d, R, t) applies the camera extrinsics when they are provided, so the poses are moved into world coordinates.

What is going on with:

            x = poses_3d_copy[:, 0::4]
            y = poses_3d_copy[:, 1::4]
            z = poses_3d_copy[:, 2::4]
            poses_3d[:, 0::4], poses_3d[:, 1::4], poses_3d[:, 2::4] = -z, x, -y

I can't wrap my head around what is happening to poses_3d here, or why. It seems like poses_3d_copy is never used anywhere else in demo.py, so what is it for? And why does poses_3d end up equal to (-z, x, -y)?
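On the question of what poses_3d_copy is for: a minimal NumPy sketch, not taken from the repository, suggests the copy matters because strided slices such as `a[:, 0::4]` are views. The function names and the one-joint test pose below are hypothetical. If x, y and z were sliced from poses_3d itself, the first assignment would overwrite the x values before they are written to the second column.

```python
import numpy as np

def swap_with_copy(flat):
    """Same pattern as the quoted snippet: read from a copy, write to the original."""
    out = flat.copy()
    src = out.copy()                      # plays the role of poses_3d_copy
    x, y, z = src[:, 0::4], src[:, 1::4], src[:, 2::4]
    out[:, 0::4], out[:, 1::4], out[:, 2::4] = -z, x, -y
    return out

def swap_without_copy(flat):
    """Hypothetical variant that slices the array being modified."""
    out = flat.copy()
    x, y, z = out[:, 0::4], out[:, 1::4], out[:, 2::4]   # views into out
    # -z and -y are new arrays, but x is still a view, so it changes
    # as soon as column 0 is overwritten.
    out[:, 0::4], out[:, 1::4], out[:, 2::4] = -z, x, -y
    return out

# One joint stored as (x, y, z, fourth value).
pose = np.array([[1.0, 2.0, 3.0, 0.5]])
good = swap_with_copy(pose)       # [[-3.,  1., -2., 0.5]]
bad = swap_without_copy(pose)     # [[-3., -3., -2., 0.5]]
```

With the copy, the result is the intended (-z, x, -y). Without it, the original x is lost.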

adammpolak commented 2 years ago

If it helps, I printed poses_3d at each stage of the transformation. poses_3d before the axis swap:

[[ -83.382454   -131.38255      61.1158        0.81301624  -99.85116
  -147.36206      54.76044       0.8590477   -79.05287     -81.69709
    65.751114     -1.          -86.26206    -129.5979       47.067123
     0.8386035   -79.0436     -105.39628      35.807697      0.7910373
   -77.33806     -83.324005     34.038242      0.8767383   -75.01585
   -78.021454     54.871674      0.62968177  -74.41736     -42.293255
    53.6231       -1.          -65.61671      -8.406743     55.606236
    -1.          -89.153206   -133.53835      74.301636      0.7498561
   -96.38407    -119.759384     96.08518       0.6998326  -107.42982
  -126.904724     92.74384       0.59725463  -82.1068      -81.90032
    74.50095       0.6391789   -79.82081     -45.627525     77.24726
    -1.          -70.16078     -11.660636     79.59954      -1.
   -98.39009    -148.4475       53.131226      0.8671008   -87.54239
  -145.86267      51.475685      0.7745691  -100.569756   -151.13942
    58.977722      0.8082759   -95.438286   -148.54828      63.06624
     0.71000135]]
poses_3d after -z, x, -y (line 107)
[[ -61.1158      -83.382454    131.38255       0.81301624  -54.76044
   -99.85116     147.36206       0.8590477   -65.751114    -79.05287
    81.69709      -1.          -47.067123    -86.26206     129.5979
     0.8386035   -35.807697    -79.0436      105.39628       0.7910373
   -34.038242    -77.33806      83.324005      0.8767383   -54.871674
   -75.01585      78.021454      0.62968177  -53.6231      -74.41736
    42.293255     -1.          -55.606236    -65.61671       8.406743
    -1.          -74.301636    -89.153206    133.53835       0.7498561
   -96.08518     -96.38407     119.759384      0.6998326   -92.74384
  -107.42982     126.904724      0.59725463  -74.50095     -82.1068
    81.90032       0.6391789   -77.24726     -79.82081      45.627525
    -1.          -79.59954     -70.16078      11.660636     -1.
   -53.131226    -98.39009     148.4475        0.8671008   -51.475685
   -87.54239     145.86267       0.7745691   -58.977722   -100.569756
   151.13942       0.8082759   -63.06624     -95.438286    148.54828
     0.71000135]]
poses_3d after reshape (line 109)
[[[ -61.1158    -83.382454  131.38255 ]
  [ -54.76044   -99.85116   147.36206 ]
  [ -65.751114  -79.05287    81.69709 ]
  [ -47.067123  -86.26206   129.5979  ]
  [ -35.807697  -79.0436    105.39628 ]
  [ -34.038242  -77.33806    83.324005]
  [ -54.871674  -75.01585    78.021454]
  [ -53.6231    -74.41736    42.293255]
  [ -55.606236  -65.61671     8.406743]
  [ -74.301636  -89.153206  133.53835 ]
  [ -96.08518   -96.38407   119.759384]
  [ -92.74384  -107.42982   126.904724]
  [ -74.50095   -82.1068     81.90032 ]
  [ -77.24726   -79.82081    45.627525]
  [ -79.59954   -70.16078    11.660636]
  [ -53.131226  -98.39009   148.4475  ]
  [ -51.475685  -87.54239   145.86267 ]
  [ -58.977722 -100.569756  151.13942 ]
  [ -63.06624   -95.438286  148.54828 ]]]

It seems like the reshape at line 109 leaves each joint as an (x, y, z) triple.

Or is the final result in a different order, (-z, x, -y)?

Also, it seems like the final result has its origin relative to the detected body rather than the camera position. How do I get the coordinates in camera space?
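The reshape step can be illustrated with a small sketch. It assumes, based on the printed arrays, that each pose is a flat row of 19 joints with 4 values per joint. The fourth value looks like a per-joint score (it is -1 for some joints), but that reading is an assumption. The array contents here are placeholder numbers, not real poses.

```python
import numpy as np

NUM_JOINTS = 19  # the 19 used in the reshape call in demo.py

# Hypothetical flat pose: joint j occupies indices 4*j .. 4*j + 3.
flat = np.arange(NUM_JOINTS * 4, dtype=np.float32).reshape(1, -1)

# Group the flat row into one row of 4 values per joint.
joints = flat.reshape(flat.shape[0], NUM_JOINTS, -1)   # shape (1, 19, 4)

# Keep the three coordinates and drop the fourth value.
xyz = joints[:, :, 0:3]                                # shape (1, 19, 3)
```

Because the reshape runs after the axis swap, the three columns of the printed result hold (-z, x, -y) of the pre-swap values. The first row of the printed output, (-61.1158, -83.382454, 131.38255), matches this against the first joint before the swap, (-83.382454, -131.38255, 61.1158).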

Daniil-Osokin commented 2 years ago

Hi! rotate_poses transforms coordinates from camera space to world space, so the poses_3d returned by parse_poses is in camera space. The axis swap that follows is there to match the 3D visualizer code. It looks like a legacy extra transform and could be refactored away.
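Following this answer, the simplest way to get camera-space coordinates is probably to keep a copy of poses_3d right after parse_poses, before rotate_poses is applied. If only the final (N, 19, 3) array is available, the axis swap can be undone as sketched below. `undo_axis_swap` is a hypothetical helper, not part of the repository, and it only reverses the visualizer swap, so the result is still in the space produced by rotate_poses.

```python
import numpy as np

def undo_axis_swap(poses_vis):
    """Invert (x', y', z') = (-z, x, -y) for an array of shape (..., 3)."""
    poses = np.empty_like(poses_vis)
    poses[..., 0] = poses_vis[..., 1]     # x = y'
    poses[..., 1] = -poses_vis[..., 2]    # y = -z'
    poses[..., 2] = -poses_vis[..., 0]    # z = -x'
    return poses

# First joint from the printed output after the swap and reshape.
first_joint_vis = np.array([[[-61.1158, -83.382454, 131.38255]]])
restored = undo_axis_swap(first_joint_vis)
```

Applied to the first joint of the printed output, this gives back (-83.382454, -131.38255, 61.1158), which is the first joint of the array printed before the swap.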

Daniil-Osokin commented 2 years ago

Hope it is clear now.

adammpolak commented 2 years ago

Thank you!