Nicholasli1995 / EvoSkeleton

Official project website for the CVPR 2020 paper (Oral Presentation) "Cascaded deep monocular 3D human pose estimation with evolutionary training data"
https://arxiv.org/abs/2006.07778
MIT License

Training input size and inference input size do not match #48

Closed sunmengnan closed 3 years ago

sunmengnan commented 3 years ago

Hi, in your training script h36m2Dpose.py, after loading the dataset the input shape is (34,), but in inference.py the input size is (32,). Why is that?

Nicholasli1995 commented 3 years ago

Hi, in your training script h36m2Dpose.py, after loading the dataset the input shape is (34,), but in inference.py the input size is (32,). Why is that?

Hi, the inference script does not use the nose joint.

sunmengnan commented 3 years ago

Thanks. So should we modify the training script to delete the nose joint after loading the dataset, so that the shape of w1.weights becomes (32,)?

sunmengnan commented 3 years ago

I am not sure how to deal with the mismatch.

Nicholasli1995 commented 3 years ago

I am not sure how to deal with the mismatch.

Hi, you can control which joints to use in this method (for example, there is a use_nose argument): https://github.com/Nicholasli1995/EvoSkeleton/blob/b2b355f4c1fa842709f100d931189ce80008f6ef/libs/dataset/h36m/data_utils.py#L541. The model will be initialized based on the used joints.

The default training setting is used for reproducing results on the indoor dataset H36M, where all 17 joints are used.

For in-the-wild images, you may not have a detector that produces all 17 joints. In that case you can train the 2D-to-3D network by specifying the used joints yourself and enabling opt.norm_single to discard indoor location information.
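
For illustration, here is a minimal sketch of the kind of joint selection involved, showing how dropping one joint (e.g. the nose) turns a 17-joint (34-dim) input into a 16-joint (32-dim) one. The helper name and joint index below are hypothetical, not the repo's actual loader or joint ordering.

```python
import numpy as np

def drop_joint(flat_2d, joint_idx):
    """Remove one joint from a flattened (x, y) keypoint vector (hypothetical helper)."""
    pts = flat_2d.reshape(-1, 2)            # (num_joints, 2)
    keep = np.ones(len(pts), dtype=bool)
    keep[joint_idx] = False                 # drop the chosen joint, e.g. the nose
    return pts[keep].reshape(-1)            # flatten back

pose_17 = np.random.randn(34)               # 17 joints * 2 = 34 values
pose_16 = drop_joint(pose_17, joint_idx=9)  # illustrative index, not the repo's ordering
print(pose_16.shape)                        # (32,)
```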

sunmengnan commented 3 years ago

I see. But the detector you provided does output the nose point. I just added the nose joint in inference, and this time there is no error, but the result seems to be incorrect. Is there any other place that needs to be changed?

Nicholasli1995 commented 3 years ago

I see. But the detector you provided does output the nose point. I just added the nose joint in inference, and this time there is no error, but the result seems to be incorrect. Is there any other place that needs to be changed?

Can you elaborate on the incorrect results? The default in-the-wild inference model takes 16 joints; how did you use 17 joints? If you are using the 2D-to-3D model trained with indoor data for in-the-wild inference, it may not work. You need to set opt.norm_single to True and re-train the model.

sunmengnan commented 3 years ago

Thanks a lot for your instructions. I will close it and reopen if there are some related issues.

sunmengnan commented 3 years ago

Hi, as you can see in the result, I used the 2D-3D model you provided to infer from the image. It uses 16 points in both the 2D net and the 2D-to-3D net, but the result seems to be incorrect: the left elbow point and the knee depth are much too big, and other images' results have the same problem. Why is that? Waiting for your reply, thanks.

Nicholasli1995 commented 3 years ago

Hi, as you can see in the result, I used the 2D-3D model you provided to infer from the image. It uses 16 points in both the 2D net and the 2D-to-3D net, but the result seems to be incorrect: the left elbow point and the knee depth are much too big, and other images' results have the same problem. Why is that? Waiting for your reply, thanks.

Hi, the output 3D coordinates are relative to the hip in the camera coordinate system, and the unit is mm. The "depth" does not indicate the real 3D location (how far the subject is from the camera). To solve for the real 3D location, we need the camera intrinsics of the specific image (focal length, etc.).

sunmengnan commented 3 years ago

How is the z-axis direction defined? Some of the depth values printed above go from in to out, while others go from out to in.

sunmengnan commented 3 years ago

Dear Nicholasli, could you explain what else is needed besides the focal length to solve for the real 3D location, and how to solve it? Thanks a lot.

Nicholasli1995 commented 3 years ago

How is the z-axis direction defined? Some of the depth values printed above go from in to out, while others go from out to in.

The z direction is perpendicular to the image plane and follows the right-hand rule, z = np.cross(x, y), where x (pointing right) and y (pointing down) are the image axes.
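
For concreteness, a quick numpy check of that convention (the arrays here are just the unit image axes):

```python
import numpy as np

x = np.array([1.0, 0.0, 0.0])  # image x-axis: points right
y = np.array([0.0, 1.0, 0.0])  # image y-axis: points down
z = np.cross(x, y)             # right-hand rule
print(z)                       # [0. 0. 1.] -> z points into the scene, away from the camera
```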

Nicholasli1995 commented 3 years ago

Dear Nicholasli, could you explain what else is needed besides the focal length to solve for the real 3D location, and how to solve it? Thanks a lot.

Assuming you have the camera intrinsic parameters, you can solve a PnP problem to get the translation from the 2D key-points and the predicted 3D key-points. See the OpenCV solvePnP function, for example.

sunmengnan commented 3 years ago

Dear Nicholasli, could you explain what else is needed besides the focal length to solve for the real 3D location, and how to solve it? Thanks a lot.

Assuming you have the camera intrinsic parameters, you can solve a PnP problem to get the translation from the 2D key-points and the predicted 3D key-points. See the OpenCV solvePnP function, for example.

Does that solvePnP return the extrinsic params, and do we then use (2D points)·(extrinsic matrix)·(intrinsic matrix) to get the real 3D coordinates?

Nicholasli1995 commented 3 years ago

Does that solvePnP return the extrinsic params, and do we then use (2D points)·(extrinsic matrix)·(intrinsic matrix) to get the real 3D coordinates?

You have 2D key-points and intrinsics. You predict relative 3D coordinates using this repo. Then you solve for the translation. Please refer to https://docs.opencv.org/3.4/d9/d0c/group__calib3d.html#ga549c2075fac14829ff4a58bc931c033d and see cv.solvePnP( objectPoints, imagePoints, cameraMatrix, distCoeffs[, rvec[, tvec[, useExtrinsicGuess[, flags]]]] ) -> retval, rvec, tvec.
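
For illustration, a minimal sketch of that pipeline, assuming the root-relative 3D joints (in mm, camera orientation) and the matching 2D key-points are available; the input arrays below are random placeholders just to make the sketch runnable, and only cv2.solvePnP and cv2.Rodrigues are actual OpenCV calls:

```python
import cv2
import numpy as np

pose_3d = np.random.randn(16, 3) * 100.0   # placeholder: (N, 3) root-relative joints in mm
pose_2d = np.random.rand(16, 2) * 1000.0   # placeholder: (N, 2) matching 2D key-points in pixels
K = np.array([[1145.0, 0.0, 512.0],        # placeholder intrinsic matrix
              [0.0, 1145.0, 512.0],
              [0.0, 0.0, 1.0]])
dist = np.zeros(5)                         # assume no lens distortion

ok, rvec, tvec = cv2.solvePnP(pose_3d, pose_2d, K, dist)

# tvec approximates the root translation; with real data the prediction is already in the
# camera orientation, so rvec should stay close to zero rotation.
R, _ = cv2.Rodrigues(rvec)
pose_3d_abs = (R @ pose_3d.T).T + tvec.reshape(1, 3)  # absolute camera-frame coordinates
```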

sunmengnan commented 3 years ago

You have 2D key-points and intrinsics. You predict relative 3D coordinates using this repo. Then you solve for the translation. Please refer to https://docs.opencv.org/3.4/d9/d0c/group__calib3d.html#ga549c2075fac14829ff4a58bc931c033d and see cv.solvePnP( objectPoints, imagePoints, cameraMatrix, distCoeffs[, rvec[, tvec[, useExtrinsicGuess[, flags]]]] ) -> retval, rvec, tvec.

For in-the-wild images, can we use approximate camera params to solve for the translation? The camera intrinsics are hard to obtain through calibration, and some images come without any camera information.

Nicholasli1995 commented 3 years ago

For in-the-wild images, can we use approximate camera params to solve for the translation? The camera intrinsics are hard to obtain through calibration, and some images come without any camera information.

You can, but the solved translation is only meaningful with respect to your assumed camera parameters. The results can be used for visualization but will not be metrically correct. You should not expect accurate results without camera parameters.
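
For illustration, one common rough guess (not part of this repo) is to put the principal point at the image center and approximate the focal length by the image size; the helper below is hypothetical and, as noted above, only good enough for visualization:

```python
import numpy as np

def approx_intrinsics(img_w, img_h):
    """Hypothetical helper: build an approximate camera matrix from the image size."""
    f = float(max(img_w, img_h))            # rough focal-length guess in pixels
    return np.array([[f, 0.0, img_w / 2.0],
                     [0.0, f, img_h / 2.0],
                     [0.0, 0.0, 1.0]])

K_approx = approx_intrinsics(1920, 1080)
```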

sunmengnan commented 3 years ago

Hi, I used an H3.6M image and the corresponding camera R and T to infer the world coordinates, but the depth is still not correct. I don't know why; do you have any idea?

Nicholasli1995 commented 3 years ago

Hi, I used an H3.6M image and the corresponding camera R and T to infer the world coordinates, but the depth is still not correct. I don't know why; do you have any idea?

Can you be more specific? What do you mean by "not correct"? What is your goal?

sunmengnan commented 3 years ago

My goal is to output each joint's real-world coordinates x, y, and z, where z stands for depth. For example, as you can see in the screenshot, the 3rd point stands for the right foot; its depth should be smaller than that of the 6th point (left foot), but the printed depth is 2117.768, greater than 1894.1997. You can also compare the depths of other joints, like the right foot and the left hand; they are all incorrect.

sunmengnan commented 3 years ago

I used an H3.6M image to get relative 3D coordinates with this repo, and the corresponding R, T read from the H3.6M camera.npy, to get each joint's real-world coordinates.

Nicholasli1995 commented 3 years ago

My goal is to output each joint's real-world coordinates x, y, and z, where z stands for depth. For example, as you can see in the screenshot, the 3rd point stands for the right foot; its depth should be smaller than that of the 6th point (left foot), but the printed depth is 2117.768, greater than 1894.1997. You can also compare the depths of other joints, like the right foot and the left hand; they are all incorrect.

You are confusing the world coordinate system with the camera coordinate system.
The depth you mean is actually measured in the camera coordinate system (z-axis perpendicular to the image plane and pointing toward the person). Look at the 3D plot: do you see the (-500, 500) marks on the z-axis? The value grows larger toward more distant regions. That plot shows the relative pose in the camera coordinate system, which is CORRECT. If you want the absolute depth, add that relative pose to the root location (note: also in the camera coordinate system). Note that the world coordinate system in this dataset does not align with the camera coordinate system; that's why you think the result is incorrect.

If you don't understand English terminology, you can reply in Chinese and I can explain in Chinese.
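
For illustration, a minimal sketch of adding the relative pose to a known root location; the arrays are placeholders, and the root would come from the dataset annotation or from the PnP translation discussed above:

```python
import numpy as np

pose_rel = np.random.randn(16, 3) * 100.0   # placeholder: root-relative joints in mm, camera frame
root_cam = np.array([50.0, 200.0, 4500.0])  # placeholder: root joint location in the camera frame

pose_abs = pose_rel + root_cam              # absolute joints in the camera frame
depths = pose_abs[:, 2]                     # per-joint depth along the camera z-axis
```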

sunmengnan commented 3 years ago

My goal is to output each joint's real-world coordinates x, y, and z, where z stands for depth. For example, as you can see in the screenshot, the 3rd point stands for the right foot; its depth should be smaller than that of the 6th point (left foot), but the printed depth is 2117.768, greater than 1894.1997. You can also compare the depths of other joints, like the right foot and the left hand; they are all incorrect.

You are confusing the world coordinate system with the camera coordinate system. The depth you mean is actually measured in the camera coordinate system (z-axis perpendicular to the image plane and pointing toward the person). Look at the 3D plot: do you see the (-500, 500) marks on the z-axis? The value grows larger toward more distant regions. That plot shows the relative pose in the camera coordinate system, which is CORRECT. If you want the absolute depth, add that relative pose to the root location (note: also in the camera coordinate system). Note that the world coordinate system in this dataset does not align with the camera coordinate system; that's why you think the result is incorrect.

If you don't understand English terminology, you can reply in Chinese and I can explain in Chinese.

The world coordinate system in this dataset is already aligned with the camera coordinate system. You can see it in the screenshot: world_coordinate = cameras.camera_to_world_frame(depth_array, R, T)

Nicholasli1995 commented 3 years ago

The world coordinate system in this dataset is already aligned with the camera coordinate system. You can see it in the screenshot: world_coordinate = cameras.camera_to_world_frame(depth_array, R, T)

You did not get my point. The x, y, and z axes of the world coordinate system do not align with those of the camera coordinate system. Perhaps you can plot these 6 vectors to see the difference.

This is not a bug and I'm closing this issue. You can still reply or post questions in the discussion section.

sunmengnan commented 3 years ago

Earlier you said P_world = R P_Camera + t = Rt P_Camera, so once we have R and T we can compute the depth in the world coordinate system, but then you said they are not aligned like that. So what do you mean? How do I compute the depth in the world coordinate system?

Nicholasli1995 commented 3 years ago

Earlier you said P_world = R P_Camera + t = Rt P_Camera, so once we have R and T we can compute the depth in the world coordinate system, but then you said they are not aligned like that. So what do you mean? How do I compute the depth in the world coordinate system?

Hi, you need to distinguish between the world coordinate system and the camera coordinate system, and think through the coordinate transformation between them. The depth you want is actually the person's z-coordinate in the camera coordinate system. Take a look at the coordinate systems in this figure: https://learnopencv.com/geometry-of-image-formation/ Note that the z-axis of the camera coordinate system points along your viewing direction.

Now look at the three axes of the world coordinate system: is the world z-axis the viewing direction you want? The relation between them and the three camera axes can be arbitrary and is determined by R and t here. P_world = R P_Camera + t = Rt P_Camera does transform coordinates from the camera coordinate system to the world coordinate system, but those are not the coordinates you want. The axes of the H36M world coordinate system do not align with those of the camera coordinate system; that is what I meant.

To get depths in the camera coordinate system, you just need to record the camera-frame coordinates of the person's root joint first. The predicted relative 3D pose is also in the camera coordinate system, so adding the two gives you the camera-frame coordinates of all the joints you want.
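
For illustration, a small numpy sketch of why the world z-coordinate after P_world = R P_Camera + t is generally not the viewing depth; R and t below are made-up placeholders, not the H36M extrinsics:

```python
import numpy as np

R = np.array([[1.0, 0.0, 0.0],              # placeholder rotation: a 90-degree flip between frames
              [0.0, 0.0, -1.0],
              [0.0, 1.0, 0.0]])
t = np.array([0.0, 0.0, 1000.0])            # placeholder translation

p_cam = np.array([100.0, -200.0, 4500.0])   # camera-frame point: its depth is p_cam[2] = 4500
p_world = R @ p_cam + t                     # world-frame point: [100, -4500, 800]
print(p_cam[2], p_world[2])                 # 4500.0 vs 800.0 -> the world z is not the depth
```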

sunmengnan commented 3 years ago

Hi, I printed the z values in the camera coordinate system. For the image above, the head should have a larger z value than the legs and feet (values grow larger with distance), but the result looks the opposite.

Nicholasli1995 commented 3 years ago

Hi, I printed the z values in the camera coordinate system. For the image above, the head should have a larger z value than the legs and feet (values grow larger with distance), but the result looks the opposite.

Can you print the input used to draw that 3D plot? Print the coordinates passed to the plotting function and take a look.

sunmengnan commented 3 years ago

[screenshot of the printed coordinates passed to the plotting function]

Nicholasli1995 commented 3 years ago

The coordinates here should be in x, z, y order. Note that the coordinate order is changed before plotting. https://github.com/Nicholasli1995/EvoSkeleton/blob/b2b355f4c1fa842709f100d931189ce80008f6ef/examples/inference.py#L158 https://github.com/Nicholasli1995/EvoSkeleton/blob/b2b355f4c1fa842709f100d931189ce80008f6ef/examples/inference.py#L145
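
For illustration, a small sketch of undoing that ordering before reading depths; the variable names are hypothetical:

```python
import numpy as np

plotted = np.random.randn(16, 3)   # placeholder: columns assumed to be (x, z, y) as used for plotting
xyz = plotted[:, [0, 2, 1]]        # reorder columns back to (x, y, z)
depth = xyz[:, 2]                  # this column is the camera-frame z (depth)
```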

sunmengnan commented 3 years ago

Thanks a lot! Is the unit mm?

Nicholasli1995 commented 3 years ago

Thanks a lot! Is the unit mm?

Yes.