Hi, in your training script h36m2Dpose.py, after loading the dataset the input shape is (34,), but in inference.py the input size is (32,). Why is that?
Hi, the inference script does not use the nose joint.
Thanks. So should we modify the training script to delete the nose joint after loading the dataset, so that w1.weights has shape (32,)?
I am not sure how to deal with the mismatch.
Hi, you can control which joints to use in this method (for example, there is a use_nose argument): https://github.com/Nicholasli1995/EvoSkeleton/blob/b2b355f4c1fa842709f100d931189ce80008f6ef/libs/dataset/h36m/data_utils.py#L541. The model will be initialized based on the used joints.
The default training setting is used to reproduce results on the indoor dataset H36M, where all 17 joints are used.
For in-the-wild images, you may not have a detector that produces all 17 joints. In that case you can train the 2D-to-3D network by specifying the used joints yourself and enabling opt.norm_single to discard indoor location information.
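As a rough sketch of the shape bookkeeping (the joint index below is an assumption; the actual ordering and the use_nose handling live in data_utils.py), dropping the nose joint from a 17-joint 2D pose turns the (34,) training input into the (32,) input used at inference:

```python
import numpy as np

# Hypothetical example: a 17-joint 2D pose flattened to shape (34,).
pose_2d = np.random.randn(17, 2).astype(np.float32)

# Assumed index of the nose joint in this 17-joint layout; check the
# actual joint ordering in data_utils.py before relying on it.
NOSE_INDEX = 9

# Drop the nose joint, leaving 16 joints -> a (32,) input vector.
pose_16 = np.delete(pose_2d, NOSE_INDEX, axis=0)
model_input = pose_16.reshape(-1)
print(model_input.shape)  # (32,)
```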
I see. But the detector you provided does output the nose point. I just added the nose joint in the inference, and this time there is no error, but the result seems to be incorrect. Is there any other place that needs to be changed?
Can you elaborate on the incorrect results? The default in-the-wild inference model takes 16 joints; how did you use 17 joints? If you are using the 2D-to-3D model trained with indoor data for in-the-wild inference, it may not work. You need to set opt.norm_single to True and re-train the model.
Thanks a lot for your instructions. I will close this and reopen it if related issues come up.
Hi, as you can see in the result, I used the 2D and 2D-to-3D models you provided to run inference on the image. It uses 16 points in both the 2D net and the 2D-to-3D net, but the result seems incorrect: the left elbow and knee depths are far too large, and other images' results have the same problem. Why is that? Waiting for your reply, thanks.
Hi, the output 3D coordinates are relative to the hip in the camera coordinate system and the unit is mm. The "depth" does not indicate the real 3D location (how far the subject is from the camera). To solve for the real 3D location, we need the camera intrinsics of the specific image (focal length, etc.).
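A minimal sketch of what "relative to the hip" means numerically (all names and values below are placeholders):

```python
import numpy as np

# Hypothetical prediction for 16 joints in millimetres, expressed in the
# camera coordinate system and relative to the hip (root) joint.
pred_3d = np.random.randn(16, 3) * 300.0
ROOT_INDEX = 0              # assumed index of the hip joint
pred_3d[ROOT_INDEX] = 0.0   # root-relative: the hip sits at the origin

# The z column is only an offset from the hip, not the distance of the
# subject from the camera.
print(pred_3d[:, 2])
```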
How is the z-axis direction defined? Some of the depth values printed above seem to go from in to out, while others go from out to in.
The z direction is perpendicular to the image plane and follows the right-hand rule, z = np.cross(x, y), where x (pointing right) and y (pointing down) are the image axes.
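A quick numpy check of that convention:

```python
import numpy as np

# Image axes: x points right, y points down.
x = np.array([1.0, 0.0, 0.0])
y = np.array([0.0, 1.0, 0.0])

# Right-hand rule: z = x cross y points away from the camera, into the scene.
z = np.cross(x, y)
print(z)  # [0. 0. 1.]
```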
Dear Nicholasli, could you explain what else is needed, except the focal length, to solve for the real 3D location, and how to solve it? Thanks a lot.
Assuming you have the camera intrinsic parameters, you can solve a PnP problem to get the translation from the 2D key-points and the predicted 3D key-points. See OpenCV's solvePnP function for example.
Does solvePnP return the extrinsic parameters, and do we then use (2D points) · (extrinsic matrix) · (intrinsic matrix) to get the real 3D coordinates?
You have 2D key-points and intrinsics. You predict relative 3D coordinates using this repo. Then you solve for the translation. Please refer to https://docs.opencv.org/3.4/d9/d0c/group__calib3d.html#ga549c2075fac14829ff4a58bc931c033d and see cv.solvePnP( objectPoints, imagePoints, cameraMatrix, distCoeffs[, rvec[, tvec[, useExtrinsicGuess[, flags]]]] ) -> retval, rvec, tvec.
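A minimal sketch of that call, assuming 16 matched 2D key-points, the root-relative 3D prediction from this repo, and an intrinsic matrix; all the arrays and intrinsic values below are placeholders:

```python
import cv2
import numpy as np

# Hypothetical inputs (replace with your own data):
# pred_3d: (16, 3) root-relative 3D joints in mm, camera frame
# kpts_2d: (16, 2) matching 2D key-points in pixels
# K:       (3, 3) camera intrinsic matrix
pred_3d = np.random.randn(16, 3) * 300.0
kpts_2d = np.random.rand(16, 2) * 1000.0
K = np.array([[1145.0, 0.0, 512.0],
              [0.0, 1145.0, 512.0],
              [0.0, 0.0, 1.0]])
dist = np.zeros(4)  # assume no lens distortion

# Solve for the rotation and translation that project the predicted 3D
# joints onto the observed 2D key-points.
ok, rvec, tvec = cv2.solvePnP(pred_3d, kpts_2d, K, dist,
                              flags=cv2.SOLVEPNP_ITERATIVE)

# Since the prediction is already expressed in the camera frame, rvec should
# come out close to zero; the absolute camera-frame pose is R * pred_3d + tvec.
R, _ = cv2.Rodrigues(rvec)
abs_pose_cam = pred_3d @ R.T + tvec.reshape(1, 3)
print(ok, tvec.ravel())
```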
For in-the-wild images, can we use approximate camera parameters to solve for the translation? The camera intrinsics are hard to get through calibration, and some images come without any camera information.
You can, but the solved translation is only meaningful with respect to your assumed camera parameters. The results can be used for visualization but will not be metrically correct. You should not expect accurate results without camera parameters.
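For reference, a common rough guess when no calibration is available (an assumption, not something provided by this repo) is a principal point at the image centre and a focal length on the order of the image width:

```python
import numpy as np

def approx_intrinsics(img_w, img_h):
    """Rough intrinsic matrix for an uncalibrated image: principal point at
    the image centre and focal length ~ image width. Only good enough for
    visualization, not for metric reconstruction."""
    f = float(img_w)
    return np.array([[f, 0.0, img_w / 2.0],
                     [0.0, f, img_h / 2.0],
                     [0.0, 0.0, 1.0]])

K = approx_intrinsics(1920, 1080)
print(K)
```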
I used an H3.6M image and the corresponding camera R and T to infer the world coordinates, but the depth is still not correct. I don't know why; do you have any idea?
Can you be more specific? What do you mean by "not correct"? What is your goal?
My goal is to output each joint's real-world coordinates, x, y and z, where z stands for depth. For example, as you can see in the screenshot, the 3rd point stands for the right foot; its depth should be smaller than that of the 6th point (left foot), but the printed depth is 2117.768, greater than 1894.1997. You can also compare other joints' depths, like the right foot and the left hand; they are all incorrect.
I used an H3.6M image to get relative 3D coordinates with this repo, and the corresponding R, T read from the H3.6M camera.npy, to get each joint's real-world coordinates.
You are confusing the world coordinate system and the camera coordinate system.
The depth you mean is actually measured in the camera coordinate system (the z axis is perpendicular to the image plane and points toward the person).
Look at the 3D plot: do you see the (-500, 500) marks on the z axis? It grows larger toward more distant regions. That shows the relative pose in the camera coordinate system, which is CORRECT.
If you want the absolute depth, add that relative pose to the root location (note: also in the camera coordinate system); see the sketch below.
Note that the world coordinate system in this dataset does not align with the camera coordinate system. That's why you think the result is incorrect.
If you don't understand English terminology, you can reply in Chinese and I can explain in Chinese.
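A minimal sketch of that last step, assuming you already have the root's location in the camera frame (root_cam below is a placeholder, e.g. the H36M ground-truth root or the tvec from solvePnP above):

```python
import numpy as np

# Hypothetical inputs:
# rel_pose: (16, 3) root-relative pose in mm, camera coordinate system
# root_cam: (3,)    root (hip) location in mm, also in the camera frame
rel_pose = np.random.randn(16, 3) * 300.0
root_cam = np.array([0.0, 0.0, 5000.0])

# Absolute camera-frame coordinates: shift the relative pose by the root.
abs_pose_cam = rel_pose + root_cam

# The absolute depth of each joint is the z column in the camera frame.
print(abs_pose_cam[:, 2])
```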
The world coordinate system in this dataset is already aligned with the camera coordinate system. You can see it from the screenshot: world_coordinate = cameras.camera_to_world_frame(depth_array, R, T)
You did not get my point. The x, y and z axes of the world coordinate system do not align with those of the camera coordinate system. Perhaps you can plot these 6 vectors to see the difference.
This is not a bug and I'm closing this issue. You can still reply or post questions in the discussion section.
You said above that P_world = R P_Camera + t = Rt P_Camera, so with R and T we can compute the depth in the world coordinate system, but then you said they are not aligned like that. So what do you mean? How do we compute the depth in the world coordinate system?
Hi, you need to distinguish between the world coordinate system and the camera coordinate system, and think through the coordinate transformation between them. The depth you want is actually the z-coordinate of the body in the camera coordinate system. Take a look at the coordinate systems in this figure: https://learnopencv.com/geometry-of-image-formation/ Note that the z-axis of the camera coordinate system points in your viewing direction.
Now look at the three axes of the world coordinate system: is the world z-axis the viewing direction you want? Their relation to the three camera axes can be arbitrary and is determined by R and t. P_world = R P_Camera + t = Rt P_Camera does convert camera-frame coordinates to the world frame, but those are not the coordinates you want. The world coordinate axes of the H36M dataset do not align with the camera coordinate axes; that is what I meant.
To get depth in the camera coordinate system, you only need to record the camera-frame coordinates of the body's root joint. The predicted relative 3D pose is also in the camera frame, so adding it to the root gives you the camera-frame coordinates of all the joints.
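A minimal sketch of the distinction, assuming the camera-to-world convention P_world = R P_Camera + T quoted above; R and T below are placeholders, and in H36M the actual rotation is generally not the identity:

```python
import numpy as np

# Hypothetical absolute pose in the camera frame (mm): relative pose + root.
abs_pose_cam = np.random.randn(16, 3) * 300.0 + np.array([0.0, 0.0, 5000.0])

# Placeholder extrinsics; in H36M these come from the camera metadata.
R = np.eye(3)
T = np.array([1000.0, 2000.0, 1500.0])

# P_world = R @ P_cam + T converts to the world frame, but the world z-axis
# is generally not the viewing direction, so this column is not "depth".
abs_pose_world = abs_pose_cam @ R.T + T

depth = abs_pose_cam[:, 2]       # depth: z in the CAMERA frame
world_z = abs_pose_world[:, 2]   # world z: not a depth in general
print(depth[:3], world_z[:3])
```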
I printed out the z values in the camera coordinate system. For the image above, the head should have a larger z value than the legs and feet (the farther away, the larger), but the result looks like the opposite.
Can you print out the input to that 3D plot? Print the coordinates passed to the plotting function and let's take a look.
Thanks a lot. Is the unit mm?
Yes.