NVIDIA-ISAAC-ROS / isaac_ros_pose_estimation

Deep learned, NVIDIA-accelerated 3D object pose estimation
https://developer.nvidia.com/isaac-ros-gems
Apache License 2.0

Not getting good accuracy in original centerpose repo #26

Closed ArghyaChatterjee closed 1 year ago

ArghyaChatterjee commented 1 year ago

Hello,

I was trying the original CenterPose repo. Since that repo is almost inactive, I had to post my question here.

I am not getting a correct pose from the original repo that you provided. I have wrapped the original repo in ROS 2 and I am using a ZED 2 camera for this. The location values are all negative and do not correspond to the point cloud that the ZED 2 is producing, nor to the actual object position.

N.B:

  1. We had to scale the locations down to visualize them in RViz. I divide the location values inside the ret dictionary by -10 for the mug detector and by -5 for the bottle detector to get somewhat close to the location I am expecting.

  2. There is a transform between zed_left_camera_optical_frame and zed_left_camera_frame (Roll: 90, Pitch: 0, Yaw: 90) which I am feeding through a tf_listener, so that should not be an issue while publishing the transformed pose as MarkerArray messages. For your info, the ZED camera publishes the image in zed2_left_camera_optical_frame and the point cloud in zed2_left_camera_frame. (A minimal sketch of this transform step is shown after this list.)

  3. If I let the detector predict and show the result exactly as it comes out of the ret dictionary (without dividing by any scale factor like -10 or -5), the cuboid (kps_3d_cam) forms a much bigger bounding box, and the result['location'][0], result['location'][1], result['location'][2], result['quaternion_xyzw'][0], result['quaternion_xyzw'][1], result['quaternion_xyzw'][2], result['quaternion_xyzw'][3] values are very far off.

  4. It's predicting good keypoints in the camera frame, and I am taking those keypoints, joining the vertices, and publishing the cuboid lines on top of the image to visualize as a ROS 2 image message.

  5. The image from the ZED 2 is 360 pixels high and 640 pixels wide. The camera matrix from the camera info is:
     [[261.46716309   0.          317.64129639]
      [  0.          261.46716309 184.7205658 ]
      [  0.            0.           1.        ]]
     I have confirmed that the camera matrix is set correctly before the PnP.

  6. Here is how the location, projected_cuboid_on_image_plane, kps_3d_cam and kps_pnp look for the mug.
     'location': [-0.5383151494873346, 0.3168819516759661, -3.4610857721545782],
     'quaternion_xyzw': Quaternion([0.20069474, 0.14685222, 0.6839772 , 0.685807 ]),
     'projected_cuboid': array([[296.43928217, 162.00167229], [284.15555313, 195.76860971], [292.60675472, 97.6401634 ], [276.97268945, 115.82458124], [385.72230985, 163.34079112], [399.13305934, 198.5982746 ], [393.4721076 , 97.32110031],

'kps_3d_cam': array([[-0.55721415, 0.42819387, -3.62821758], [-0.32193835, -0.32383182, -4.33916303], [ 0.15960613, -0.29761269, -3.33544751], [-1.22296397, -0.35016673, -3.90619734], [-0.7414195 , -0.32394759, -2.90248181], [-0.37300881, 1.18033533, -4.35395335], [ 0.10853567, 1.20655447, -3.35023782], [-1.27403443, 1.15400043, -3.92098765], [-0.79248996, 1.18021957, -2.91727213]]),

'kps_pnp': array([[0.54559495, 0.4001076 ], [0.46582499, 0.45922601], [0.45986142, 0.54786711], [0.45969119, 0.28572123], [0.45071688, 0.32758482], [0.60706837, 0.45088982], [0.64344698, 0.53664212], [0.61655417, 0.27711902], [0.66159557, 0.31581069]])}
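For reference, point 2 above is essentially the following kind of lookup (a minimal sketch with the frame names from my setup, not the exact node I am running):

```python
import rclpy
from rclpy.node import Node
from rclpy.duration import Duration
from geometry_msgs.msg import PoseStamped
import tf2_ros
import tf2_geometry_msgs  # registers PoseStamped with tf2's Buffer.transform()


class PoseFrameConverter(Node):
    def __init__(self):
        super().__init__('pose_frame_converter')
        self.tf_buffer = tf2_ros.Buffer()
        self.tf_listener = tf2_ros.TransformListener(self.tf_buffer, self)

    def to_camera_frame(self, location, quaternion_xyzw, stamp):
        # Pose as reported by the detector, expressed in the optical frame.
        pose = PoseStamped()
        pose.header.frame_id = 'zed2_left_camera_optical_frame'
        pose.header.stamp = stamp
        pose.pose.position.x, pose.pose.position.y, pose.pose.position.z = location
        (pose.pose.orientation.x, pose.pose.orientation.y,
         pose.pose.orientation.z, pose.pose.orientation.w) = quaternion_xyzw
        # tf2 applies the fixed optical-to-camera rotation for us, so no
        # hand-written roll/pitch/yaw math is needed before publishing markers.
        return self.tf_buffer.transform(
            pose, 'zed2_left_camera_frame', timeout=Duration(seconds=0.5))
```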

I have added a video to show how it looks without dividing by any scaling factor. If you divide by -10, it shows up somewhat near the point cloud in RViz. If you divide by 100 (say, a cm-to-m conversion), it's very close.

Here is a video demonstration: https://youtu.be/fDfqkyQhPig. In RViz, the grid size is 1 m and there are 10 grids in total.

Interestingly, the object scale that the network head predicts is somewhat close to the original scale it was trained on. The ret dictionary reports the object scale like this:

'obj_scale': array([1.4041487, 0.9990581, 1.0546126])

But when I try to visualize the object as a marker, it is way too big. So I assume the error comes from there: the object's 2D keypoints are calculated correctly in the image frame, but the 3D keypoints (kps_3d_cam) computed during the PnP step are far apart and do not correspond to the actual box dimensions.
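As a rough check on whether this is just a uniform scale problem (assuming row 0 of kps_3d_cam is the centroid and rows 1-8 are the corners, which is how I read the ret dictionary), one can compare the spread of the PnP cuboid against obj_scale:

```python
import numpy as np

kps_3d_cam = np.array(ret['kps_3d_cam'])    # (9, 3) points in camera coordinates
obj_scale = np.array(ret['obj_scale'])      # relative (x, y, z) extents from the network

corners = kps_3d_cam[1:]                    # drop the centroid row
extents = corners.max(axis=0) - corners.min(axis=0)   # axis-aligned spread of the cuboid

# This is only a coarse comparison because the box is rotated in the camera frame,
# but if the three ratios come out roughly equal, the cuboid's shape is consistent
# with obj_scale and only the overall scale (and hence the translation) is off.
print('PnP cuboid extents  :', extents)
print('extents / obj_scale :', extents / obj_scale)
```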

jaiveersinghNV commented 1 year ago

We've recently made a substantial upgrade to the Isaac ROS CenterPose package as part of our latest release. Could you try again with the latest version and see if you're still experiencing the same problem?

kajananchinniahNV commented 1 year ago

Hello, here are my responses to your questions:

A scale factor is expected: CenterPose can only recover poses and bounding box dimensions up to a scale. Please read the Robot experiment section of the original paper for ideas on getting detections in metric units; the paper simply used the known height of the object as a scaling factor. We exposed a parameter called height in our previous release (renamed to cuboid_scaling_factor in this release) that you can use to scale the bounding box size. Have you confirmed that you get a similar result on a test image from the original repository?
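Something along these lines (just a sketch using the fields you printed from the ret dictionary, not our package code; the real height value, and which axis of obj_scale is the height, are assumptions you would fill in):

```python
import numpy as np

known_height_m = 0.095                       # measured height of the real mug (assumption)
obj_scale = np.array(ret['obj_scale'])       # relative (x, y, z) extents from the network
location = np.array(ret['location'])         # up-to-scale translation from PnP
kps_3d_cam = np.array(ret['kps_3d_cam'])     # up-to-scale 3D cuboid keypoints

predicted_height = obj_scale[1]              # assuming the y axis is the object's height
metric_scale = known_height_m / predicted_height

location_m = location * metric_scale         # translation in metres
kps_3d_cam_m = kps_3d_cam * metric_scale     # cuboid keypoints in metres
```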

Reviewing the documentation on OpenCV's PnP computation might help. Please ensure that the points are represented in the correct frame before applying a transformation to the camera's base frame.
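For example, a self-contained check along these lines (a sketch, not the package's internal code; detected_corners_2d stands in for your detected keypoints, and the corner ordering is an assumption that has to match your detector's convention):

```python
import cv2
import numpy as np

# Camera matrix from your ZED camera_info.
camera_matrix = np.array([[261.46716309, 0.0, 317.64129639],
                          [0.0, 261.46716309, 184.7205658],
                          [0.0, 0.0, 1.0]])
dist_coeffs = np.zeros(5)                    # rectified ZED image, so no distortion assumed

# 8 cuboid corners in the object frame, built from obj_scale (still up to a scale);
# the ordering must match the order of the detected 2D keypoints.
sx, sy, sz = np.array(ret['obj_scale']) / 2.0
object_points = np.array([[x, y, z] for x in (-sx, sx)
                                    for y in (-sy, sy)
                                    for z in (-sz, sz)], dtype=np.float64)
image_points = np.array(detected_corners_2d, dtype=np.float64)   # (8, 2) pixel coordinates

ok, rvec, tvec = cv2.solvePnP(object_points, image_points,
                              camera_matrix, dist_coeffs,
                              flags=cv2.SOLVEPNP_ITERATIVE)
# tvec comes back in the same units as object_points: if the object points are
# up to a scale, the recovered translation is off by exactly that scale.
print('translation:', tvec.ravel())
```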

Just to clarify, are both the quaternion values and the position values off? Converting the quaternion to an alternative representation (e.g. Euler angles) might be useful for assessing how far off the values are.
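For instance (SciPy is just one convenient option; the quaternion below is the one from your ret dictionary):

```python
from scipy.spatial.transform import Rotation as R

quat_xyzw = [0.20069474, 0.14685222, 0.6839772, 0.685807]   # from the ret dictionary
rpy_deg = R.from_quat(quat_xyzw).as_euler('xyz', degrees=True)
print('roll/pitch/yaw [deg]:', rpy_deg)
```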

The bounding box sizes are likewise only recovered up to a scale, so they must be rescaled as well.

It's probably also a good idea to visualize the reprojected 2D keypoints, rather than the keypoints before projection, to help debug this issue.
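A quick overlay along these lines can make the error visible (a debugging sketch, not part of the package; `image` is assumed to be the BGR camera frame, and the camera matrix values are the ones you posted):

```python
import cv2
import numpy as np

camera_matrix = np.array([[261.46716309, 0.0, 317.64129639],
                          [0.0, 261.46716309, 184.7205658],
                          [0.0, 0.0, 1.0]])
dist_coeffs = np.zeros(5)                                    # rectified image assumed

points_3d = np.array(ret['kps_3d_cam'], dtype=np.float64)    # already in camera coordinates
rvec = np.zeros(3)                                           # identity rotation
tvec = np.zeros(3)                                           # no extra translation
projected, _ = cv2.projectPoints(points_3d, rvec, tvec, camera_matrix, dist_coeffs)

# Draw the reprojected keypoints on the camera image; if they land far from the
# detected 2D keypoints, the PnP inputs (units, ordering, camera matrix) are suspect.
for u, v in projected.reshape(-1, 2):
    cv2.circle(image, (int(round(u)), int(round(v))), 4, (0, 255, 0), -1)
cv2.imwrite('reprojected_keypoints.png', image)
```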

Hope this helps and covers your issues!