Zengyi-Qin / MonoGRNet

MonoGRNet: A Geometric Reasoning Network for Monocular 3D Object Detection and Localization | KITTI

coordinates transformation #48

Open yxy1995123 opened 4 years ago

yxy1995123 commented 4 years ago

You mentioned that the transformation from local coordinates to camera coordinates involves a rotation R and a translation C. Where are R and C? I didn't see them in your code for getting the global corners: `dlogits['pred_global_corners'] = tf.reshape(tf.reshape(pred_locations, (outer_size, 3, 1)) + tf.reshape(dlogits['pred_corners'], (outer_size, 3, 8)), (outer_size, 24))`. Thanks very much!

Zengyi-Qin commented 4 years ago

Thank you for your interest!

R is not explicitly implemented. Let me try to give you a clear understanding of the coordinate systems in the following. Using a monocular image, directly regressing rotations in camera coordinates can be confusing to the network; Deep3DBox explained why this happens. If we were using LIDAR point clouds, the problem would be resolved.

We consider the rotation in local coordinates instead:

(step 1) The ground truth rotation (in camera coordinates) is converted to local coordinates. See this line for details. The transformation is quite simple: the angle phi is computed using arctan.
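A minimal sketch of this conversion (not the repo's exact code): in KITTI the camera-frame yaw is `rotation_y` and the local (observation) angle is obtained by subtracting the viewing-ray angle phi of the object center. The helper name `camera_to_local_rotation` is illustrative; check the sign convention against the referenced line.

```python
import numpy as np

def camera_to_local_rotation(rotation_y, x, z):
    """Convert a camera-frame yaw (KITTI rotation_y) to the local rotation,
    given the object center (x, z) in camera coordinates."""
    phi = np.arctan2(x, z)      # angle of the ray from the camera to the object
    alpha = rotation_y - phi    # rotation expressed in local coordinates
    return alpha
```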

(step 2) The ground truth rotation in local coordinates and the ground truth object size are then used to compute the eight corners, which we regress. The inference output of this regression is dlogits['pred_corners'] in this line. dlogits['pred_corners'] is in local coordinates, so the corners have zero mean. In the paper, we are supposed to first rotate the corners back to camera coordinates and then add a translation C (which is pred_locations). In the implementation we skip this rotation and resolve it later in step 3.
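A sketch of how zero-mean local corners can be built from the size (h, w, l) and the local rotation, and how the global corners in the quoted `pred_global_corners` line follow by adding the predicted center. The corner ordering and the helper name `local_corners` are assumptions for illustration, not the repo's exact layout.

```python
import numpy as np

def local_corners(h, w, l, alpha):
    """Eight zero-mean box corners in local coordinates, rotated about the
    vertical axis by the local angle alpha."""
    x = l / 2 * np.array([1, 1, -1, -1, 1, 1, -1, -1], dtype=float)
    y = h / 2 * np.array([1, 1, 1, 1, -1, -1, -1, -1], dtype=float)
    z = w / 2 * np.array([1, -1, -1, 1, 1, -1, -1, 1], dtype=float)
    R = np.array([[ np.cos(alpha), 0, np.sin(alpha)],
                  [ 0,             1, 0            ],
                  [-np.sin(alpha), 0, np.cos(alpha)]])
    return R @ np.vstack([x, y, z])   # shape (3, 8), zero mean by construction

# Global corners = zero-mean local corners + predicted center,
# which mirrors the pred_global_corners line quoted in the question:
# global_corners = local_corners(h, w, l, alpha) + pred_location.reshape(3, 1)
```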

(step 3) When we write the inference results to the txt file, we first compute the object size and rotation from the eight corners, since the KITTI benchmark requires submitting the size and rotation rather than the corners. The rotation we recover is in local coordinates, so we transform it back to camera coordinates in this line.
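A sketch of the reverse direction, assuming the corner ordering from the `local_corners` sketch above: recover the local yaw from one box edge, then add the viewing-ray angle back to obtain the camera-frame rotation. The sign of the arctan depends on the rotation convention, so treat this as an outline rather than the repo's implementation.

```python
import numpy as np

def corners_to_camera_rotation(corners, center_x, center_z):
    """Recover the local yaw from eight predicted corners (3 x 8 array),
    then convert it back to the camera frame for the KITTI txt file."""
    edge = corners[:, 0] - corners[:, 3]            # an edge along the box length
    alpha = np.arctan2(-edge[2], edge[0])           # local rotation from that edge
    rotation_y = alpha + np.arctan2(center_x, center_z)  # back to camera coordinates
    return rotation_y
```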

On the KITTI dataset only one angle is considered, rather than all three (roll, pitch and yaw): the object only rotates about the vertical axis. So the rotation correction between camera and local coordinates is quite simple (just computing phi), and we did not need to construct R or use matrix multiplication in the implementation. In other cases where all three angles are considered, we would have to construct R.
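For that more general case, a minimal sketch of constructing R from roll, pitch and yaw. The axis assignment and multiplication order (Ry Rx Rz) are one common convention, not necessarily the one a given dataset uses; the function name is hypothetical.

```python
import numpy as np

def rotation_matrix(roll, pitch, yaw):
    """Full 3D rotation from roll (about z), pitch (about x) and yaw (about y).
    On KITTI only the yaw about the vertical axis is needed."""
    cr, sr = np.cos(roll), np.sin(roll)
    cp, sp = np.cos(pitch), np.sin(pitch)
    cy, sy = np.cos(yaw), np.sin(yaw)
    Rz = np.array([[cr, -sr, 0], [sr, cr, 0], [0, 0, 1]])
    Rx = np.array([[1, 0, 0], [0, cp, -sp], [0, sp, cp]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    return Ry @ Rx @ Rz
```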

yxy1995123 commented 4 years ago

Thanks for your answer. So the global corners are true global corners (in camera coordinates), right? Actually, I want to project the corners onto the image plane. Does that mean I can use the global corners directly (with just the intrinsic matrix)?


Zengyi-Qin commented 4 years ago

Almost correct. The central location pred_locations is in camera coordinates, but the rotation has not been rotated back to camera coordinates. If you only want to visualize the results, that is fine; the difference is really small. But if you want to use the projection for some precise calculation, I am afraid it would have an influence.
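For the visualization case, a minimal sketch of projecting the 3 x 8 global corners onto the image plane with KITTI's 3 x 4 projection matrix P2 from the calibration file (assumed here; the helper name is illustrative). As noted above, the predicted corners' rotation is only approximately in the camera frame, so this is suitable for visualization rather than precise measurement.

```python
import numpy as np

def project_to_image(global_corners, P2):
    """Project 3 x 8 corners in camera coordinates to 2 x 8 pixel coordinates."""
    pts = np.vstack([global_corners, np.ones((1, global_corners.shape[1]))])  # homogeneous (4, 8)
    uvw = P2 @ pts                   # (3, 8) projected homogeneous coordinates
    uv = uvw[:2] / uvw[2:3]          # divide by depth to get pixel coordinates
    return uv
```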