EGO4D / episodic-memory


VQ 3D and 2D annotation consistency? #20

Open yunhan-zhao opened 2 years ago

yunhan-zhao commented 2 years ago

Hi,

I have a question regarding the consistency between the VQ2D annotations and the 3D annotations. I tried pulling out one 3D annotation (using its center) and projecting it down onto frames in the GT response track. Specifically, I followed the code here to first convert the 3D centroid into the frame's camera coordinates, and then inverted the operation here to project it into 2D image space. Comparing the projected 2D centroid with the GT bounding box, I found they are not even close on the 2D image plane.
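Concretely, my check looks roughly like the sketch below; the names centroid_world, pose_w2c, f, cx, cy, and gt_box are placeholders of mine (assuming a 4x4 world-to-camera pose and pinhole intrinsics), not the actual variables in the repo.

import numpy as np

# Rough sketch of the consistency check described above (placeholder names, not the repo code).
def project_centroid(centroid_world, pose_w2c, f, cx, cy):
    # 3D annotation centroid in homogeneous coordinates.
    p = np.append(np.asarray(centroid_world, dtype=float), 1.0)
    # World coordinates -> camera coordinates.
    p_cam = pose_w2c @ p
    p_cam = p_cam[:3] / p_cam[3]
    # Pinhole projection onto the image plane.
    u = f * p_cam[0] / p_cam[2] + cx
    v = f * p_cam[1] / p_cam[2] + cy
    return u, v

def centroid_inside_gt_box(uv, gt_box):
    # gt_box is assumed to be (x1, y1, x2, y2) in pixels.
    u, v = uv
    x1, y1, x2, y2 = gt_box
    return x1 <= u <= x2 and y1 <= v <= y2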

I have checked the accuracy of the poses by rendering images with them, and the results look visually fine (by the way, I get blank RGB images with the visualization code, but I can render depth maps). Since I'm using the GT 2D results, the only possible error source I can find here is the camera pose. How should I interpret this result, or have I done something wrong? Should I assume the camera pose is simply not accurate enough?

To replicate: clip_uid "1eb995af-0fdd-4f0f-a5ba-089d4b8cb445". I end up with 3 valid queries in the VQ3D val set; I have tried them all and none of them comes close.

This is one of the examples I have; the projected 2D centroid is not even inside the frame (see the attached screenshot).

vincentcartillier commented 2 years ago

To visualize the 3D annotations you should construct the 3D bounding boxes and visualize them with 3D mesh software (MeshLab, for instance). To build the 3D annotations and export them to .off files you can use the following.
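A minimal sketch of such an export could look like this (assuming the box is given by a 3D center, per-axis sizes, and an optional 3x3 rotation; the actual annotation fields may differ):

import numpy as np

# Write one annotation box to an .off file that MeshLab can open.
def box_to_off(center, sizes, path, R=np.eye(3)):
    center = np.asarray(center, dtype=float)
    half = np.asarray(sizes, dtype=float) / 2.0
    # 8 corners of the box in its local frame, then rotated and translated.
    signs = np.array([[x, y, z] for x in (-1, 1) for y in (-1, 1) for z in (-1, 1)])
    corners = (signs * half) @ R.T + center
    # 6 quad faces indexing into the corner list above.
    faces = [
        (0, 1, 3, 2), (4, 6, 7, 5), (0, 2, 6, 4),
        (1, 5, 7, 3), (0, 4, 5, 1), (2, 3, 7, 6),
    ]
    with open(path, "w") as fp:
        fp.write("OFF\n")
        fp.write(f"{len(corners)} {len(faces)} 0\n")
        for v in corners:
            fp.write(f"{v[0]} {v[1]} {v[2]}\n")
        for face in faces:
            fp.write("4 " + " ".join(str(i) for i in face) + "\n")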

For the query you are looking at, it looks like this:

For the projection of the annotation center you might need to adjust the axis convention. Here is the code snippet I used:

import numpy as np

# Camera intrinsics: f is the focal length, (cx, cy) the principal point.
K = np.array([[f, 0, cx], [0, f, cy], [0, 0, 1]])
# 4x4 camera pose for the response-track frame rf_fno, applied to world-frame points.
pose = all_poses[rf_fno]
# 3D annotation centroid in homogeneous coordinates.
t = np.ones(4)
t[:3] = box.center
# Transform the centroid into the camera frame.
t_cam = np.matmul(pose, t)
t_cam = t_cam / t_cam[3]
t_cam = t_cam[:3]
# Project with the intrinsics.
t_frame = np.matmul(K, t_cam)
# Pixel coordinates; note the sign flip on x, which is the axis adjustment mentioned above.
t_px = [
    -int(t_frame[0] / t_frame[2]),
    int(t_frame[1] / t_frame[2])
]

This projects the center at the red marker here.
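If you want to reproduce this kind of overlay, something along these lines works (OpenCV is used here purely as an example, and the frame path is a placeholder):

import cv2

# Draw the projected centroid on the response-track frame and save it for inspection.
frame = cv2.imread("frame_0000161.png")
cv2.drawMarker(frame, tuple(t_px), (0, 0, 255),
               markerType=cv2.MARKER_CROSS, markerSize=20, thickness=2)
cv2.imwrite("projection_check.png", frame)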

For reference, here is the camera pose estimation visualization for this frame (left is from the video, middle is the reprojection from the scan, right is the superposition):

render_0000161

For reproducibility, here is the query info:

yunhan-zhao commented 2 years ago

Thanks for the instructions! I have tried the code snippet you shared, but somehow the result is still not close, as shown in the attached screenshot.


So, for this very specific case, could you help me check whether the following values match yours?

intrinsic matrix: [[641.14, 0, 720.0], [0, 641.14, 540.0], [0, 0, 1]]
frame index: 161 (should be the last frame in the GT response track)
box center 1: [-0.9356525938744581, 2.791068629224036, 0.8609450489891934]
corresponding pose for this frame:
[[ 0.78528216 -0.5185672  -0.33826023 -1.28948307]
 [-0.60689388 -0.75283253 -0.25479993  2.38282685]
 [-0.12252242  0.40537791 -0.90590121  1.45194995]
 [ 0.          0.          0.          1.        ]]

Additionally, I can run the headless renderer, but it doesn't render RGB images. For any mesh with the pose, I get the following. Do you have any idea what could be wrong with the rendering step?

render_0000153 (blank RGB render), but I can render depth: render_0000161