Regarding coordinate systems for pose representation

raoshashank commented 10 months ago

Thanks for the implementation! I was trying to run this on my dataset and wanted to know if the coordinates specified in the yaml file for the initial guess (object3d.position, object3d.rotation) are specified in the camera coordinate frame or the world frame? iirc, mtx in the render_texture_batch function is supposed to be the view matrix, however when this function is called, the matrix passed as input is mtx_gu = torch.bmm(self.view_mtx.unsqueeze(0).repeat(self.batchsize,1,1),matrix_batch_44_from_position_quat( p=result["trans"], q=result["quat"] )), Which seems to only accommodate the object pose relative to the world frame and doesn't include a camera view matrix.

Thanks!

TontonTremblay commented 10 months ago

Yeah normally you would have 'object -> world -> camera'. I skipped the 'object -> world'. And you just have 'object -> camera'. Sorry about this, but it should not be hard to add an extra step. For the application I was working with, e.g., single camera pose estimation, you would not see 'object -> world', you have everything expressed in the camera frame.

These are express in the camera frame : https://github.com/NVlabs/diff-dope/blob/main/diffdope/diffdope.py#L936-L955

You could have it express in world coordinate frame. and in the camera add the world2camera.

then in the rendering add the transform there: https://github.com/NVlabs/diff-dope/blob/main/diffdope/diffdope.py#L156-L168 but this wont scale well if you have multiple objects. https://github.com/NVlabs/diff-dope/blob/main/diffdope/diffdope.py#L920-L933 this returns the pos. of the object. Add a matrix multiplication there to go into world frame there. Then you could deal with multiple object. Now you have an other interesting problem. What do you optimize for? camera poses or object poses. Or both. Anyway I think you could do all of them, but be careful how you set things up.

Now that I think about it, the implementation the way I did it might be a little arkward to play with for your case, I think I should have done transforms like we did in nvisii, much cleaner implementation than this little mess now we have, I might revisit this if I have time.

raoshashank commented 10 months ago

Yes, I added in the world->camera and the object->world transformation matrices. For my particular use case, I'm trying to estimate the pose with a static camera with a known pose, thus I'm trying to find the transform T:object->world. I'm not sure if my implementation is correct, but here are the summary of my changes, please let me know if it sounds right:

in render_texture_batch: a. final_mtx_proj = torch.matmul(proj_cam,torch.matmul(cam_pose_world,obj_pose_world)) b. depth = dd.xfm_points(gb_pos.contiguous(), object_pose_world*camera_pose_world)
Provide the initial coarse estimate in world coordinates

raoshashank commented 10 months ago

I also had the following questions: a. The coordinates here are in the coordinate system local to the object (body frame)? b. For context, My dataset is a set of images of object clutters taken in pybullet, the depth image in pybullet uses the non-linear z-buffer directly from OpenGL. If I understand the code correctly, the raw depth values from the interpolate function are in the object frame and so to compare it with the pybullet depth image,I need to transform the interpolated depth values to the camera frame and the pybullet depth values to the linear scale? c. can you clarify why the opencv_2_opengl function is required?

TontonTremblay commented 10 months ago

I am not a 100% sure what is going on tbh, moving around transforms can be a little bit of a mind game.

object_pose_world*camera_pose_world I think this should be a dot product, replace with @ or torch.matmul(). And it should be torch.matmul(camerapose_world, object_pose_world).

The easiest way to do it I think would be to express everything in the camera frame from pybullet. You should not have any differences between the coordinate frame as nvdiffrast uses opengl as well. So you should be able to plug things directly from pybullet into this. Also matmul is somewhat expensive, so you want to minimize the number of calls, specially if you want to take the gradients out.

Otherwise if you want to continue the way you are doing it. Use open3d to slowly debug your coordinate frames. load the point cloud apply your different transforms to it and see where you end up. I would say add one at the time. Start with just the object, the object in world, add the camera in world, then look if things are aligned. Then see if you can express the same scene but as a function of the camera, final_mtx_proj (if you remove proj) should give you everything in the camera frame.

raoshashank commented 9 months ago

I see. I have also noticed that the l1_mask loss doesn't contribute to the gradients of the object3d parameters directly (likely due to the line mask = ddope.renders["rgb"] > 0 which is a non-differentiable function), apart from increasing or decreasing the loss based on the mask. Can you share why this mask loss is required then?