NVlabs / diff-dope

Pose estimation refiner

Regarding coordinate systems for pose representation #1

Closed raoshashank closed 9 months ago

raoshashank commented 10 months ago

Thanks for the implementation! I was trying to run this on my dataset and wanted to know whether the coordinates specified in the yaml file for the initial guess (`object3d.position`, `object3d.rotation`) are given in the camera coordinate frame or in the world frame. IIRC, `mtx` in the `render_texture_batch` function is supposed to be the view matrix; however, when this function is called, the matrix passed in is

`mtx_gu = torch.bmm(self.view_mtx.unsqueeze(0).repeat(self.batchsize, 1, 1), matrix_batch_44_from_position_quat(p=result["trans"], q=result["quat"]))`

which seems to only accommodate the object pose relative to the world frame and doesn't include a camera view matrix.

Thanks!

TontonTremblay commented 10 months ago

Yeah, normally you would have 'object -> world -> camera'. I skipped the 'object -> world' step, so you just have 'object -> camera'. Sorry about this, but it should not be hard to add the extra step. For the application I was working with, e.g., single-camera pose estimation, you would not need 'object -> world'; everything is expressed in the camera frame.

These are expressed in the camera frame: https://github.com/NVlabs/diff-dope/blob/main/diffdope/diffdope.py#L936-L955

You could have them expressed in the world coordinate frame, and then on the camera side add the world2camera transform.

Then in the rendering you could add the transform there: https://github.com/NVlabs/diff-dope/blob/main/diffdope/diffdope.py#L156-L168, but this won't scale well if you have multiple objects. https://github.com/NVlabs/diff-dope/blob/main/diffdope/diffdope.py#L920-L933 returns the pose of the object; add a matrix multiplication there to go into the world frame, and then you could deal with multiple objects. Now you have another interesting problem: what do you optimize for, camera poses or object poses, or both? Anyway, I think you could do all of them, but be careful how you set things up.
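Rough sketch of what I mean (untested; `self.world2cam_mtx` is a hypothetical new attribute you would add and load yourself, the rest follows the snippet you quoted):

```python
import torch

# Treat the optimized (trans, quat) as the object -> world pose, then fold in a
# fixed world -> camera matrix before the existing view matrix is applied.
obj2world = matrix_batch_44_from_position_quat(p=result["trans"], q=result["quat"])  # (B, 4, 4)
world2cam = self.world2cam_mtx.unsqueeze(0).repeat(self.batchsize, 1, 1)             # (B, 4, 4), hypothetical attribute
obj2cam = torch.bmm(world2cam, obj2world)

mtx_gu = torch.bmm(
    self.view_mtx.unsqueeze(0).repeat(self.batchsize, 1, 1),
    obj2cam,
)
```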

Now that I think about it, the way I implemented this might be a little awkward to play with for your case. I think I should have handled transforms the way we did in nvisii; that would be a much cleaner implementation than the little mess we have now. I might revisit this if I have time.

raoshashank commented 10 months ago

Yes, I added the world->camera and object->world transformation matrices. For my particular use case, I'm trying to estimate the pose with a static camera with a known pose, so I'm trying to find the transform T: object->world. I'm not sure if my implementation is correct, but here is a summary of my changes; please let me know if it sounds right:

raoshashank commented 10 months ago

I also had the following questions:

a. Are the coordinates here in the coordinate system local to the object (body frame)?

b. For context, my dataset is a set of images of object clutter rendered in pybullet, and the depth image in pybullet uses the non-linear z-buffer directly from OpenGL. If I understand the code correctly, the raw depth values from the interpolate function are in the object frame, so to compare them with the pybullet depth image I need to transform the interpolated depth values to the camera frame and convert the pybullet depth values to a linear scale? (The linearization I have in mind is sketched below.)

c. Can you clarify why the opencv_2_opengl function is required?
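(For reference, this is only a sketch of the linearization in b, assuming the standard OpenGL projection and the same `near`/`far` values used for the pybullet camera:)

```python
import numpy as np

def pybullet_depth_to_metric(depth_buffer, near, far):
    """Convert the non-linear OpenGL depth buffer returned by pybullet's
    getCameraImage (values in [0, 1]) to metric depth along the camera z-axis."""
    d = np.asarray(depth_buffer)
    return far * near / (far - (far - near) * d)
```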

TontonTremblay commented 10 months ago

I am not 100% sure what is going on, to be honest; moving transforms around can be a bit of a mind game.

`object_pose_world*camera_pose_world`: I think this should be a matrix product, so replace `*` with `@` or `torch.matmul()`. And the order should be `torch.matmul(camera_pose_world, object_pose_world)`.

The easiest way to do it, I think, would be to express everything in the camera frame directly from pybullet. You should not have any coordinate-frame differences, since nvdiffrast uses OpenGL conventions as well, so you should be able to plug things from pybullet straight into this. Also, matmul is somewhat expensive, so you want to minimize the number of calls, especially if you want to take gradients through them.
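Something like this is what I mean by taking the pose directly in the camera frame from pybullet (untested sketch; `cam_eye`, `cam_target`, `cam_up`, and `body_id` are placeholders for whatever you use in your scene):

```python
import numpy as np
import pybullet as p

# world -> camera, OpenGL convention (pybullet returns the 16 values column-major)
view = np.asarray(p.computeViewMatrix(cam_eye, cam_target, cam_up)).reshape(4, 4).T

# object -> world straight from the simulator
pos, quat = p.getBasePositionAndOrientation(body_id)
obj2world = np.eye(4)
obj2world[:3, :3] = np.asarray(p.getMatrixFromQuaternion(quat)).reshape(3, 3)
obj2world[:3, 3] = pos

# object pose expressed directly in the camera frame, usable as the initial guess
obj2cam = view @ obj2world
```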

Otherwise, if you want to continue the way you are doing it, use open3d to slowly debug your coordinate frames: load the point cloud, apply your different transforms to it, and see where you end up. I would add one transform at a time: start with just the object, then the object in the world, then add the camera in the world, and check whether things are aligned. Then see if you can express the same scene as a function of the camera; `final_mtx_proj` (if you remove the projection) should give you everything in the camera frame.
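For the open3d debugging, something along these lines (rough sketch; `object.ply`, `obj2world`, and `world2cam` are placeholders for your own mesh and 4x4 matrices):

```python
import numpy as np
import open3d as o3d

# Load the object point cloud and push it through each transform one at a time,
# drawing coordinate frames so you can see where everything ends up.
pcd = o3d.io.read_point_cloud("object.ply")
pcd.transform(obj2world)                                   # object -> world

world_frame = o3d.geometry.TriangleMesh.create_coordinate_frame(size=0.1)
cam_frame = o3d.geometry.TriangleMesh.create_coordinate_frame(size=0.1)
cam_frame.transform(np.linalg.inv(world2cam))              # camera pose drawn in the world

o3d.visualization.draw_geometries([pcd, world_frame, cam_frame])
```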

raoshashank commented 9 months ago

I see. I have also noticed that the l1_mask loss doesn't contribute to the gradients of the object3d parameters directly (likely due to the line `mask = ddope.renders["rgb"] > 0`, which is a non-differentiable operation), apart from scaling the loss up or down based on the mask. Can you share why this mask loss is needed, then?

TontonTremblay commented 9 months ago

The version I released is a rewrite of the code I used in the paper; I tried to clean it up. I think you might have found a bug, hmm. I am going on vacation this weekend, so I don't think I will have time to go through what this means.

raoshashank commented 9 months ago

Do let me know when you get back, thanks!

TontonTremblay commented 9 months ago

Quickly looking at the torch documentation, you can take the gradient through this operation. I am just making a mask of what was rendered. You don't need this loss; you can remove it if you don't want it. It just compares the observed mask against a mask generated from the rendered RGB image.

TontonTremblay commented 9 months ago

https://pytorch.org/docs/stable/generated/torch.where.html#torch-where I believe this is the function that gets called

raoshashank commented 9 months ago

Yes, torch.where is differentiable, but `mask = ddope.renders["rgb"] > 0` is not. The reason I ask is that the mask seems to be a much less noisy training signal (and thus I want to use it); however, when I set the weights of the rgb and depth losses to 0, the gradients of the object3d parameters are 0 (for the sample example provided in the repo). I suspected this was due to the greater-than operation, which is not differentiable. For example:

```python
import torch

a = torch.randn((10,)).requires_grad_(True)
b = a > 0            # the comparison returns a bool tensor with no grad_fn
print(b.requires_grad)
```

would give False
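(A workaround I was considering is a soft mask, e.g. a sigmoid over the rendered intensity instead of a hard threshold; just a sketch, with random tensors standing in for the render and an arbitrary sharpness constant:)

```python
import torch

rgb = torch.rand(1, 480, 640, 3, requires_grad=True)          # stand-in for ddope.renders["rgb"]

hard_mask = (rgb.mean(dim=-1) > 0).float()                     # bool comparison: no grad_fn, blocks gradients
soft_mask = torch.sigmoid(50.0 * (rgb.mean(dim=-1) - 1e-3))    # smooth approximation: gradients flow

print(hard_mask.requires_grad, soft_mask.requires_grad)        # False True
```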

TontonTremblay commented 9 months ago

I might be a little confused; I can run this:

```python
import torch

# Create a tensor (assuming it depends on some variable x)
x = torch.tensor([1.0, 2.0, -0.5], requires_grad=True)
tensor = x.clone()  # clone so the original variable is not modified in place

# Apply the condition
condition = tensor > 0
tensor[condition] = 1

# Define some scalar function that uses the tensor
loss = tensor.sum()

# Compute the gradient
grad_x = torch.autograd.grad(loss, x, create_graph=True)[0]

print(grad_x)  # [0., 0., 1.]: the entries overwritten by the constant get zero gradient
```

Anyway, something we can do is: https://github.com/NVlabs/diff-dope/blob/b73ecd2221b80fe14e496bd9087095b6b9cc60bb/diffdope/diffdope.py#L829 change this texture to a white texture, and don't do the `>` in the masking.
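To illustrate why the white-texture route keeps gradients (toy sketch, random tensors standing in for the actual render and the observed mask):

```python
import torch

# If the texture is all ones, the rendered image is itself a soft silhouette,
# so an L1 against the observed mask stays differentiable with respect to the pose.
rendered_white = torch.rand(1, 480, 640, requires_grad=True)   # stand-in for a render with a white texture
gt_mask = (torch.rand(1, 480, 640) > 0.5).float()              # stand-in for the observed segmentation mask

loss_mask = torch.abs(rendered_white - gt_mask).mean()
loss_mask.backward()
print(rendered_white.grad.abs().sum())                         # non-zero: gradients flow
```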

TontonTremblay commented 9 months ago

Although I believe you when you say it is not working.

TontonTremblay commented 9 months ago

OK, I pushed a fix.

https://imgur.com/a/ffbQbOt <- the example using only the l1_mask. Thank you so much for looking into this.

I ended up going down a rabbit hole to finally add rendering the mask. If you pull, it should work for your case; you do not need to add anything. I added a mask to the rendering outputs.

raoshashank commented 9 months ago

Yup, this makes more sense. I think this is why there is a separate silhouette renderer in pytorch3d as well, since just extracting the mask from the rgb/depth wouldn't work. Thank you so much for the quick responses! I will close this issue for now.

TontonTremblay commented 9 months ago

I was fiddling with a sigmoid to make it work, but I could not. Anyway, I am glad you picked up on this, thank you so much.