NVIDIAGameWorks / kaolin-wisp

NVIDIA Kaolin Wisp is a PyTorch library powered by NVIDIA Kaolin Core to work with neural fields (including NeRFs, NGLOD, instant-ngp and VQAD).

Trying extrinsics optimization on a grid-based NeRF #142

Open LvisRoot opened 1 year ago

LvisRoot commented 1 year ago

Hi there. First of all, thank you for open sourcing this super useful repo.

I wanted to do pose optimization within a wisp pipeline, leveraging the kaolin.Camera class, which is differentiable OOTB. I created a pipeline that transforms the rays on each training step with the updated extrinsics, but the gradients were not propagating properly to the extrinsics parameters.

After some debugging, I found that when using a hash grid the CUDA backward implementation of interpolate only computes the gradients for the codebook parameters. https://github.com/NVIDIAGameWorks/kaolin-wisp/blob/cb47e10f376e5ac8b6965c650d8a6b85b9bc968e/wisp/csrc/ops/hashgrid_interpolate.cpp#L96

Would it be possible to add the gradient computation for the coordinates as well? It would be a great enhancement, making codebook-based pipelines fully differentiable all the way up to the camera poses.
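For context, the setup I'm describing is roughly the following: a learnable pose correction applied to the rays before tracing, with a toy loss standing in for the actual rendering. This is a minimal, self-contained sketch; `PoseDelta` and all names here are illustrative, not the wisp or kaolin API:

```python
import torch

class PoseDelta(torch.nn.Module):
    """Learnable pose correction applied to rays (illustrative only)."""
    def __init__(self):
        super().__init__()
        # initialized near (not exactly at) zero: the norm gradient is ill-defined at 0
        self.rot = torch.nn.Parameter(1e-3 * torch.randn(3))  # axis-angle correction
        self.trans = torch.nn.Parameter(torch.zeros(3))       # translation correction

    def rotation_matrix(self):
        # Rodrigues' formula: exponential of the skew-symmetric form of self.rot
        theta = self.rot.norm()
        k = self.rot / theta
        K = torch.zeros(3, 3)
        K[0, 1], K[0, 2] = -k[2], k[1]
        K[1, 0], K[1, 2] = k[2], -k[0]
        K[2, 0], K[2, 1] = -k[1], k[0]
        return torch.eye(3) + torch.sin(theta) * K + (1 - torch.cos(theta)) * (K @ K)

    def forward(self, origins, dirs):
        R = self.rotation_matrix()
        return origins @ R.T + self.trans, dirs @ R.T

delta = PoseDelta()
origins = torch.randn(8, 3)
dirs = torch.nn.functional.normalize(torch.randn(8, 3), dim=-1)
o, d = delta(origins, dirs)
loss = (o - origins).square().mean() + (d - dirs).square().mean()  # toy stand-in for the photometric loss
loss.backward()
print(delta.rot.grad)  # should be non-None and finite: gradients reach the pose parameters
```

With a hash grid backend, the gradient chain breaks between the interpolated features and the transformed coordinates, so `delta.rot.grad` stays zero even though everything above is differentiable.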

LvisRoot commented 1 year ago

After taking a look at the current main branch, I found out that there's a recent CUDA implementation of the coordinate gradient computation.

I modified the backward function so it returns grad_coord, and I now get non-zero gradients for the camera extrinsics. I ran a few experiments on the Replica dataset, adding some noise to the camera poses to check whether pose optimization works, trying something similar to BARF by adding LOD annealing and trying different learning rates for the poses: https://github.com/chenhsuanlin/bundle-adjusting-NeRF
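By LOD annealing I mean the coarse-to-fine schedule from BARF, where each frequency/LOD band is gated in gradually as training progresses. A small standalone sketch of the weights (my own rewrite of the schedule, not wisp code):

```python
import math

def barf_weights(progress, num_bands):
    """Coarse-to-fine weights from BARF (Lin et al. 2021).
    progress in [0, 1] ramps alpha from 0 to num_bands; band k stays
    fully off until alpha reaches k, then eases in over one unit."""
    alpha = progress * num_bands
    weights = []
    for k in range(num_bands):
        x = min(max(alpha - k, 0.0), 1.0)  # clamp alpha - k to [0, 1]
        weights.append((1.0 - math.cos(x * math.pi)) / 2.0)
    return weights

print(barf_weights(0.0, 4))  # [0.0, 0.0, 0.0, 0.0] - all bands gated off
print(barf_weights(0.5, 4))  # low bands fully on, high bands still gated
print(barf_weights(1.0, 4))  # all bands fully on
```

The returned weights scale the per-band features before they enter the decoder, so early training only sees coarse structure and the pose gradients are smoother.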

However, so far I only get even blurrier renders when I optimize the extrinsics.

Has anyone already tried something similar, and would you have suggestions on what I could be missing in the optimization process?

Also, I'm not too familiar with kaolin.Camera's optimization-friendly matrix_6dof_rotation representation, and I'm tempted to switch to an se3 representation instead, since that is what I've always used in SLAM systems so far. Do you have any insights into which representation might work better?

Thanks in advance :)

orperel commented 1 year ago

Hi @LvisRoot !

The 6DoF representation is from Zhou et al. 2019 (https://arxiv.org/abs/1812.07035). The SE3 representation is less suitable for differentiation (you have to force the matrix to be orthogonal, but that isn't guaranteed during optimization - see the paper above).

The kaolin docs further elaborate on the difference:

class _MatrixSE3Rep(ExtrinsicsRep):
    """
    4x4 matrix form of rigid transformations from SE(3), the special Euclidean group.
    Uses the identity mapping from representation space to transformation space,
    and thus simple and quick for non-differentiable camera operations.
    However, without additional constraints, the over-parameterized nature of this representation
    makes it unsuitable for optimization (e.g: transformations are not guaranteed to remain in SE(3)
    during backpropagation).
    """

whereas:

class _Matrix6DofRotationRep(ExtrinsicsRep):
    """ A representation space which supports differentiability in the space of rigid transformations.
    That is, the view-matrix is guaranteed to represent a valid rigid transformation.
    Under the hood, this representation keeps 6 DoF for rotation, and 3 additional ones for translation.
    For conversion to view-matrix form, a single Gram–Schmidt step is required.
    See: On the Continuity of Rotation Representations in Neural Networks, Zhou et al. 2019
    """

You can modify the following example in kaolin to see the difference between the two: https://github.com/NVIDIAGameWorks/kaolin/blob/master/examples/recipes/camera/cameras_differentiable.py

The backend representation can be picked with switch_backend: https://kaolin.readthedocs.io/en/latest/modules/kaolin.render.camera.camera_extrinsics.html#kaolin.render.camera.CameraExtrinsics.switch_backend

orperel commented 1 year ago

I'd actually suspect the hashgrid interpolation by coords.

First thing we can do is validate if there is a potential bug here:

  1. You can verify this by temporarily trying a different grid representation (e.g. OctreeGrid) - the other grids in wisp use a different trilinear interpolation logic
  2. Some users have suggested tinycudann (it's easily compatible with wisp); see the thread here: https://github.com/NVIDIAGameWorks/kaolin-wisp/issues/41

LvisRoot commented 1 year ago

Hi @orperel , thanks for your answer!

I saw the reference to the Matrix6DofRotation paper in kaolin's documentation, but I haven't had the time to look at it in depth, so I fell back to using Lie-algebra se(3) in the tangent space (not the SE(3) group), which I know is suitable for optimization as well. I transform it to SE(3) to rotate the rays before tracing.
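For clarity, by tangent-space se(3) I mean keeping the pose as a 6-vector and mapping it to an SE(3) matrix through the exponential map, which by construction always yields a valid rigid transformation. A minimal sketch of that mapping (my own code, using the standard closed form):

```python
import math
import torch

def skew(w):
    """3x3 skew-symmetric matrix of w, so that skew(w) @ x == cross(w, x)."""
    wx, wy, wz = w.tolist()
    return torch.tensor([[0.0, -wz,  wy],
                         [ wz, 0.0, -wx],
                         [-wy,  wx, 0.0]])

def se3_exp(xi):
    """Exponential map se(3) -> SE(3) as a 4x4 matrix.
    xi = (v, w): translation part v, rotation part w (axis-angle)."""
    v, w = xi[:3], xi[3:]
    theta = torch.linalg.norm(w).item()
    W, I = skew(w), torch.eye(3)
    if theta < 1e-8:
        R, V = I + W, I + 0.5 * W  # first-order terms near zero
    else:
        A = math.sin(theta) / theta
        B = (1 - math.cos(theta)) / theta**2
        C = (theta - math.sin(theta)) / theta**3
        R = I + A * W + B * (W @ W)
        V = I + B * W + C * (W @ W)
    T = torch.eye(4)
    T[:3, :3] = R
    T[:3, 3] = V @ v
    return T

T = se3_exp(torch.tensor([0.1, -0.2, 0.3, 0.05, 0.02, -0.01]))
print(T[:3, :3] @ T[:3, :3].T)  # ~ identity: the rotation stays in SO(3)
```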

I've been trying the following grids lately:

I ran a bunch of experiments changing the pose noise strength, lr, and extrinsic_opt_lr. The only thing that worked for me was:

With that pose noise the representation gets pretty cloudy and noisy, but cleans up a lot when pose opt is on.

With hash grids I spent even more time tuning parameters, and I always ended up with poses diverging or moving into weird positions, and a super cloudy or overly smoothed-out representation.

Do you have any insight into why this could be?

I haven't tested with Matrix6DofRotationRep since finding the first issues; I might do it in the upcoming days to avoid my custom se3 pipeline.

The only thing I found regarding pose estimation with hash grids that works is a comment in the Instant-NGP repo, where some details about how it's tackled are explained: https://github.com/NVlabs/instant-ngp/issues/69#issuecomment-1018345113

AFAIU, it's not plain gradient propagation for the rotation: they take the cross product of the ray direction with its gradient and use that to rotate the extrinsics orientation. https://github.com/NVlabs/instant-ngp/blob/00754afc1fbb933c6cefc020f6c4efbb4e1c9a1b/src/testbed_nerf.cu#L1765-L1776

So that's different from the method I'm currently using. I wonder if adding this would have a big impact on the results.
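As far as I understand it, the trick relies on the identity that for an infinitesimal rotation, a direction d perturbs as d + w x d, so the gradient of the loss with respect to the rotation vector is d x g, where g is the incoming gradient on the direction. A quick finite-difference check of that identity (my own sketch, not the instant-ngp code):

```python
import math
import torch

def rodrigues(w):
    """Rotation matrix exp([w]_x) via Rodrigues' formula."""
    theta = w.norm().item()
    if theta < 1e-12:
        return torch.eye(3, dtype=w.dtype)
    kx, ky, kz = (w / theta).tolist()
    K = torch.tensor([[0.0, -kz,  ky],
                      [ kz, 0.0, -kx],
                      [-ky,  kx, 0.0]], dtype=w.dtype)
    return torch.eye(3, dtype=w.dtype) + math.sin(theta) * K + (1 - math.cos(theta)) * (K @ K)

torch.manual_seed(0)
d = torch.nn.functional.normalize(torch.randn(3, dtype=torch.float64), dim=0)  # ray direction
g = torch.randn(3, dtype=torch.float64)   # pretend dL/d(direction) from backprop
analytic = torch.cross(d, g, dim=0)       # claimed rotation gradient: d x g

eps = 1e-6
numeric = torch.zeros(3, dtype=torch.float64)
for i in range(3):
    w = torch.zeros(3, dtype=torch.float64)
    w[i] = eps
    # L(w) = <g, exp([w]_x) d>, forward difference around w = 0
    numeric[i] = (g @ (rodrigues(w) @ d) - g @ d) / eps

print(torch.allclose(analytic, numeric, atol=1e-5))  # True
```

So accumulating d x g over the rays of a view gives the rotation update directly, without differentiating through the interpolation by coordinates.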

LvisRoot commented 1 year ago

I ran some more experiments using the triplanar grid with Kaolin extrinsics in Matrix6DofRotationRep only. In this case pose optimization works just as well as with se3 vectors for the extrinsics representation.

Here are some low-res renders without and with pose optimization, trained for 150 epochs (it usually takes ~400 epochs to get a good PSNR):

https://user-images.githubusercontent.com/15590898/233664165-c631598d-4a89-46c1-940f-ee589aef8229.mp4

https://user-images.githubusercontent.com/15590898/233664201-e437cc31-5454-466f-930b-b807fafe39bc.mp4

So in my case, for the data I'm using (the Replica dataset), the issue is in the hash grids.

I'll keep going with the triplanar grids for now, but it would be great if someone was able to make pose refinement work with hash grids and could share some tips on how to run it. I'm still interested in using hash grids as the underlying representation.

orperel commented 1 year ago

@LvisRoot Seems like the backward function is indeed bugged; the gradient wasn't returned to the py side 🤷

I've started a quick PR to fix this: https://github.com/NVIDIAGameWorks/kaolin-wisp/pull/145

I still need to test it more before we can merge it, but you're welcome to give it a try meanwhile (don't forget to run python setup.py develop to rebuild the kernel).

LvisRoot commented 1 year ago

Hi @orperel , Thanks for following up on this.

You're right, the gradients for the coordinates were not returned by the CUDA backend. For my experiments I had changed this from the beginning to test this out (but did not open a PR or anything):

I modify the backward function so it would return the grad_coord and now get non-zero gradients for the camera extrinsics.

However, that didn't fix the pose-opt issue. Moreover, tinycudann has always returned the coordinate gradients, yet it didn't work for me either.

That's why I'm thinking there's an underlying issue with using the plain coordinate gradients of hash grids for pose optimization :thinking:. It would be great to have some insight into why.
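In the meantime, one way to sanity-check coordinate gradients in isolation is torch.autograd.gradcheck against a toy interpolation, here using PyTorch's grid_sample as a stand-in for the grid backend (this validates the math of interpolation-by-coords, not the wisp CUDA kernel itself):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
# A tiny 2D feature grid standing in for one hash-grid level: 4 channels, 8x8 cells.
feats = torch.randn(1, 4, 8, 8, dtype=torch.float64)

def interpolate(coords):
    # coords in [-1, 1]^2, shape (N, 2) -> bilinearly interpolated features (N, 4)
    g = coords.view(1, 1, -1, 2)
    out = F.grid_sample(feats, g, mode='bilinear', align_corners=True)
    return out.view(4, -1).t()

coords = (torch.rand(16, 2, dtype=torch.float64) * 1.8 - 0.9).requires_grad_(True)

# gradcheck compares the analytic Jacobian against finite differences; bilinear
# interpolation is piecewise linear in the coords, so this should pass as long
# as no sample lands exactly on a cell boundary.
print(torch.autograd.gradcheck(interpolate, (coords,)))  # True
```

If a kernel's coordinate gradients fail a check like this, the bug is in the backward; if they pass and pose-opt still diverges, the problem is more likely the piecewise-constant/noisy nature of hash-grid gradients rather than their correctness.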

I'm wondering if this would also be the case for datasets where all cameras look at a single object, in contrast to Replica, where you reconstruct rooms with cameras looking "from the inside".

Bin-ze commented 1 year ago


Hello, sorry to bother you. I have some questions about camera pose optimization in a NeRF system: https://github.com/Totoro97/f2-nerf/issues/84. I want to add pose optimization to f2-nerf, but I've encountered a problem similar to the one you mentioned. Can you give me some advice?

LvisRoot commented 1 year ago


Hi @Bin-ze, I ended up not using hash grids as I wasn't able to implement/find an implementation of the gradients that wouldn't blow up.

I met some people from NVIDIA two months ago at a conference who had used their hash grid implementation for pose-opt, but they said some extra work had to be done to make pose-opt work. I'm not sure if they pushed those changes to their open-source repo, though.

For me, planar-based grid approaches worked just fine for pose-opt (triplanar, TensoRF (not implemented in wisp, but it's easy to derive from triplanar)).

A nice hash-based approach that worked for me OOTB for pose-opt was https://github.com/RaduAlexandru/permutohedral_encoding, which uses permutohedral grids instead of cubic ones, making it faster (fewer interpolations) and more memory efficient in higher dimensions.

In terms of pose representations, both tangent-space se3 and matrix_6dof_rotation worked fine for me, but I stick with matrix_6dof_rotation since that's already implemented in Kaolin.

Hope this helps.

Best,

Claucho

Bin-ze commented 1 year ago


Thank you for your reply! I still have some questions:

  1. I noticed that instant-ngp provides the option of training the intrinsics and extrinsics. Do you know their pose optimization method? Is it as you said, that "some work had to be done in order to do pose-opt" on their hash grid implementation? I can't understand their CUDA implementation. Regarding the issues mentioned in https://github.com/Totoro97/f2-nerf/issues/84, do you have any suggestions?
  2. Although f2-nerf is based on instant-ngp, there are many implementation differences. The author mentioned implementing the pose optimization method from nerfstudio (https://github.com/nerfstudio-project/nerfstudio/blob/main/nerfstudio/cameras/camera_optimizers.py), as discussed in https://github.com/Totoro97/f2-nerf/issues/84, but for someone with no CUDA programming or libtorch background that is almost unrealizable. I can't find a good way forward; if you have a repo that could be used as a reference, could you recommend it to me?
  3. Can you give me some advice on implementing pose optimization for f2-nerf?

Best,

Bin-ze