chenhsuanlin / bundle-adjusting-NeRF

BARF: Bundle-Adjusting Neural Radiance Fields 🤮 (ICCV 2021 oral)
MIT License

Misaligned axes when converting LLFF data format to BARF coordinate frame #27

Open mniverthi opened 2 years ago

mniverthi commented 2 years ago

In data/llff.py, the parse_cameras_and_bounds method reads poses_bounds.npy and ingests the camera poses it contains. Per LLFF's specification, this dataset's transformation matrices use the axis convention [down, right, backward] (i.e. positive x is down, positive y is right, positive z is backward).

Per line 49 in the aforementioned file, it seems like we are swapping these axes to switch to a new convention

poses_raw[...,0],poses_raw[...,1] = poses_raw[...,1],-poses_raw[...,0]

moving from [down, right, backward] to [right, up, backward] (i.e. positive x is right, positive y is up, positive z is backward).

However, the translation vector doesn't seem to receive the same modification. Is this behavior intended? It seems inconsistent with the rest of the change.
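One way to sanity-check this (a sketch with hypothetical values, assuming the first three columns of each poses_raw matrix are the camera's axis directions expressed in world coordinates): permuting or negating those columns only relabels the camera's own axes, while the last column is the camera center in the world frame, so it is unaffected by the relabeling.

```python
import numpy as np

# Hypothetical c2w pose: columns 0-2 are the camera's x/y/z axis
# directions in world coordinates, column 3 is the camera center.
c2w = np.hstack([np.eye(3), np.array([[1.0], [2.0], [3.0]])])

# LLFF [down, right, backward] -> [right, up, backward]:
# new x column = old y column, new y column = -(old x column),
# mirroring line 49 of data/llff.py.
fixed = c2w.copy()
fixed[:, 0], fixed[:, 1] = c2w[:, 1], -c2w[:, 0]

print(fixed[:, 3])  # camera center unchanged: [1. 2. 3.]
```

Under this reading, leaving the translation untouched would be consistent: the swap relabels the camera axes, not the world frame the center lives in.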

mniverthi commented 2 years ago

Additionally, I wanted to confirm that the poses you are using express the world frame with respect to the camera frame (so one has to invert a pose to map points from the camera frame to the world frame). This seems to be the case based on cam2world in camera.py, but I just wanted to make sure.

mniverthi commented 2 years ago

Additionally, in the parse_raw_camera function it seems like you are rotating around the x-axis (right) by pi radians, and it's somewhat unclear why this is done. Any guidance would be appreciated.

mniverthi commented 2 years ago

Also, I was wondering what your process for incorporating poses (aligned to an arbitrary world coordinate frame) would be.

chenhsuanlin commented 2 years ago

Hi @mniverthi, the pose convention is [right, down, forward] throughout this codebase. This is the standard form of camera extrinsic projection matrices in multi-view geometry notation -- the pose transforms a world-frame 3D point to the camera frame (cam2world() applies its inverse). Please also see the discussion in #5. The parse_raw_camera() function transforms the raw poses (the output of parse_cameras_and_bounds()) from [right, up, backward] to [right, down, forward]. I'm not sure what you mean by your last question.

Hope these help!
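For concreteness, here is a minimal sketch (hypothetical numbers, not repo code) of what the [right, down, forward] extrinsic convention means: a world-to-camera pose [R|t] maps a world point into the camera frame, and points in front of the camera come out with positive z.

```python
import numpy as np

# World-to-camera extrinsic [R|t] in the [right, down, forward] convention.
R = np.eye(3)                    # camera axes aligned with world axes
t = np.array([0.0, 0.0, 2.0])    # camera center sits at -R.T @ t = (0, 0, -2)

x_world = np.zeros(3)            # the world origin
x_cam = R @ x_world + t          # [0, 0, 2]: depth 2 along +z (forward)

# A cam2world-style mapping applies the inverse:
x_back = R.T @ (x_cam - t)
print(x_cam, x_back)             # [0. 0. 2.] [0. 0. 0.]
```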

mniverthi commented 2 years ago

Ok, thanks, I'll take a look. Lastly, I was wondering about the purpose of the center_camera_poses function. Is it to recenter the coordinate frame so that, regardless of where the cameras are in their original frame, they end up roughly at the center of the world coordinate frame used throughout the codebase?

chenhsuanlin commented 2 years ago

The center_camera_poses() function is a rewrite from the original NeRF codebase (here). I believe the main idea is to set the "average" camera pose to the identity transform (since COLMAP-computed poses can live in an arbitrary coordinate system), so that optimizing the scene-representation MLP is better behaved and doesn't have to learn to compensate for biases in the coordinates. You may also find some useful discussion in bmild/nerf#34.
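The recentering idea can be sketched as follows (a simplified reimplementation in the spirit of NeRF's poses_avg, not the repo's exact code; the pose shapes and axis/sign choices here are assumptions): build an "average" camera-to-world pose from the mean center, mean viewing axis, and a rough up vector, then left-multiply every pose by its inverse so the average pose becomes approximately the identity.

```python
import numpy as np

def normalize(v):
    return v / np.linalg.norm(v)

def center_camera_poses(c2w):
    """c2w: (N,3,4) camera-to-world poses. Returns recentered (N,3,4) poses."""
    center = c2w[:, :, 3].mean(0)        # mean camera center
    z = normalize(c2w[:, :, 2].sum(0))   # mean viewing axis (sign assumed)
    y_ = c2w[:, :, 1].sum(0)             # rough up direction
    x = normalize(np.cross(y_, z))       # re-orthogonalize via cross products
    y = np.cross(z, x)
    avg = np.eye(4)
    avg[:3] = np.stack([x, y, z, center], 1)   # the "average" c2w pose
    bottom = np.tile([[[0.0, 0.0, 0.0, 1.0]]], (len(c2w), 1, 1))
    c2w_h = np.concatenate([c2w, bottom], 1)   # homogeneous (N,4,4)
    return (np.linalg.inv(avg) @ c2w_h)[:, :3]
```

After this transform the camera centers average to the world origin, which is the "roughly in the center" behavior you describe.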

mniverthi commented 2 years ago

Ok, thanks, that makes sense.

mniverthi commented 2 years ago

Additionally, I'm still a bit confused about the behavior of parse_raw_camera.

    def parse_raw_camera(self,opt,pose_raw):
        pose_flip = camera.pose(R=torch.diag(torch.tensor([1,-1,-1]))) 
        pose = camera.pose.compose([pose_flip,pose_raw[:3]]) # right, up, backward -> right, down, forward
        pose = camera.pose.invert(pose) # c2w -> w2c
        pose = camera.pose.compose([pose_flip,pose]) # right, down, forward -> right, up, backward?
        return pose

This is the function used in llff.py. However, this pattern of transforms seems to conflict with the guarantee that parse_raw_camera outputs poses in [right, down, forward], as you mentioned in #5. The inverse going from c2w to w2c makes sense, since camera.cam2world inverts the pose:

def cam2world(X,pose):
    X_hom = to_hom(X)
    pose_inv = Pose().invert(pose)
    return X_hom@pose_inv.transpose(-1,-2)

However, the flip operation seems to be applied twice, which would revert the convention back to [right, up, backward], if I'm not mistaken.

Any clarification here would be greatly appreciated; I'm trying to integrate my own pose data into BARF, but I keep running into large rotational errors, which I suspect may stem from a misunderstanding here.
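A small numeric check of the double-flip question (a sketch with plain 4x4 matrices in column-vector convention; it assumes camera.pose.compose([a, b]) means "apply a, then b"): the two flips do not cancel. Algebraically the function computes F @ inv(P) @ F, a conjugation by the flip F, which re-expresses the inverted pose with both its camera-side and world-side axes flipped, rather than reverting to the original convention.

```python
import numpy as np

F = np.diag([1.0, -1.0, -1.0, 1.0])   # the pose_flip, as a 4x4 (F @ F = I)

def parse_raw_camera_4x4(P):
    """Mirror of parse_raw_camera with plain matrices (assumed semantics:
    compose([a, b]) applies a first, then b; column-vector convention)."""
    pose = P @ F                  # compose([pose_flip, pose_raw[:3]])
    pose = np.linalg.inv(pose)    # c2w -> w2c
    pose = pose @ F               # compose([pose_flip, pose])
    return pose

# A sample c2w pose: rotation about z plus a translation.
c, s = np.cos(0.3), np.sin(0.3)
P = np.eye(4)
P[:3, :3] = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
P[:3, 3] = [0.1, 0.2, 0.3]

out = parse_raw_camera_4x4(P)
print(np.allclose(out, F @ np.linalg.inv(P) @ F))   # True: a conjugation
print(np.allclose(out, np.linalg.inv(P)))           # False: flips don't cancel
```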

chenhsuanlin commented 2 years ago

The parse_raw_camera() function is really just a black-box conversion from the LLFF format to the standard extrinsics format. I could revisit the LLFF pose format and verify the math, but I don't have many cycles for this right now. I'm not sure what you mean by large rotation errors, but if you're running BARF with the poses initialized to identity, the actual format of the pose shouldn't matter much.

chenhsuanlin commented 2 years ago

Closing due to inactivity, please feel free to reopen if necessary.

AIBluefisher commented 2 years ago

> (quoting @chenhsuanlin's reply above about parse_raw_camera() being a black-box conversion)

I think after the last flip we are actually in the OpenGL coordinate system; see the README of the original NeRF. However, I'm quite confused about the coordinate system of this codebase, since blender.py only flips once, which is consistent with the multi-view-geometry definition, while llff.py ends up in the OpenGL convention.

wujun-cse commented 2 years ago

> (quoting @AIBluefisher's comment above)

Agree. The transformation here is quite confusing. According to camera.cam2world, the pose should represent the transformation from the world frame to the camera frame in the traditional multi-view-geometry form, yet after the parse_raw_camera function the poses are in OpenGL form.
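For reference, the two conventions discussed here differ only by a fixed flip of the camera's y and z axes, so converting a camera-to-world pose between them is a single right-multiplication (a generic sketch, not repo code):

```python
import numpy as np

FLIP_YZ = np.diag([1.0, -1.0, -1.0, 1.0])   # its own inverse

def opengl_to_cv_c2w(c2w_gl):
    """Convert a 4x4 c2w pose from OpenGL [right, up, backward] to the
    multi-view-geometry convention [right, down, forward]. Right-
    multiplying flips the camera's own y/z axes; the camera center
    (a world-frame quantity) is untouched."""
    return c2w_gl @ FLIP_YZ
```

Since FLIP_YZ is its own inverse, the same multiplication converts back, which is one reason single vs. double flips in a pipeline are easy to mix up.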

chenhsuanlin commented 2 years ago

Thanks for the discussions. I will look into this part again in the coming days. Reopening the issue for now.

yuzhongruicn commented 1 year ago

> (quoting @mniverthi's earlier question about parse_raw_camera)

Hi, I'm running into the same issue. When I tried to train BARF with my own pose data in LLFF format, there were large rotation errors: the novel views look good but the test views are a mess. Did you fix it? Thanks!

curryandklay commented 6 months ago

> (quoting @mniverthi's question about parse_raw_camera and @yuzhongruicn's reply above)

I'm also experiencing this problem: the rotation error has a mean of about 120 degrees (roughly 2.09 radians), while the translation error has a mean of only about 0.6. I haven't been able to find the cause. Have you solved it yet? Any reply would be appreciated!