Open mniverthi opened 2 years ago
Additionally, I wanted to confirm that the poses you are using represent the world frame with respect to the camera frame (so you have to invert the pose to go from points in the camera frame to points in the world frame). It seems like this is the case based on `cam2world()` in `camera.py`, but I just wanted to make sure.

Additionally, in the `parse_raw_camera()` function it seems like you are rotating around the x axis (right) by pi radians, and it's somewhat unclear why this is being done. Any guidance would be appreciated.

Also, I was wondering what your process for incorporating poses (aligned to an arbitrary world coordinate frame) would be.
Hi @mniverthi, the pose convention is [right, down, forward] throughout this codebase. This is the standard form of camera extrinsic projection matrices in the notation of multi-view geometry -- it transforms a world-frame 3D point to the camera frame (through the `cam2world()` function). Please also see the discussions in #5.
The `parse_raw_camera()` function transforms the raw poses (the output of `parse_cameras_and_bounds()`) from [right, up, backward] to [right, down, forward]. I'm not sure what you mean by your last question.
Hope these help!
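For concreteness, the [right, up, backward] -> [right, down, forward] change is just a sign flip of the y and z axes, i.e. multiplication by diag(1, -1, -1). A small numpy sketch (my own illustration, not the codebase's code):

```python
import numpy as np

# Negating y and z converts camera-frame coordinates between the
# [right, up, backward] and [right, down, forward] conventions.
flip = np.diag([1.0, -1.0, -1.0])

p_rub = np.array([0.2, 0.5, -1.0])  # [right, up, backward] coordinates
p_rdf = flip @ p_rub                # [right, down, forward] coordinates

# A point 1 unit in front of the camera (z = -1 when z points
# backward) gets z = +1 when z points forward.
assert np.allclose(p_rdf, [0.2, -0.5, 1.0])
```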
Ok, thanks, I'll take a look. Lastly, I was wondering what the purpose of the `center_camera_poses()` function is. Is it to effectively re-center the coordinate frame so that, regardless of where the cameras are in their original frame, they end up roughly at the center of the world coordinate frame used throughout the codebase?
The `center_camera_poses()` function is a rewrite from the original NeRF codebase (here). I believe the main idea is to set the "average" camera pose to the identity transform (since COLMAP-computed poses can live in an arbitrary coordinate system), so that the optimization of the scene representation MLP is better behaved and does not have to learn to compensate for biases in the coordinates. You may also find some useful discussions in bmild/nerf#34.
Ok, thanks, that makes sense.
Additionally, I was still a bit confused about the behavior of `parse_raw_camera()`:

```python
def parse_raw_camera(self, opt, pose_raw):
    pose_flip = camera.pose(R=torch.diag(torch.tensor([1, -1, -1])))
    pose = camera.pose.compose([pose_flip, pose_raw[:3]])  # right, up, backward -> right, down, forward
    pose = camera.pose.invert(pose)  # c2w -> w2c
    pose = camera.pose.compose([pose_flip, pose])  # right, down, forward -> right, up, backward?
    return pose
```
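One way to see the net effect (a numpy sketch of my own; I'm treating `compose` as plain matrix multiplication of homogeneous 4x4 poses): since the flip F is its own inverse, flip -> invert -> flip works out to F @ P^{-1} @ F regardless of the composition order, because F appears on both sides. That is, the output is the inverted pose conjugated by the axis flip, so both its world and camera axes are expressed in the flipped ([right, up, backward]) convention rather than [right, down, forward].

```python
import numpy as np

# F flips the y and z axes; it is its own inverse (F @ F == I).
F = np.diag([1.0, -1.0, -1.0, 1.0])

# A toy c2w pose: a rotation about y plus a translation.
theta = 0.3
P = np.eye(4)
P[:3, :3] = [[np.cos(theta), 0, np.sin(theta)],
             [0, 1, 0],
             [-np.sin(theta), 0, np.cos(theta)]]
P[:3, 3] = [0.0, 0.0, 2.0]

# flip -> invert -> flip, as in parse_raw_camera():
out = F @ np.linalg.inv(F @ P)

# Net effect: the inverted pose conjugated by the flip.
assert np.allclose(out, F @ np.linalg.inv(P) @ F)
```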
This is the function used in `llff.py`. However, this pattern of transforms seems to conflict with the fact that `parse_raw_camera()` is guaranteed to output something that is [right, down, forward], as you mentioned in #5. The inversion to go from c2w to w2c makes sense, as `camera.cam2world()` does invert the pose:

```python
def cam2world(X, pose):
    X_hom = to_hom(X)
    pose_inv = Pose().invert(pose)
    return X_hom @ pose_inv.transpose(-1, -2)
```

However, the flip operation seems to happen twice, which would revert back to [right, up, backward], if I'm not mistaken. Any clarification here would be greatly appreciated, as I'm trying to integrate my own pose data into BARF, but I keep running into large rotational errors, which I suspect might be due to a misunderstanding on my part here.
We have the `parse_raw_camera()` function really as a mere black-box function to convert the LLFF format to the standard extrinsics format. I could revisit the LLFF pose format and verify the math for this, but I don't have many cycles for this right now. I'm not sure what you meant by large rotation errors, but if you're running BARF with the poses initialized to identity, the actual format of the pose shouldn't really matter much.
Closing due to inactivity, please feel free to reopen if necessary.
I think after the last flip, we are actually using the OpenGL coordinate system (refer to the README of the original NeRF). However, I'm quite confused about the coordinate system of this codebase, since `blender.py` only flips once, which is consistent with the multi-view geometry definition, while `llff.py` ends up using the OpenGL coordinate system.
Agreed. The transformation here is quite confusing. According to `cam2world()` in `camera.py`, the pose should be the transformation from the world frame to the camera frame in the traditional multi-view geometry form, whereas after the `parse_raw_camera()` function the poses are in OpenGL form.
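If I understand the single-flip case correctly (a numpy sketch of my own, assuming the flip right-multiplies the c2w rotation as in `blender.py`): one flip relabels the camera's own axes from OpenGL-style [right, up, backward] to [right, down, forward], so e.g. the camera's forward direction in world coordinates comes out the same under either convention.

```python
import numpy as np

# The two conventions differ by negating the camera's y and z axes.
S = np.diag([1.0, -1.0, -1.0])

# A c2w rotation in the OpenGL convention: identity means the camera
# looks down -z with +y up.
R_gl = np.eye(3)

# Right-multiplying by S flips the camera's own y/z axes (a single
# flip), giving a c2w in the [right, down, forward] convention.
R_cv = R_gl @ S

# The camera's forward direction in world coordinates is the same:
forward_gl = -R_gl[:, 2]   # OpenGL: forward is the -z column
forward_cv = R_cv[:, 2]    # MVG/OpenCV: forward is the +z column
assert np.allclose(forward_gl, forward_cv)
```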
Thanks for the discussions. I will look into this part again in the coming days. Reopening the issue for now.
Hi, I met the same issue as you. When I tried to train BARF with my own pose data in LLFF format, there were large rotation errors. The novel views look good, but the test views are a mess. Did you fix it? Thanks!
I'm also experiencing this problem: the rotation error has a mean of about 120 degrees (roughly 2.1 radians), while the translation error has a mean of only about 0.6. I haven't been able to figure out what the problem is. Have you solved it yet? Any reply would be appreciated!
In `data/llff.py`, in the `parse_cameras_and_bounds()` method, we are reading `poses_bounds.npy` and ingesting it to use the given camera poses. Per LLFF's specification, it seems like the convention of this dataset has the transformation matrix for axes [down, right, backward] (i.e. positive x is down, positive y is right, positive z is backward).

Per line 49 in the aforementioned file, it seems like we are swapping these axes to switch to a new convention, moving from [down, right, backward] to [right, up, backward] (i.e. positive x is right, positive y is up, positive z is backward). However, the translation vector doesn't seem to receive the same modification. Is this behavior intended? It seems inconsistent with the rest of the change.