bmild / nerf

Code release for NeRF (Neural Radiance Fields)
http://tancik.com/nerf
MIT License

Creating a synthetic dataset with Unity: overly white images #119

Closed AndreaMas closed 3 years ago

AndreaMas commented 3 years ago

Hi, I'm trying to create a synthetic dataset of a scene within Unity3D. I acquire images of a human model by spawning cameras in a semicircle. I'm able to create a dataset identical in format to the ones Blender generates.

When running NeRF I get weird results. The output appears "ghostly" and the overall result is blurred (and this is a lucky case; often the output is just all white).

[attached screenshot of the blurred, ghostly NeRF output]

On the other hand, running NeRF using COLMAP to estimate the camera positions (using similar input images) gives really nice results.

The camera position precision is slightly lower than what Blender generates (since Unity uses float32 rather than float64 to store positions), but I doubt this is the problem.

So … what might the problem be?

PS: Thanks to the authors for sharing their work and code with everyone.

AndreaMas commented 3 years ago

The problem is not the inferior precision in camera position.

I've verified this with the lego dataset in this repository: modifying the transform_matrix values in the JSON file, replacing float64 values with float32 ones, the results still look just as good.

AndreaMas commented 3 years ago

The "ghostly" effect was caused by the lack of transparency in the training images. Images for Blender datasets must have a transparent background, not a white one.

Also, Unity's camera-to-world matrix must have its 2nd and 3rd rows swapped, since Unity has the y axis pointing up, whereas in Blender it's the z axis that points up.
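
In code, the row swap looks roughly like this (a minimal NumPy sketch; c2w_unity is a hypothetical 4x4 camera-to-world matrix exported from Unity, and depending on handedness you may also need to flip a sign or two, so verify against a known camera):

import numpy as np

def unity_to_nerf_pose(c2w_unity):
    # Swap the 2nd and 3rd rows so the up axis matches Blender/NeRF
    # (Unity is y-up, Blender is z-up). Depending on handedness you may
    # also need to negate a row; check against a known camera pose.
    c2w = np.asarray(c2w_unity, dtype=np.float64).copy()
    c2w[[1, 2], :] = c2w[[2, 1], :]
    return c2w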

raynehe commented 1 year ago

@AndreaMas Hi! Sorry to bother you. Could you please explain what you mean by "Images for Blender datasets must have a transparent background, not a white one"? I mean, if the inputs are RGBA images, we should transform them to RGB images, right? Thank you very much!

AndreaMas commented 1 year ago

Hi @raynehe, no bother at all. Can you further describe the issue you're having?

I'll describe my own issue further. I generated my own synthetic dataset in Unity, consisting of RGB images plus camera positions and orientations. I realized that the NeRF renders I was getting were very white. I solved the problem by fixing the camera orientations in my dataset and by giving my subject a transparent background; I'm not sure which of the two was the actual fix.

If you're having the same issue, I'd suggest checking both of those things: the camera orientation convention of your dataset and the background transparency of your images.

AndreaMas commented 1 year ago

Also, NeRF supports an alpha channel, so it's not mandatory to transform RGBA images into RGB ones, although I'm not 100% sure this holds for non-synthetic datasets.

raynehe commented 1 year ago

@AndreaMas Thank you so much for your quick and detailed reply!

My datasets are RGBA images and I use this script to translate them to white-background RGB images:

image = image[..., :3] * image[..., -1:] + (1. - image[..., -1:])

My issue is that VolSDF's inputs are in the DTU dataset format (neither BlendedMVS nor COLMAP, nor the one used in NeRF), which requires a camera projection matrix composed of intrinsics and extrinsics. I manually convert the "transform_matrix" in "transforms_train.json" (generated by Blender) to the extrinsics, but when training with VolSDF, the results show there's something wrong with the camera parameters.

I think there might be two reasons:

  1. There might be something wrong in the conversion from "transform_matrix" to the camera extrinsics. I'm still figuring out the pose matrix format of these two datasets, but there seems to be limited information. There are a few small questions I'm wondering if you know the answer to. In the .blend files, we get "transform_matrix" using:

    'transform_matrix': listify_matrix(cam.matrix_world)

    Is cam.matrix_world the world-to-camera matrix or the camera-to-world matrix? Is it "x right, y up, z back"?

  2. The positions are located inside a cube of radius 11.0 instead of 1.0 (because the virtual cameras are placed on a sphere with radius=11.0). Could you give me a hint on how to change the camera parameters after normalizing the 3D positions?

Many Thanks!

raynehe commented 1 year ago

Also, I set the intrinsic matrix using:

focal = .5 * W / np.tan(0.5 * self.meta['camera_angle_x'])
intrinsics = np.array([[focal, 0,0.5*w],
                       [focal,0,0.5*H],
                       [0,0,1]]).tolist()
AndreaMas commented 1 year ago

This is what I found out on the transforms.json so far.

The transforms.json file contains the intrinsics and extrinsics of the cameras in the scene; the most important fields are camera_angle_x (the horizontal field of view, from which the focal length can be derived) and, for each frame, transform_matrix (a 4x4 camera-to-world matrix).

Cameras share the OpenGL convention: x points right, y points up, z points backward (the camera looks down its negative z axis).


The following is less relevant but might be useful. The JSON files were originally generated by the NeRF authors within Blender, whose world axis convention is x right, y forward, z up (right-handed, with z as the up axis).

A transform_matrix such as this one:

[1,0,0,0]
[0,1,0,0]
[0,0,1,0]
[0,0,0,1]

represents a camera placed at the origin, pointing straight down (along the world's negative z axis), with the world's positive x axis toward the image's right and the positive y axis toward the image's top.

If you find yourself in a different 3D program with a different axis convention, you'll need to account for that when generating datasets (typically by swapping rows of the matrix or negating them).
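
If it helps, here is a rough sketch of how such a transforms.json can be read (field names as in the Blender-generated files used by this repo; the resolution and the axis fix-up for other programs are assumptions to adapt to your own setup):

import json
import numpy as np

with open('transforms_train.json') as f:
    meta = json.load(f)

H, W = 800, 800  # resolution of the synthetic renders; adjust to your images
focal = 0.5 * W / np.tan(0.5 * meta['camera_angle_x'])  # intrinsics from the FOV

poses = []
for frame in meta['frames']:
    c2w = np.array(frame['transform_matrix'])  # 4x4 camera-to-world, OpenGL camera axes
    # If the dataset comes from a program with a different world axis convention
    # (e.g. Unity), fix it here with the appropriate row swaps / sign flips.
    poses.append(c2w)
poses = np.stack(poses)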

AndreaMas commented 1 year ago

It's also important to understand the difference between the world-to-camera matrix, the camera-to-world matrix, and the perspective projection matrix (PPM).

I'm still struggling with this topic, so the following might contain errors; please double-check.


World to camera matrix

Brings points from world space into camera space. Basically, it rotates and translates the whole world so that the camera ends up centered at the origin.

world-to-camera-matrix = [ M b ]
                         [ 0 1 ]

where M is a 3x3 matrix and b is a 3x1 vector. NOTE: sometimes you will find M called R and b called t; I avoid that naming here because I use those letters for the next matrix, so just be aware of it.


Camera to world matrix

The inverse of the above: it brings points from camera space into world space. This matrix therefore directly represents the position and orientation of the camera in world space.

camera-to-world-matrix = inv( world-to-camera-matrix )
                       = [ inv(M)    -inv(M) * b ]
                         [   0             1     ]
                       = [ R t ]
                         [ 0 1 ]

where R is a 3x3 matrix that represents the rotation of the camera in world space, while t is a 3x1 vector that represents the camera position (i.e. its translation with respect to the origin).
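
A quick NumPy check of the block formula above, with hypothetical values for M and b:

import numpy as np

# hypothetical example: M rotates 90 degrees about z, b is an arbitrary translation
M = np.array([[0., -1., 0.],
              [1.,  0., 0.],
              [0.,  0., 1.]])
b = np.array([1., 2., 3.])

w2c = np.eye(4)
w2c[:3, :3] = M
w2c[:3, 3] = b

c2w = np.linalg.inv(w2c)
R = c2w[:3, :3]   # rotation of the camera in world space
t = c2w[:3, 3]    # position of the camera in world space

assert np.allclose(R, np.linalg.inv(M))
assert np.allclose(t, -np.linalg.inv(M) @ b)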


Perspective Projection matrix

Also known as the PPM or camera projection matrix; it projects points in world space onto the camera image plane. It is composed of the intrinsics and extrinsics matrices:

PPM = K * world-to-camera-matrix

where K is the matrix of camera intrinsics (usually written as a 3x3 matrix; it has to be padded, or multiplied against the top three rows of the world-to-camera matrix, to be compatible with the 4x4 extrinsics).

As said, the PPM is used to project a point in world space to the camera image, like so:

p = PPM * P

where P is a point in homogeneous world coordinates (Xw, Yw, Zw, 1), while p is the same point projected onto the image plane; dividing p by its last component gives the pixel coordinates (u, v).
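
A small worked example of that projection in NumPy (all values hypothetical; the 3x3 K multiplies the top three rows of the 4x4 world-to-camera matrix to give a 3x4 PPM):

import numpy as np

# hypothetical intrinsics: square pixels, principal point at the image center
W, H, focal = 800, 800, 1111.1
K = np.array([[focal, 0.0, 0.5 * W],
              [0.0, focal, 0.5 * H],
              [0.0,   0.0,     1.0]])

# hypothetical extrinsics: camera at the origin, no rotation.
# Note: this uses the computer-vision convention (camera looking down +z);
# with the OpenGL-style poses in transforms.json the camera looks down -z,
# so the z sign needs care when building the world-to-camera matrix.
w2c = np.eye(4)

PPM = K @ w2c[:3, :]                  # 3x4 perspective projection matrix

P = np.array([0.5, -0.2, 4.0, 1.0])   # world point in homogeneous coordinates
p = PPM @ P
u, v = p[0] / p[2], p[1] / p[2]       # divide by the last component -> pixel coords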


Best explanations on the topic I found so far:

AndreaMas commented 1 year ago

@raynehe replying in short to your questions:

  1. cam.matrix_world is the camera-to-world matrix (not world-to-camera), with camera axes x right, y up, z back.
  2. I don't think you need to change the camera intrinsics after rescaling the world, since the field of view is expressed as an angle (see the sketch below).
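
For point 2, a minimal sketch of what I mean (assuming 4x4 camera-to-world matrices and the sphere radius of 11.0 you mention): only the translation column gets scaled, while the rotation and the intrinsics stay as they are.

import numpy as np

def rescale_pose(c2w, radius=11.0):
    # Scale only the camera position so the cameras end up near the unit sphere;
    # the rotation block (and the intrinsics) are left untouched.
    c2w = np.asarray(c2w, dtype=np.float64).copy()
    c2w[:3, 3] /= radius
    return c2w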
AndreaMas commented 1 year ago

Also, I set the intrinsic matrix using:

focal = .5 * W / np.tan(0.5 * self.meta['camera_angle_x'])
intrinsics = np.array([[focal, 0,0.5*w],
                       [focal,0,0.5*H],
                       [0,0,1]]).tolist()

I think this intrinsics matrix is mostly correct, but perhaps there's a typo? The second row is missing its leading zero, and the two focal lengths should be computed separately:

focalX = .5 * W / np.tan(0.5 * self.meta['camera_angle_x'])
focalY = .5 * H / np.tan(0.5 * self.meta['camera_angle_y'])
intrinsics = np.array([ [ focalX, 0, 0.5*W ],
                        [ 0, focalY, 0.5*H ],
                        [ 0, 0 , 1 ]  ]).tolist()
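
One more note: if I remember correctly, the transforms.json files of the synthetic Blender scenes only store camera_angle_x, and those renders use square pixels, so setting focalY equal to focalX should give the same result.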