donydchen / mvsplat

🌊 [ECCV'24] MVSplat: Efficient 3D Gaussian Splatting from Sparse Multi-View Images
https://donydchen.github.io/mvsplat

Why normalize the intrinsics and extrinsics in convert_dtu.py? #28

Closed Miaosheng1 closed 1 month ago

Miaosheng1 commented 1 month ago

For the DTU dataset, I noticed that your code normalizes the camera intrinsics in convert_dtu.py. As a result, the poses and intrinsics in the code are hard to understand, and it is not convenient to use this code for third-party datasets (e.g., Waymo or Mip-NeRF 360).

donydchen commented 1 month ago

Hi @Miaosheng1, thanks for your interest in our work.

We mainly normalised the DTU intrinsics to align with our default RE10K data loader, which assumes normalised intrinsics as input. Besides, since we use the camera parameters at different resolutions, e.g., the image and feature resolutions (for building the feature cost volume), it is more convenient to normalise them once at input and scale them back to a specific resolution when needed.

Normalising the intrinsic mainly involves dividing it by the initial height and width. Suppose the original intrinsic is

K = [[width * focal_length_x ,            0            , width * principal_point_x ],
     [           0           , height * focal_length_y , height * principal_point_y],
     [           0           ,            0            ,             1             ]]

Then, you just need to divide the first row by width and the second row by height, and that's it. The remaining parts of convert_dtu.py mainly aim to align the cropped data with the initial camera parameters, which is unnecessary for other datasets.
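
For concreteness, here is a minimal Python sketch of this normalisation; the helper name is mine, not from the repo:

import numpy as np

def normalize_intrinsics(K, width, height):
    # Divide the first row by width and the second by height, so the focal
    # lengths and principal point become fractions of the image size.
    K_norm = K.astype(np.float64).copy()
    K_norm[0, :] /= width
    K_norm[1, :] /= height
    return K_norm

# Example: a 640x512 image with the principal point at the exact center.
K = np.array([[640 * 1.2, 0.0, 640 * 0.5],
              [0.0, 512 * 1.5, 512 * 0.5],
              [0.0, 0.0, 1.0]])
print(normalize_intrinsics(K, width=640, height=512))
# [[1.2 0.  0.5]
#  [0.  1.5 0.5]
#  [0.  0.  1. ]]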

For the camera extrinsics, MVSplat uses OpenCV-style camera-to-world matrices. So, basically, +Z is the camera look direction, +X is camera right, and -Y is camera up.
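
To illustrate that convention (the variable names here are mine): in a 4x4 camera-to-world matrix, the rotation columns are the camera axes expressed in world coordinates.

import numpy as np

c2w = np.eye(4)          # identity pose: camera at the world origin, unrotated

cam_right = c2w[:3, 0]   # +X axis: camera right
cam_down  = c2w[:3, 1]   # +Y axis: camera down (so -Y is camera up)
cam_look  = c2w[:3, 2]   # +Z axis: camera look direction
cam_pos   = c2w[:3, 3]   # camera center in world coordinates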

Let me know if you have encountered any other difficulties. Cheers.

Miaosheng1 commented 1 month ago

Thank you for your clear response! I also noticed that the scale parameter is set to 200 in convert_dtu.py, which means the translation is divided by 200 in your code. How did you set this hyperparameter, and what does it mean? Besides, as you normalize the intrinsics, the input image is cropped in the function apply_crop_shim. So my question is: what if I do not crop the image (i.e., cx != width / 2) in DTU and do not normalize the intrinsics at the same time, does it work?

donydchen commented 1 month ago

Hi @Miaosheng1, when I said "align the cropped data with the initial camera parameters" above, "cropped" refers to the preprocessed "cropped" version of the DTU dataset obtained from MVSNeRF; it does not refer to the crop in our code.

Again, this scale factor is only for DTU. We adopted this value from MVSNeRF; see here. For other datasets, you don't need to apply this scale operation; just normalise the intrinsics as mentioned above, and you're good to go.
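
For reference, a minimal sketch of that DTU-only step; only the factor of 200 comes from the script, and the pose values below are made up:

import numpy as np

SCALE_FACTOR = 200.0                  # adopted from MVSNeRF, DTU only

pose = np.eye(4)                      # stand-in for a DTU extrinsic matrix
pose[:3, 3] = [400.0, -150.0, 600.0]  # made-up translation in DTU world units
pose[:3, 3] /= SCALE_FACTOR           # rescale the scene; rotation is untouched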

Ideally, it will work with unnormalised input, provided that you check through all the details of the code and ensure that every intrinsic matrix is correctly aligned with the resolution it operates at, e.g., the snippet below for the feature volume, which is 1/4 of the image resolution:

https://github.com/donydchen/mvsplat/blob/378ff818c0151719bbc052ac2797a2c769766320/src/model/encoder/costvolume/depth_predictor_multiview.py#L108-L111
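
With normalised intrinsics, scaling back to any working resolution is straightforward; a sketch of the idea (not the repo's exact code):

import torch

def scale_intrinsics(K_norm, width, height):
    # Turn normalised intrinsics back into pixel units at a given working
    # resolution, e.g. the 1/4-scale feature maps used for the cost volume.
    K = K_norm.clone()
    K[..., 0, :] *= width
    K[..., 1, :] *= height
    return K

K_norm = torch.tensor([[1.2, 0.0, 0.5],
                       [0.0, 1.5, 0.5],
                       [0.0, 0.0, 1.0]])
K_feat = scale_intrinsics(K_norm, width=256 // 4, height=256 // 4)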

Miaosheng1 commented 1 month ago

Thank you for your suggestion. In your provided DTU dataset, the image size is (512×640). In your code, the image is cropped (mainly in the function apply_crop_shim), and the processed image size is (256×256). This process places the principal point exactly at the center of the image, and I think this processed (256×256) image corresponds to the normalized intrinsics you mentioned. When I change the image size from (256×256) to (512×640) in dtu.yaml, it leads to some ghost artifacts; see below. Should I change the normalized camera intrinsics generation (in convert_dtu.py) when I want to render the image at the original size of 512×640 (where the principal point is not located at the image center)? By the way, if the principal point is not at the image center, the 3DGS projection matrix should also be changed.

[attached render: 000032]

donydchen commented 1 month ago

Hi @Miaosheng1, in our code, we assume the image shapes are even (see here) and mainly perform a center crop, which is symmetric and should not change the normalised principal point. Since you use the original DTU shape and do not perform any operation like asymmetric cropping, the normalised principal point is still at (0.5, 0.5). The convert script only normalises the original DTU-provided camera intrinsics, which are not tied to the resolution in our MVSplat data loader.
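
A quick worked example of why a symmetric center crop keeps a centered principal point in place (the numbers are illustrative):

w, crop_w = 640, 512          # original and cropped widths
cx = 0.5 * w                  # centered principal point, in pixels
left = (w - crop_w) // 2      # symmetric margin removed from each side
cx_cropped = cx - left        # principal point in the cropped frame
print(cx_cropped / crop_w)    # 0.5 again after re-normalisation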

In fact, the Gaussian Splatting rasterizer we use only supports a principal point at the image center, so keep this in mind when applying any related operations.

Additionally, I think the results you got look reasonably good. Since the model is trained on the 256×256 RE10K dataset, its performance might be suboptimal when applied to other resolutions and/or data types. To get better performance, you might consider fine-tuning the model on the DTU training set at its original shape.

Miaosheng1 commented 1 month ago

Thanks

ShunyuanZheng commented 1 month ago

Hi @donydchen, I am confused about something similar here.

Since the DTU dataset's principal points are not located at the image center, are the w and h here calculated correctly? The same goes for the cx and cy here; it seems they are forced to the center of the image, which is not the case in reality.

donydchen commented 1 month ago

Hi @ShunyuanZheng, you are correct. The original principal point of DTU is slightly off-center; we force it to be in the center for simplicity.

Take the scene 'scan1_train' as an example. After going through the preprocessing, its intrinsic matrix (at here) is

intr = [[1446.165, 0.0,     331.602 ],
        [0.0,      1441.59, 265.5345],
        [0.0,      0.0,     1.0     ]]

Since its image shape is (512, 640), the correct normalized principal point should be around (0.5181, 0.5186), which is slightly off-center. Of course, one could apply cropping to shift the principal point to the exact center. Still, I think forcing it to be at the center is within reasonable variation and easy to understand for quick testing.
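
A quick check of those numbers:

width, height = 640, 512          # image shape (512, 640) is (height, width)
cx, cy = 331.602, 265.5345
print(cx / width, cy / height)    # ~0.5181, ~0.5186, slightly off-center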

ShunyuanZheng commented 1 month ago

I see. Thanks for your quick reply!