TQTQliu / MVSGaussian

[ECCV 2024] MVSGaussian: Fast Generalizable Gaussian Splatting Reconstruction from Multi-View Stereo
https://mvsgaussian.github.io/
MIT License
366 stars 19 forks

Some issues during the per-scene optimization process. #13

Closed zl264 closed 1 month ago

zl264 commented 2 months ago

Thank you very much for this excellent work! I have encountered some issues during the per-scene optimization process.

  1. While reviewing the code, I noticed that in per-scene optimization (where only the 3D Gaussians are optimized), 3DGS is initialized from the point cloud (xyz and rgb) provided by MVSGaussian, rather than from the full 3D Gaussians (xyz, rgb, rotation, scale, and opacity) predicted by MVSGaussian. In theory, initializing with the full 3D Gaussians instead of a point cloud should give a better initialization. Is my understanding correct?
  2. I am having trouble loading the dataset for per-scene optimization. Following the released code, I multiplied the translation T by the scale_factor when loading the COLMAP pose, i.e. T = T * scale_factor, but with this the optimization fails and the loss does not decrease. However, when I set scale_factor to 1, the loss decreases and the scene is optimized normally.
  3. I found that when MVSGaussian runs inference with different scale_factors, larger scale_factors yield better rendered images, yet the number of points and the shape of the point cloud remain very similar, with only the scale differing. I would like to know whether using different scale_factors during inference produces point clouds of different scales, and whether the final per-scene optimization results are then similar. From my understanding, if the initial point cloud distributions are similar, the final optimization results should be close.
TQTQliu commented 2 months ago

Thanks for your interest in our work.

  1. Your observation is correct. During per-scene optimization, we only use the point cloud (xyz and rgb) provided by MVSGaussian as initialization, not all Gaussian attributes (xyz, rgb, rotation, scaling, and opacity). Intuitively, loading all Gaussian attributes as the initialization should yield better results, but we tried this and found that it does not necessarily help. We found that 3DGS relies heavily on its initialization, and the key is to initialize the point locations accurately and densely enough (especially for complex scenes); the other attributes, such as rotation, scaling, and opacity, are handled well enough by 3DGS's default initialization.
  2. First, let us explain why we multiply the translation T by the scale_factor. Since our model was trained on the DTU dataset (depth range 425 to 905), when testing on a new dataset we use scale_factor to bring the depth of the new scene close to the depth range of DTU, which gives good results. For example, for the NeRF synthetic dataset with a depth range of 2.5 to 5.5, we set scale_factor to 200 (2.5 x 200 = 500 and 5.5 x 200 = 1100, roughly matching DTU's 425 to 905). Since the depth is multiplied by scale_factor, the translation T must also be multiplied by scale_factor. Consequently, in the subsequent per-scene optimization phase, the initialized point cloud is already scaled by scale_factor, so the corresponding translation T also needs to be scaled; see the small sketch after this list. If you use your own data, you need to decide whether to scale the translation T based on whether the depth of the initial point cloud was scaled.
  3. A larger scale_factor is not always better; it should be chosen according to the depth range, as described in point 2.
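
To make point 2 concrete, here is a minimal sketch (illustrative only, not from the repository; the camera and point values are made up) showing that scaling both the 3D points and the translation t of a world-to-camera pose [R | t] by the same scale_factor leaves the projected pixel coordinates unchanged, whereas scaling only the points breaks the geometry:

import numpy as np

def project(points_w, R, t, K):
    """Project world points with a world-to-camera pose [R | t] and intrinsics K."""
    cam = points_w @ R.T + t          # x_cam = R @ X + t
    pix = cam @ K.T                   # homogeneous pixel coordinates
    return pix[:, :2] / pix[:, 2:3]   # perspective divide

# Toy camera and points (arbitrary values, for illustration only).
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)
t = np.array([0.1, -0.2, 4.0])
X = np.random.rand(5, 3)

s = 100.0  # scale_factor
orig_px = project(X, R, t, K)
both_scaled = project(s * X, R, s * t, K)   # scale points AND translation
only_points = project(s * X, R, t, K)       # scale points only

print(np.allclose(orig_px, both_scaled))  # True: projections are identical
print(np.allclose(orig_px, only_points))  # False: geometry is inconsistent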
zl264 commented 2 months ago

Thanks for your interest in our work.

  1. Your observation is correct. During the per-scene optimization, we only use the point cloud (including xyz and rgb) provided by MVSGaussian as initialization, not all Gaussian attributes (xyz, rgb, rotation, scaling, and opacity). Intuitively, loading all Gaussian attributes as initializations would yield better results, but we have tried loading all attributes and found that this approach does not necessarily lead to better results. We found that 3DGS relies on initialization heavily, and the key is to accurately initialize the location of the point cloud and the number of points to be dense (especially for complex scenes), while other attributes, such as rotation, scaling, and opacity, may be sufficient using the default initialization methods designed by 3DGS.
  2. First of all, let's explain why we multiply the translation T by the scale_factor. Since our model was trained on the DTU dataset (depth range 425 to 905), when testing on other new datasets, we used scale_factor to adjust the depth of the new scenario to be close to the depth of the DTU dataset to get good results. For example, for the nerf synthetic dataset with a depth range of 2.5 to 5.5, we set scale_factor to 200. The depth is multiplied by scale_factor, and accordingly the translation T should be also multiplied by scale_factor. Therefore, in the subsequent per-scene optimization phase, the depth of the initialized point cloud is actually scaled by scale_factor, so the corresponding translation T also needs to be scaled. If you use your own data, you need to determine whether the translation T needs to be scaled based on whether the depth of the initial point cloud is scaled.
  3. The scale_factor is not the bigger the better, but is adjusted according to the depth range, as described in 2.

Thank you very much for your reply. Regarding points 2 and 3, I would like to add the following:

  1. I tested this on my own dataset with a depth range of 2 to 4. By sequentially increasing the scale_factor from 1 to 100, I observed a gradual improvement in the rendered images.
  2. As the scale_factor increased, I also visualized the resulting point clouds and noticed that their scale increased accordingly. Therefore, multiplying T by the scale_factor should be correct. Here is my code for loading camera poses:
import os
import sys

import cv2
import numpy as np
from PIL import Image

# qvec2rotmat, focal2fov and CameraInfo are the usual helpers from the 3DGS
# codebase (scene/colmap_loader.py, utils/graphics_utils.py and
# scene/dataset_readers.py); adjust the imports to your repository layout.


def readColmapCameras(cam_extrinsics, cam_intrinsics, path, images_folder,
                      size=(960, 640), scale_factor=100, init_ply=None):
    cam_infos = []
    for idx, key in enumerate(cam_extrinsics):
        # Progress display
        sys.stdout.write('\r')
        sys.stdout.write("Reading camera {}/{}".format(idx + 1, len(cam_extrinsics)))
        sys.stdout.flush()

        extr = cam_extrinsics[key]
        intr = cam_intrinsics[extr.camera_id]
        h_o, w_o = intr.height, intr.width
        height = size[0]
        width = size[1]
        uid = intr.id

        R = np.transpose(qvec2rotmat(extr.qvec))
        T = np.array(extr.tvec)
        # Scale the translation to match the depth scaling of the initial point cloud
        T = T * scale_factor
        if intr.model == "SIMPLE_PINHOLE":
            # Single focal length; rescale it separately for each axis after resizing
            focal_length_x = intr.params[0] * width / w_o
            focal_length_y = intr.params[0] * height / h_o
            FovY = focal2fov(focal_length_y, height)
            FovX = focal2fov(focal_length_x, width)
        elif intr.model == "PINHOLE":
            focal_length_x = intr.params[0] * width / w_o
            focal_length_y = intr.params[1] * height / h_o
            FovY = focal2fov(focal_length_y, height)
            FovX = focal2fov(focal_length_x, width)
        elif intr.model == "SIMPLE_RADIAL":
            focal_length_x = intr.params[0] * width / w_o
            focal_length_y = intr.params[0] * height / h_o
            FovY = focal2fov(focal_length_y, height)
            FovX = focal2fov(focal_length_x, width)
        else:
            assert False, "Colmap camera model not handled: only undistorted datasets (PINHOLE or SIMPLE_PINHOLE cameras) supported!"

        image_path = os.path.join(images_folder, os.path.basename(extr.name))
        image_name = os.path.basename(image_path).split(".")[0]
        image = Image.open(image_path)
        image = (np.array(image)).astype(np.float32)
        # cv2.resize expects (width, height); size is stored as (height, width)
        image = cv2.resize(image, size[::-1], interpolation=cv2.INTER_AREA)
        image = Image.fromarray(image.astype(np.uint8))
        cam_info = CameraInfo(uid=uid, R=R, T=T, FovY=FovY, FovX=FovX, image=image,
                              image_path=image_path, image_name=image_name, width=width, height=height)
        cam_infos.append(cam_info)
    sys.stdout.write('\n')

    return cam_infos
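
For context, a minimal sketch of how this loader might be invoked, assuming the standard 3DGS COLMAP readers read_extrinsics_binary / read_intrinsics_binary from scene/colmap_loader.py and a sparse/0 COLMAP layout (the source path and scale_factor here are placeholders):

import os

from scene.colmap_loader import read_extrinsics_binary, read_intrinsics_binary

source_path = "data/my_scene"  # placeholder path to the scene
cam_extrinsics = read_extrinsics_binary(os.path.join(source_path, "sparse/0", "images.bin"))
cam_intrinsics = read_intrinsics_binary(os.path.join(source_path, "sparse/0", "cameras.bin"))

cam_infos = readColmapCameras(cam_extrinsics, cam_intrinsics, path=source_path,
                              images_folder=os.path.join(source_path, "images"),
                              size=(960, 640), scale_factor=100)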

When I set the scale_factor to 1 while using MVSGaussian for both inference and per-scene optimization, the optimization process works correctly. However, when I set the scale_factor to other values, such as 100, MVSGaussian achieves higher PSNR in the rendered images during inference but fails to optimize correctly during per-scene optimization.

TQTQliu commented 2 months ago
  1. For scenes with a depth range of 2 to 4, increasing the scale_factor from 1 to 100 makes sense, as the adjusted depth gradually approaches the depth range of DTU.
  2. According to your feedback, "when scale_factor is set to 100, generalizable inference is normal but per-scene optimization is not correct; when scale_factor is set to 1, per-scene optimization returns to normal". This phenomenon is strange. Please check: i) whether the initialized point cloud is the output of the generalizable model run with scale_factor 100; ii) during both generalizable inference and per-scene optimization, print the value of T to check whether it has been multiplied by 100. A small sketch of such a consistency check follows.
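
A minimal sketch of such a check (not part of the released code; plyfile, the file path, and reusing cam_infos from readColmapCameras above are assumptions): load the initialization point cloud, transform it into one camera's frame with the same R and T used for per-scene optimization, and confirm that the depths land in a plausible range:

import numpy as np
from plyfile import PlyData

def point_depths(xyz, R, T):
    # 3DGS stores R as the transpose of the world-to-camera rotation, so for
    # row vectors the world-to-camera transform is xyz @ R + T; depth is the z component.
    return (xyz @ R + T)[:, 2]

# Placeholder path to the point cloud produced by the generalizable model.
ply = PlyData.read("output/my_scene/init_points.ply")
xyz = np.stack([ply["vertex"]["x"], ply["vertex"]["y"], ply["vertex"]["z"]], axis=-1)

# cam_infos comes from readColmapCameras above; T is already multiplied by scale_factor.
cam = cam_infos[0]
depths = point_depths(xyz, cam.R, cam.T)
print("|T| =", np.linalg.norm(cam.T))
print("depth range:", depths.min(), depths.max())
# With a consistent scale_factor, both |T| and the depths should be roughly
# scale_factor times their scale_factor=1 values, and the depths should land
# near the DTU training range (about 425 to 905).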