Hi @Friedrich-M, thanks for your interest in our work.
Yes, the pre-trained model can be applied to an arbitrary number of sparse input views. Below is a simple example for testing the 3-view case.
First, create an index file at `assets/evaluation_index_re10k_3views.json`:

```json
{"5aca87f95a9412c6": {"context": [58, 102, 133], "target": [84, 129]}, "322261824c4a3003": {"context": [33, 60, 78], "target": [38, 61]}}
```
Then run the evaluation:

```bash
python -m src.main +experiment=re10k \
    checkpointing.load=checkpoints/re10k.ckpt \
    mode=test \
    dataset/view_sampler=evaluation \
    test.compute_scores=true \
    dataset.view_sampler.index_path=assets/evaluation_index_re10k_3views.json \
    wandb.name=abl/re10k_3views \
    dataset.view_sampler.num_context_views=3
```
Notice that the 'context' field in the JSON file now contains 3 views, and that I set `dataset.view_sampler.num_context_views=3` when running the model. The outputs will be stored under `outputs/test/abl/re10k_3views`. Following these two steps, you should be able to evaluate other numbers of input views and/or on other datasets.
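For instance, a hypothetical 4-view entry (the scene ID and frame indices below are made up purely for illustration) would look like:

```json
{"<scene_id>": {"context": [10, 40, 70, 100], "target": [25, 55, 85]}}
```

paired with `dataset.view_sampler.num_context_views=4` on the command line.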
If you want to train with $N$ input views, consider changing https://github.com/donydchen/mvsplat/blob/bcab8af97d1640e1581fdbf3cf4fd8d530395b68/src/dataset/view_sampler/view_sampler_bounded.py#L111 to return $N$ context views, and set `dataset.view_sampler.num_context_views` to $N$.
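I have not reproduced the exact code at that line here, but as a minimal sketch of the idea (the function name and signature are hypothetical, and the actual sampler may be structured differently): one simple generalization keeps the two boundary views as context and spaces the remaining context views evenly between them.

```python
import torch

# Hypothetical sketch of an N-view bounded sampler; not the repository's code.
def sample_context_indices(
    index_left: int, index_right: int, num_context_views: int
) -> torch.Tensor:
    # linspace includes both endpoints, so the two boundary views stay in the
    # context set; the remaining views are spread evenly between them.
    return torch.linspace(index_left, index_right, num_context_views).round().long()

# Example: sample_context_indices(58, 133, 3) -> tensor([58, 96, 133])
```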
For reference, below is the visualization snippet (with imports added and indentation restored) that dumps each per-depth-candidate warped image to disk; sweeping through the saved warps is a practical way to sanity-check whether a chosen (near, far) range brackets the scene. It assumes it runs inside the cost-volume construction, where `intrinsics`, `pose_curr_lists`, `disp_candi_curr`, `near`, `far`, `extra_info`, `v` (number of views), and `b` (batch size) are already in scope.

```python
import os

from einops import rearrange
from PIL import Image

ori_images = rearrange(
    extra_info["images"], "(v b) c h w -> b v c h w", v=v, b=b
)
scene_names = extra_info["scene_names"]
# Scale the normalized intrinsics to pixel units at the full image resolution.
intr_curr_ori = intrinsics[:, :, :3, :3].clone().detach()  # [b, v, 3, 3]
intr_curr_ori[:, :, 0, :] *= float(ori_images.shape[-1])
intr_curr_ori[:, :, 1, :] *= float(ori_images.shape[-2])
intr_curr_ori = rearrange(
    intr_curr_ori, "b v ... -> (v b) ...", b=b, v=v
)  # [v*b, 3, 3]

init_view_order = list(range(v))
image01 = ori_images
for idx in range(1, v):
    # Cyclically shift the view order so every view is warped from
    # every other view across the v-1 iterations.
    cur_view_order = init_view_order[idx:] + init_view_order[:idx]
    cur_images10 = ori_images[:, cur_view_order]  # (b, v, c, h, w)
    image10 = rearrange(cur_images10, "b v c h w -> (v b) c h w")
    pose_curr = pose_curr_lists[idx - 1]
    # Warp the shifted views onto the reference views at each depth candidate.
    image01_warped = warp_with_pose_depth_candidates(
        image10,
        intr_curr_ori,
        pose_curr,
        1.0 / disp_candi_curr.repeat([1, 1, *image10.shape[-2:]]),
        warp_padding_mode=self.warp_padding_mode,
    )  # [v*b, C, D, H, W]
    image01_warped = rearrange(
        image01_warped, "(v b) ... -> b v ...", v=v, b=b
    )
    for batch_idx in range(b):
        out_dir = os.path.join(
            "warp_images",
            f"near_{near[0, 0].item():.1f}_far_{int(far[0, 0].item())}",
            (
                scene_names[batch_idx]
                if scene_names is not None
                else str(batch_idx)
            ),
        )
        os.makedirs(out_dir, exist_ok=True)
        for v_idx in range(v):
            # Save the original (unwarped) image for this view.
            Image.fromarray(
                (image01[batch_idx, v_idx] * 255)
                .byte()
                .permute(1, 2, 0)
                .detach()
                .cpu()
                .numpy()
            ).save(f"{out_dir}/{v_idx}ori.png")
            # Save one warped image per depth candidate.
            for d_idx in range(image01_warped.shape[3]):
                Image.fromarray(
                    (image01_warped[batch_idx, v_idx, :, d_idx] * 255)
                    .byte()
                    .permute(1, 2, 0)
                    .detach()
                    .cpu()
                    .numpy()
                ).save(
                    f"{out_dir}/{v_idx}warped_from{cur_view_order[v_idx]}_{d_idx}.png"
                )
```
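As a complementary check (a sketch of my own rather than code from the repository; `best_depth_candidate` is a hypothetical helper), you can also score the depth candidates numerically instead of eyeballing the PNGs: if the lowest-error candidates pile up at the first or last depth index, the chosen (near, far) range probably does not bracket the scene.

```python
import torch

def best_depth_candidate(
    image01: torch.Tensor, image01_warped: torch.Tensor
) -> torch.Tensor:
    """Return the index of the best-matching depth candidate per (batch, view).

    image01:        [b, v, C, H, W]    original reference images
    image01_warped: [b, v, C, D, H, W] warps at D depth candidates
    """
    # Mean absolute photometric error over channels and pixels -> [b, v, D].
    err = (image01_warped - image01.unsqueeze(3)).abs().mean(dim=(2, 4, 5))
    return err.argmin(dim=-1)  # [b, v]
```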
Thank you for your insightful reply! It is really helpful.
Thanks for your great contribution to this promising and interesting field.
I noticed that the paper's main experiments focus on two-view inputs, similar to PixelSplat. However, as you mentioned in the paper, the MVS-based method can naturally be applied to multiple views (>2). Can the current pre-trained model be directly extended to multi-view (>2) inputs?
Besides, the cost volume used in the paper needs the (near, far) planes for discrete depth sampling, so when extending to other datasets without ground-truth (near, far) as input, how should we deal with it? Also, since each view has a separate cost volume, when the views become denser and the resolution becomes larger, how should we handle the increased parameters and the need for cross-view information exchange?