Hi @Friedrich-M, thanks for your interest in our work.
Yes, the pre-trained model can be applied to an arbitrary number of sparse input views. Below is a simple example for testing the 3-view case.
First, create an index file at `assets/evaluation_index_re10k_3views.json`:

```json
{"5aca87f95a9412c6": {"context": [58, 102, 133], "target": [84, 129]}, "322261824c4a3003": {"context": [33, 60, 78], "target": [38, 61]}}
```
Then run the evaluation:

```bash
python -m src.main +experiment=re10k \
    checkpointing.load=checkpoints/re10k.ckpt \
    mode=test \
    dataset/view_sampler=evaluation \
    test.compute_scores=true \
    dataset.view_sampler.index_path=assets/evaluation_index_re10k_3views.json \
    wandb.name=abl/re10k_3views \
    dataset.view_sampler.num_context_views=3
```
Notice that the 'context' field in the JSON file now contains 3 views, and that I set `dataset.view_sampler.num_context_views=3` when running the model. The outputs will be stored under `outputs/test/abl/re10k_3views`. Following these two steps, you should be able to evaluate other numbers of input views and/or on other datasets.
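For instance, a hypothetical 4-view entry (the scene ID and frame indices below are made up purely for illustration) would look like:

```json
{"<scene_id>": {"context": [10, 40, 70, 100], "target": [25, 55, 85]}}
```

paired with `dataset.view_sampler.num_context_views=4` on the command line.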
If you want to train with $N$ input views, consider changing https://github.com/donydchen/mvsplat/blob/bcab8af97d1640e1581fdbf3cf4fd8d530395b68/src/dataset/view_sampler/view_sampler_bounded.py#L111 to return $N$ context views, and set `dataset.view_sampler.num_context_views` to $N$.
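I have not reproduced the exact code at that line here, but as a minimal sketch of the idea (the function name and signature are hypothetical, and the actual sampler may be structured differently): one simple generalization keeps the two boundary views as context and spaces the remaining context views evenly between them.

```python
import torch

# Hypothetical sketch of an N-view bounded sampler; not the repository's code.
def sample_context_indices(
    index_left: int, index_right: int, num_context_views: int
) -> torch.Tensor:
    # linspace includes both endpoints, so the two boundary views stay in the
    # context set; the remaining views are spread evenly between them.
    return torch.linspace(index_left, index_right, num_context_views).round().long()

# Example: sample_context_indices(58, 133, 3) -> tensor([58, 96, 133])
```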
For reference, below is the visualization snippet (with imports added and indentation restored) that dumps each per-depth-candidate warped image to disk; sweeping through the saved warps is a practical way to sanity-check whether a chosen (near, far) range brackets the scene. It assumes it runs inside the cost-volume construction, where `intrinsics`, `pose_curr_lists`, `disp_candi_curr`, `near`, `far`, `extra_info`, `v` (number of views), and `b` (batch size) are already in scope.

```python
import os

from einops import rearrange
from PIL import Image

ori_images = rearrange(
    extra_info["images"], "(v b) c h w -> b v c h w", v=v, b=b
)
scene_names = extra_info["scene_names"]
# Scale the normalized intrinsics to pixel units at the full image resolution.
intr_curr_ori = intrinsics[:, :, :3, :3].clone().detach()  # [b, v, 3, 3]
intr_curr_ori[:, :, 0, :] *= float(ori_images.shape[-1])
intr_curr_ori[:, :, 1, :] *= float(ori_images.shape[-2])
intr_curr_ori = rearrange(
    intr_curr_ori, "b v ... -> (v b) ...", b=b, v=v
)  # [v*b, 3, 3]

init_view_order = list(range(v))
image01 = ori_images
for idx in range(1, v):
    # Cyclically shift the view order so every view is warped from
    # every other view across the v-1 iterations.
    cur_view_order = init_view_order[idx:] + init_view_order[:idx]
    cur_images10 = ori_images[:, cur_view_order]  # (b, v, c, h, w)
    image10 = rearrange(cur_images10, "b v c h w -> (v b) c h w")
    pose_curr = pose_curr_lists[idx - 1]
    # Warp the shifted views onto the reference views at each depth candidate.
    image01_warped = warp_with_pose_depth_candidates(
        image10,
        intr_curr_ori,
        pose_curr,
        1.0 / disp_candi_curr.repeat([1, 1, *image10.shape[-2:]]),
        warp_padding_mode=self.warp_padding_mode,
    )  # [v*b, C, D, H, W]
    image01_warped = rearrange(
        image01_warped, "(v b) ... -> b v ...", v=v, b=b
    )
    for batch_idx in range(b):
        out_dir = os.path.join(
            "warp_images",
            f"near_{near[0, 0].item():.1f}_far_{int(far[0, 0].item())}",
            (
                scene_names[batch_idx]
                if scene_names is not None
                else str(batch_idx)
            ),
        )
        os.makedirs(out_dir, exist_ok=True)
        for v_idx in range(v):
            # Save the original (unwarped) image for this view.
            Image.fromarray(
                (image01[batch_idx, v_idx] * 255)
                .byte()
                .permute(1, 2, 0)
                .detach()
                .cpu()
                .numpy()
            ).save(f"{out_dir}/{v_idx}ori.png")
            # Save one warped image per depth candidate.
            for d_idx in range(image01_warped.shape[3]):
                Image.fromarray(
                    (image01_warped[batch_idx, v_idx, :, d_idx] * 255)
                    .byte()
                    .permute(1, 2, 0)
                    .detach()
                    .cpu()
                    .numpy()
                ).save(
                    f"{out_dir}/{v_idx}warped_from{cur_view_order[v_idx]}_{d_idx}.png"
                )
```
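As a complementary check (a sketch of my own rather than code from the repository; `best_depth_candidate` is a hypothetical helper), you can also score the depth candidates numerically instead of eyeballing the PNGs: if the lowest-error candidates pile up at the first or last depth index, the chosen (near, far) range probably does not bracket the scene.

```python
import torch

def best_depth_candidate(
    image01: torch.Tensor, image01_warped: torch.Tensor
) -> torch.Tensor:
    """Return the index of the best-matching depth candidate per (batch, view).

    image01:        [b, v, C, H, W]    original reference images
    image01_warped: [b, v, C, D, H, W] warps at D depth candidates
    """
    # Mean absolute photometric error over channels and pixels -> [b, v, D].
    err = (image01_warped - image01.unsqueeze(3)).abs().mean(dim=(2, 4, 5))
    return err.argmin(dim=-1)  # [b, v]
```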
Thank you for your insightful reply! It is really helpful.
Thanks for your great contribution to this promising and interesting field.
I noticed that the paper's main experiments focus on two-view inputs, similar to PixelSplat. However, as you mentioned in the paper, the MVS-based method can naturally be applied to multiple views (>2). Can the current pre-trained model be directly extended to multi-view (>2) inputs?
Besides, the cost volume used in the paper needs the (near, far) planes for discrete depth sampling, so when extending to other datasets without ground-truth (near, far) as input, how should we deal with it? Also, since each view has a separate cost volume, when the views become denser and the resolution becomes larger, how should we handle the increased parameters and the need for cross-view information exchange?