donydchen / mvsplat

🌊 [ECCV'24 Oral] MVSplat: Efficient 3D Gaussian Splatting from Sparse Multi-View Images
https://donydchen.github.io/mvsplat
MIT License
750 stars 35 forks

less effective results by overfitting on re10k subset #18

Closed Yochengliu closed 5 months ago

Yochengliu commented 5 months ago

Great work. I tried to quickly validate the effectiveness of the network by "overfitting" on a small re10k subset, and the results fall short of my expectations. I wonder if I am missing some key points of your work. Below are the settings.

Dataset: re10k subset
Training platform: 4 GPUs (RTX 4090), batch_size=16 in total, 4 per GPU
Hyper-params: same as your newly released code; I didn't change anything.

Training command keys:
+experiment=re10k
data_loader.train.batch_size=4
checkpointing.every_n_train_steps=5000

Test command keys:
+experiment=re10k
checkpointing.load=outputs/2024-04-15/17-57-20/checkpoints/epoch_1499-step_15000.ckpt
mode=test
dataset/view_sampler=evaluation
test.compute_scores=true

The results:
Testing DataLoader 0: 93%|██████████████▊ | 38/41 [00:06<00:00, 6.03it/s]
psnr 21.53944028051276
ssim 0.7834970355033875
lpips 0.21417335558094477
encoder: 33 calls, avg. 0.0347950892014937 seconds per call
decoder: 99 calls, avg. 0.0010259873939282966 seconds per call

That is, after "overfitting" on the re10k subset for 1499 epochs / 15000 steps, the model only reaches a PSNR of 21.54 on this subset (the rendering visualizations are also not good), much lower than I expected. Generally, I would expect the model to reach PSNR ~30 after "overfitting on a small subset" for 100 epochs.

diff-gaussian-rasterization-modified is correctly built and installed. I have checked the test results of your released re10k model, which are consistent with Table 1 (PSNR = 26+).

I have also referred to issue #14, i.e., the test results are good after large-scale training.

Maybe your proposed model is simply not suitable for "overfitting" on a small subset? But why? If so, it seems counter-intuitive in this field. I would rather believe that I am missing some key points. Looking forward to your clarification. Thanks.

The training log is attached for your reference: 20240415_175717.log

Yochengliu commented 5 months ago

config settings: outputs/2024-04-15/17-57-20/.hydra/config.yaml

dataset:
  view_sampler:
    name: bounded
    num_target_views: 4
    num_context_views: 2
    min_distance_between_context_views: 45
    max_distance_between_context_views: 192
    min_distance_to_context_views: 0
    warm_up_steps: 150000
    initial_min_distance_between_context_views: 25
    initial_max_distance_between_context_views: 45
  name: re10k
  roots:
  - datasets/re10k
  make_baseline_1: false
  augment: true
  image_shape:
  - 256
  - 256
  background_color:
  - 0.0
  - 0.0
  - 0.0
  cameras_are_circular: false
  baseline_epsilon: 0.001
  max_fov: 100.0
  skip_bad_shape: true
  near: 1.0
  far: 100.0
  baseline_scale_bounds: false
  shuffle_val: true
  test_len: -1
  test_chunk_interval: 1
  overfit_to_scene: null
model:
  encoder:
    name: costvolume
    opacity_mapping:
      initial: 0.0
      final: 0.0
      warm_up: 1
    num_depth_candidates: 128
    num_surfaces: 1
    gaussians_per_pixel: 1
    gaussian_adapter:
      gaussian_scale_min: 0.5
      gaussian_scale_max: 15.0
      sh_degree: 4
    d_feature: 128
    visualizer:
      num_samples: 8
      min_resolution: 256
      export_ply: false
    unimatch_weights_path: checkpoints/gmdepth-scale1-resumeflowthings-scannet-5d9d7964.pth
    multiview_trans_attn_split: 2
    costvolume_unet_feat_dim: 128
    costvolume_unet_channel_mult:
    - 1
    - 1
    - 1
    costvolume_unet_attn_res:
    - 4
    depth_unet_feat_dim: 32
    depth_unet_attn_res:
    - 16
    depth_unet_channel_mult:
    - 1
    - 1
    - 1
    - 1
    - 1
    downscale_factor: 4
    shim_patch_size: 4
    wo_depth_refine: false
    wo_cost_volume: false
    wo_backbone_cross_attn: false
    wo_cost_volume_refine: false
    use_epipolar_trans: false
  decoder:
    name: splatting_cuda
loss:
  mse:
    weight: 1.0
  lpips:
    weight: 0.05
    apply_after_step: 0
wandb:
  project: mvsplat
  entity: placeholder
  name: re10k
  mode: disabled
  id: null
  tags:
  - re10k
  - 256x256
mode: train
data_loader:
  train:
    num_workers: 10
    persistent_workers: true
    batch_size: 4
    seed: 1234
  test:
    num_workers: 4
    persistent_workers: false
    batch_size: 1
    seed: 2345
  val:
    num_workers: 1
    persistent_workers: true
    batch_size: 1
    seed: 3456
optimizer:
  lr: 0.0002
  warm_up_steps: 2000
  cosine_lr: true
checkpointing:
  load: null
  every_n_train_steps: 5000
  save_top_k: -1
  pretrained_model: null
train:
  depth_mode: null
  extended_visualization: false
  print_log_every_n_steps: 1
test:
  output_path: outputs/test
  compute_scores: true
  eval_time_skip_steps: 5
  save_image: true
  save_video: false
seed: 111123
trainer:
  max_steps: 300001
  val_check_interval: 0.5
  gradient_clip_val: 0.5
  num_sanity_val_steps: 2
output_dir: null
donydchen commented 5 months ago

Hi @Yochengliu, thanks for your interest in our work.

I just skimmed through the training log you provided, and it looks fine to me. The training bug should have been correctly fixed since commit 297338f54d74e7beb4ca5e0700dee22090b836a4. I have triple-checked the training in our environment, which is also further confirmed by https://github.com/donydchen/mvsplat/issues/14.

I have not trained on the tiny subset, but the results you got look reasonable to me. MVSplat is a feed-forward approach: it needs to learn how to infer 3D Gaussian properties from 2D input images, which I believe cannot easily be achieved by learning from only a few scenes. This is not counter-intuitive. The model contains around 12 million parameters, which might be hard, if not impossible, to train properly with only ~100 scenes. In other words, the model is expected to produce worse results when trained on a much smaller training set. Note that the training and testing scenes in the subsets are completely different, though, so it seems odd to call this 'overfitting'.

Some suggestions that might be helpful for debugging:

- For 'overfitting', my understanding is that it should be trained and tested on the same small group of scenes.

Let me know if you encounter any other difficulties, I will try my best to provide some helpful information.

Yochengliu commented 5 months ago

@donydchen Sorry for the confusion; let me clarify the "overfitting" setting.

Training and testing are both carried out on the re10k subset; that is, it is the same as your understanding that "for 'overfitting' it should be trained and tested on the same small group of scenes".

More specifically, the test folder and the train folder both contain the subset above:

[Screenshot 2024-04-16 12:00:53]

So I think the results are unreasonable; some key points may be missing.

donydchen commented 5 months ago

If I remember correctly, the default 'train' and 'test' folders in the sub-dataset actually contain different scenes (although they both happen to contain 3 torch files...). You can quickly verify this by checking the 'index.json' inside both folders; their scene names should be different.

A better way to ensure the same group of scenes is used might be to rename the 'test' folder and soft-link (or copy) 'train' to 'test', then retrain and retest the model.
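For a quick check, something along these lines should work (a minimal sketch, assuming each split's index.json maps scene names to chunk file names, as in the re10k subset layout):

```python
# Minimal sketch: compare the scene names listed in the two splits' index.json
# files to see whether 'train' and 'test' actually share any scenes.
import json
from pathlib import Path

root = Path("datasets/re10k")
train_scenes = set(json.loads((root / "train" / "index.json").read_text()))
test_scenes = set(json.loads((root / "test" / "index.json").read_text()))

print(f"train: {len(train_scenes)} scenes, test: {len(test_scenes)} scenes")
print(f"shared: {len(train_scenes & test_scenes)} scenes")
# If 'shared' is 0, the 'overfitting' run never actually saw the test scenes.
```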

Yochengliu commented 5 months ago

@donydchen I guess that must be the key point on re10k. Retraining after copying the 'test' folder is now underway; results will be shared here in the next few hours.

Update: results for epoch_1363-step_15000.ckpt

Testing DataLoader 0: 93%|██████████████▊ | 38/41 [00:06<00:00, 6.07it/s]
psnr 23.438872462824772
ssim 0.8216209254766765
lpips 0.17692681441181585
encoder: 33 calls, avg. 0.0345275185324929 seconds per call
decoder: 99 calls, avg. 0.0010114390440661496 seconds per call

PSNR = 23.44 on the 'test' folder after overfitting on that same 'test' folder (I copied 'test' to 'train') for 1363 epochs. The results are indeed improved but still far from my expectation. What do you think?

If everything is right, we can conclude that the proposed model's generalization ability after large-scale training is far better than its overfitting ability: the generalization PSNR of your released model on the subset's 'test' folder is 26+, while the PSNR after very long overfitting on the same small 'test' folder is only 23+. This is a really interesting finding; maybe some valuable insight lies underneath :)

BTW, I have checked your reflect_augmentation. I don't think this simple augmentation is what affects overfitting so much, and 15000 overfitting steps should be enough in my opinion.


Actually, my earlier question, which predates the overfitting experiment on re10k, came from a DTU overfitting experiment last week. The DTU overfitting results were much worse, so I tried to overfit re10k strictly following your released code to find the missing key points.

I randomly selected 6 scenes from DTU and carried out overfitting training on this tiny set. The 'train' and 'test' folders are strictly the same, with torch files generated by your convert_dtu.py. Other settings are almost the same as for the re10k overfitting mentioned above. The overfitting PSNR is 16.47 after 4999 epochs / 10000 steps, and the rendering visualizations look like the attached images 000001, 000002, 000003.

I went through all the issues and found that you were cleaning up the DTU code two or three weeks ago; is it ready? I guess the hyper-parameters on DTU need to be tuned; I have already changed the near and far depths to 2.125 and 4.525. Any suggestions?

Training log on DTU: 20240412_192456.log

config.yaml on DTU:

dataset:
  view_sampler:
    name: bounded
    num_target_views: 4
    num_context_views: 2
    min_distance_between_context_views: 2
    max_distance_between_context_views: 10
    min_distance_to_context_views: 0
    warm_up_steps: 0
    initial_min_distance_between_context_views: 0
    initial_max_distance_between_context_views: 0
  name: dtu
  roots:
  - datasets/dtu
  make_baseline_1: false
  augment: false
  image_shape:
  - 256
  - 256
  background_color:
  - 0.0
  - 0.0
  - 0.0
  cameras_are_circular: false
  baseline_epsilon: 0.001
  max_fov: 100.0
  skip_bad_shape: true
  near: 2.125
  far: 4.525
  baseline_scale_bounds: false
  shuffle_val: true
  test_len: -1
  test_chunk_interval: 1
  overfit_to_scene: null
model:
  encoder:
    name: costvolume
    opacity_mapping:
      initial: 0.0
      final: 0.0
      warm_up: 1
    num_depth_candidates: 128
    num_surfaces: 1
    gaussians_per_pixel: 1
    gaussian_adapter:
      gaussian_scale_min: 0.5
      gaussian_scale_max: 15.0
      sh_degree: 4
    d_feature: 128
    visualizer:
      num_samples: 8
      min_resolution: 256
      export_ply: false
    unimatch_weights_path: checkpoints/gmdepth-scale1-resumeflowthings-scannet-5d9d7964.pth
    multiview_trans_attn_split: 2
    costvolume_unet_feat_dim: 128
    costvolume_unet_channel_mult:
    - 1
    - 1
    - 1
    costvolume_unet_attn_res:
    - 4
    depth_unet_feat_dim: 32
    depth_unet_attn_res:
    - 16
    depth_unet_channel_mult:
    - 1
    - 1
    - 1
    - 1
    - 1
    downscale_factor: 4
    shim_patch_size: 4
    wo_depth_refine: false
    wo_cost_volume: false
    wo_backbone_cross_attn: false
    wo_cost_volume_refine: false
    use_epipolar_trans: false
  decoder:
    name: splatting_cuda
loss:
  mse:
    weight: 1.0
  lpips:
    weight: 0.05
    apply_after_step: 0
wandb:
  project: mvsplat
  entity: placeholder
  name: dtu
  mode: disabled
  id: null
  tags:
  - dtu
  - 1024x1024
mode: train
data_loader:
  train:
    num_workers: 10
    persistent_workers: true
    batch_size: 4
    seed: 1234
  test:
    num_workers: 4
    persistent_workers: false
    batch_size: 1
    seed: 2345
  val:
    num_workers: 1
    persistent_workers: true
    batch_size: 1
    seed: 3456
optimizer:
  lr: 0.0005
  warm_up_steps: 1000
  cosine_lr: true
checkpointing:
  load: null
  every_n_train_steps: 2000
  save_top_k: -1
  pretrained_model: null
train:
  depth_mode: null
  extended_visualization: false
  print_log_every_n_steps: 1
test:
  output_path: outputs/test
  compute_scores: true
  eval_time_skip_steps: 5
  save_image: true
  save_video: false
seed: 111123
trainer:
  max_steps: 10000
  val_check_interval: 0.5
  gradient_clip_val: 0.5
  num_sanity_val_steps: 2
output_dir: null
donydchen commented 5 months ago

Hi @Yochengliu, interesting findings. We mainly spent our efforts on learning from large-scale datasets and did not experiment on tiny datasets. You are more than welcome to build upon our MVSplat and find some better solutions to excel on smaller-scale datasets. I believe it will also be very helpful for the community.

I will also try to run some related experiments (if I have time). By the way, one more thing you might try is tuning the frame distance update strategy. For example, you can set dataset.view_sampler.warm_up_steps=7500 (the default is 150000), which makes the sampler reach the largest context-view frame distances sooner; those distances can be rather large in the test scenario. You can also consider correcting the distance bug following pixelSplat's update here. We kept the bug untouched (as noted here) to maintain a fair comparison with pixelSplat's reported scores.
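As a rough illustration of what that setting changes, the bounded sampler presumably ramps the allowed context-view frame distance from its initial bounds to its final bounds over warm_up_steps (a paraphrased sketch using the re10k defaults above, not the exact view_sampler_bounded.py code):

```python
# Paraphrased sketch of the context-view frame-distance warm-up: the allowed
# gap between the two context views grows linearly from the "initial_*" bounds
# to the final bounds over `warm_up_steps` training steps.
def context_gap_bounds(step: int,
                       warm_up_steps: int = 150_000,
                       initial_min: int = 25, final_min: int = 45,
                       initial_max: int = 45, final_max: int = 192) -> tuple[int, int]:
    t = min(step / warm_up_steps, 1.0)  # warm-up fraction, clamped to 1
    lo = round(initial_min + t * (final_min - initial_min))
    hi = round(initial_max + t * (final_max - initial_max))
    return lo, hi

# With the default warm_up_steps=150000, the sampler only reaches the widest
# (45, 192) bounds near the end of a 300k-step run; warm_up_steps=7500 gets
# there after 7.5k steps, much closer to the view spacing used at test time.
print(context_gap_bounds(7_500, warm_up_steps=150_000))  # -> (26, 52)
print(context_gap_bounds(7_500, warm_up_steps=7_500))    # -> (45, 192)
```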


I just updated the code and instructions for DTU at commit b999e4b7f0a94387960d8ec752abe6bb060da6d9. Sorry for missing my earlier promise; I've been swamped with another project recently. Note that we only use DTU for cross-dataset testing, though, and have not run any training on it.

One suggestion for training on DTU is that you might consider building a different view sampler. For training on RE10K, the view sampling strategy (as in view_sampler_bounded.py) is to randomly select 4 target novel views between the 2 input context views. This is reasonable because the RE10K data is extracted from videos, so sampling this way is likely to ensure that the target views are bounded by the context views. The same sampler cannot guarantee this property on DTU, and training will become unstable if a large portion of the target views are not visible in the input views. Perhaps a better way would be to select the views nearest to the target views as the input context views, similar to what has been done in our MuRF or MatchNeRF.
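For illustration only, a hypothetical nearest-context sampler along those lines (not part of this repo; it assumes per-view cam-to-world extrinsics) might look roughly like this:

```python
# Hypothetical sketch of the nearest-context idea: for a sampled target view,
# choose the cameras whose centers are closest to it as the input context
# views, so the target is more likely to be covered by the inputs.
import torch

def sample_nearest_context(extrinsics: torch.Tensor,  # (V, 4, 4) cam-to-world
                           target_idx: int,
                           num_context: int = 2) -> list[int]:
    centers = extrinsics[:, :3, 3]                        # camera centers, (V, 3)
    dists = (centers - centers[target_idx]).norm(dim=-1)  # distance to target camera
    dists[target_idx] = float("inf")                      # never pick the target itself
    return torch.topk(dists, num_context, largest=False).indices.tolist()

# Toy example: 6 cameras spaced along a line; the 2 nearest to view 3 are 2 and 4.
if __name__ == "__main__":
    poses = torch.eye(4).repeat(6, 1, 1)
    poses[:, 0, 3] = torch.arange(6, dtype=torch.float32)
    print(sample_nearest_context(poses, target_idx=3))    # e.g. [2, 4]
```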

Yochengliu commented 5 months ago

@donydchen Thanks for your quick response and good suggestions. I have some new conclusions below.

The model cannot overfit well on a tiny set, which is somewhat reasonable, because the setting does not strictly meet the bar of real overfitting. To be specific, the test views (context and target) in your assets/evaluation_index_re10k.json are possibly never seen in exactly that combination during training. So my setting actually requires the model to have some 'novel view synthesis' ability on the tiny-set scenes after 'overfitting' on those scenes. This is beyond the scope of the proposed work, which focuses on novel-scene generalization after large-scale cross-scene training.
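To make this concrete, here is a minimal sketch (assuming assets/evaluation_index_re10k.json maps each scene to its fixed evaluation frame indices, with null entries for skipped scenes) to inspect the fixed test view pairs:

```python
# Minimal sketch: print the fixed context/target frame indices the test
# protocol uses. During 'overfitting' training, view pairs are sampled
# randomly, so these exact pairs may never be seen together.
import json

with open("assets/evaluation_index_re10k.json") as f:
    eval_index = json.load(f)

for scene, views in list(eval_index.items())[:3]:
    if views is None:  # scenes marked as skipped in the index
        continue
    print(scene, views)
```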

Of course, we would in fact expect the proposed model to do well in both novel-scene generalization and novel view synthesis (e.g., training on a single scene and getting good novel-view rendering results for that scene). Only in this way, I think, would the proposed pipeline be practical and have a bigger impact on the community, because then you could expect good results after training with only a few scenes (e.g., 30, 50, 100).

I am closing the issue; feel free to reopen it if anyone has further insights or discussions.

Looking forward to seeing your next great projects :)