ArtiKitten opened this issue 1 year ago
> From what I understood, the batch size is what's important in the reconstruction of this kind of scene.

Maybe that's not exactly the point; the recommendations made are only related to the hashgrid hyperparameters dict_size and dim. Your configuration could be based on the one shown in projects/neuralangelo/configs/tnt.yaml:
```yaml
_parent_: projects/neuralangelo/configs/base.yaml

model:
    object:
        sdf:
            mlp:
                inside_out: False  # True for Meetingroom.
            encoding:
                coarse2fine:
                    init_active_level: 8
    appear_embed:
        enabled: True
        dim: 8

data:
    type: projects.neuralangelo.data
    root: datasets/tanks_and_temples/Barn
    num_images: 410  # The number of training images.
    train:
        image_size: [835,1500]
        batch_size: 1
        subset:
    val:
        image_size: [300,540]
        batch_size: 1
        subset: 1
        max_viz_samples: 16
```
In your case, setting inside_out = True may also be helpful.
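For the Meetingroom scene specifically, the overrides on top of tnt.yaml would look roughly like this. This is just a sketch: the data path and image count below are placeholders for your own preprocessed data, not values taken from the repository.

```yaml
# Rough sketch of Meetingroom-specific overrides on top of tnt.yaml.
# The path and image count are placeholders for your own preprocessed data.
model:
    object:
        sdf:
            mlp:
                inside_out: True  # indoor (inside-out) capture such as Meetingroom
data:
    root: datasets/tanks_and_temples/Meetingroom  # placeholder path
    num_images: 371  # set this to your actual number of training images
```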
Take a look at this document for experimental details: Supplementary
The high batch size is what they mentioned using in the supplementary paper for the project, i.e. 16 for T&T. I trained Meetingroom on a 2x3090 setup over the whole weekend, for a total of 70 h (250,000 iterations), with the following settings:
dict_size=22
dim=8
batch_size=2
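For reference, this is roughly where those values live in the config. It's a sketch only: I'm assuming the hashgrid keys sit under model.object.sdf.encoding.hashgrid as in base.yaml, which isn't shown above.

```yaml
# Sketch of where the settings above map into the config
# (key paths assumed from base.yaml, not copied from my actual file).
model:
    object:
        sdf:
            encoding:
                hashgrid:
                    dict_size: 22  # as listed above
                    dim: 8
data:
    train:
        batch_size: 2
```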
Since I wasn't working during the weekend, I didn't notice that the model had stopped improving at around 100k iterations.
[wandb loss graphs: screenshots at 50k, 70k, 100k, and 250k iterations]
As you can see, there is essentially no difference after 100k.
Actually, I found what you mentioned in the "A. Additional Hyper-parameter" section:
> For the DTU benchmark, we follow prior work [14–16] and use a batch size of 1. For the Tanks and Temples dataset, we use a batch size of 16. We use the marching cubes algorithm [5] to convert predicted SDF to triangular meshes. The marching cubes resolution is set to 512 for the DTU benchmark following prior work [1, 14–16] and 2048 for the Tanks and Temples dataset.
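For what it's worth, that batch size maps onto the training config like this. Again just a sketch; note that the marching cubes resolution is a mesh-extraction parameter and does not appear in this training YAML.

```yaml
# Sketch: the Tanks and Temples batch size from the supplementary,
# expressed as a training-config value (DTU would use 1 instead).
# The marching cubes resolution (512 / 2048) belongs to mesh extraction,
# not to this training YAML.
data:
    train:
        batch_size: 16  # needs a lot of GPU memory, as you observed
```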
There are some differences after 100k iterations, but they are perhaps not very significant.
If I were in your position, I would keep the configuration you've already used and also incorporate the adjustments I mentioned earlier regarding the signed distance function (SDF), i.e. setting inside_out: True as sketched above.
Please keep me updated on your results.
Best regards, Lucas.
Hi, I'm currently working on reconstructing large indoor environments. From what I understood, the batch size is what's important in the reconstruction of this kind of scene.
I'm testing the setup with the Meetingroom scene from Tanks and Temples, downsampled by a factor of 30 (~370 images).
My first try is on a Quadro RTX 8000, so 48 GB of memory, and with the recommended config (dict_size=22, dim=8, batch_size=16) training won't even start. The only way training doesn't fail is by running it with a batch_size of 4, at about 1.25 it/s; reaching 500,000 iterations would quite literally take almost a week of training. It then crashed at iteration 10,000, the first checkpoint.
And for the configuration file,
That makes me wonder: how can we achieve the same results as shown in the paper for large indoor environments? Do you have an example of a config file to achieve this, and the corresponding GPUs?
Maybe I'm missing something and don't understand what I'm doing? Or maybe I just need $150k worth of GPUs?
Thanks for your help!
EDIT: I had the chance to test on 2x3090 GPUs and I still can't train with a batch_size of 16 or even 8.