NVlabs / neuralangelo

Official implementation of "Neuralangelo: High-Fidelity Neural Surface Reconstruction" (CVPR 2023)
https://research.nvidia.com/labs/dir/neuralangelo/

Colab demo doesn't work #210

Open Rajat-Vishwa opened 2 months ago

Rajat-Vishwa commented 2 months ago

Running the Colab example initially fails with #205 (COLMAP fails to execute).

#205 is solved by adding the following before installing COLMAP:

!wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-ubuntu2204.pin
!mv cuda-ubuntu2204.pin /etc/apt/preferences.d/cuda-repository-pin-600
!wget https://developer.download.nvidia.com/compute/cuda/11.7.0/local_installers/cuda-repo-ubuntu2204-11-7-local_11.7.0-515.43.04-1_amd64.deb
!dpkg -i cuda-repo-ubuntu2204-11-7-local_11.7.0-515.43.04-1_amd64.deb
!cp /var/cuda-repo-ubuntu2204-11-7-local/cuda-*-keyring.gpg /usr/share/keyrings/
!apt-get update
!apt-get -y install cuda-11-7
!update-alternatives --set cuda /usr/local/cuda-11.7
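After an install like the one above, it can help to confirm which toolkit release the notebook actually picks up. The following is a minimal sketch, assuming `nvcc` is on `PATH` and prints its usual `release X.Y` line; the helper names `parse_nvcc_release` and `active_toolkit_release` are hypothetical and not part of the Neuralangelo repository, and the sample string is illustrative rather than taken from this issue.

```python
import re
import shutil
import subprocess


def parse_nvcc_release(output):
    """Return (major, minor) from `nvcc --version` text, or None if not found."""
    match = re.search(r"release (\d+)\.(\d+)", output)
    return (int(match.group(1)), int(match.group(2))) if match else None


def active_toolkit_release():
    """Run the nvcc found on PATH and report its release; None if nvcc is missing."""
    nvcc = shutil.which("nvcc")
    if nvcc is None:
        return None
    result = subprocess.run([nvcc, "--version"], capture_output=True, text=True)
    return parse_nvcc_release(result.stdout)


# Illustrative sample of the line nvcc prints (an assumption, not from this thread).
sample = "Cuda compilation tools, release 11.7, V11.7.64"
print(parse_nvcc_release(sample))
```

In a Colab cell, `print(active_toolkit_release())` would then show the release of whichever `nvcc` is first on `PATH`, which is not necessarily the one under `/usr/local/cuda`.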

This fixes COLMAP, and preprocessing now runs to completion, but the training step then throws an error:

# @title { vertical-output: true }
%cd /content/neuralangelo
GROUP = "test_exp"
NAME = "lego"
!torchrun --nproc_per_node=1 train.py \
    --logdir=logs/{GROUP}/{NAME} \
    --show_pbar \
    --config=projects/neuralangelo/configs/custom/lego.yaml \
    --data.readjust.scale=0.5 \
    --max_iter=20000 \
    --validation_iter=99999999 \
    --model.object.sdf.encoding.coarse2fine.step=200 \
    --model.object.sdf.encoding.hashgrid.dict_size=19 \
    --optim.sched.warm_up_end=200 \
    --optim.sched.two_steps=[12000,16000]

Error:

[W829 13:19:57.835930806 Utils.hpp:135] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function operator())
Training with 1 GPUs.
Using random seed 0
Make folder logs/test_exp/lego
* checkpoint:
   * save_epoch: 9999999999
   * save_iter: 20000
   * save_latest_iter: 9999999999
   * save_period: 9999999999
   * strict_resume: True
* cudnn:
   * benchmark: True
   * deterministic: False
* data:
   * name: dummy
   * num_images: None
   * num_workers: 4
   * preload: True
   * readjust:
      * center: [0.0, 0.0, 0.0]
      * scale: 0.5
   * root: datasets/lego_ds2
   * train:
      * batch_size: 2
      * image_size: [801, 801]
      * subset: None
   * type: projects.neuralangelo.data
   * use_multi_epoch_loader: True
   * val:
      * batch_size: 2
      * image_size: [300, 300]
      * max_viz_samples: 16
      * subset: 4
* image_save_iter: 9999999999
* inference_args:
* local_rank: 0
* logdir: logs/test_exp/lego
* logging_iter: 9999999999999
* max_epoch: 9999999999
* max_iter: 20000
* metrics_epoch: None
* metrics_iter: None
* model:
   * appear_embed:
      * dim: 8
      * enabled: False
   * background:
      * enabled: True
      * encoding:
         * levels: 10
         * type: fourier
      * encoding_view:
         * levels: 3
         * type: spherical
      * mlp:
         * activ: relu
         * activ_density: softplus
         * activ_density_params:
         * activ_params:
         * hidden_dim: 256
         * hidden_dim_rgb: 128
         * num_layers: 8
         * num_layers_rgb: 2
         * skip: [4]
         * skip_rgb: []
      * view_dep: True
      * white: False
   * object:
      * rgb:
         * encoding_view:
            * levels: 3
            * type: spherical
         * mlp:
            * activ: relu_
            * activ_params:
            * hidden_dim: 256
            * num_layers: 4
            * skip: []
            * weight_norm: True
         * mode: idr
      * s_var:
         * anneal_end: 0.1
         * init_val: 3.0
      * sdf:
         * encoding:
            * coarse2fine:
               * enabled: True
               * init_active_level: 4
               * step: 200
            * hashgrid:
               * dict_size: 19
               * dim: 8
               * max_logres: 11
               * min_logres: 5
               * range: [-2, 2]
            * levels: 16
            * type: hashgrid
         * gradient:
            * mode: numerical
            * taps: 4
         * mlp:
            * activ: softplus
            * activ_params:
               * beta: 100
            * geometric_init: True
            * hidden_dim: 256
            * inside_out: False
            * num_layers: 1
            * out_bias: 0.5
            * skip: []
            * weight_norm: True
   * render:
      * num_sample_hierarchy: 4
      * num_samples:
         * background: 32
         * coarse: 64
         * fine: 16
      * rand_rays: 512
      * stratified: True
   * type: projects.neuralangelo.model
* nvtx_profile: False
* optim:
   * fused_opt: False
   * params:
      * lr: 0.001
      * weight_decay: 0.01
   * sched:
      * gamma: 10.0
      * iteration_mode: True
      * step_size: 9999999999
      * two_steps: [12000, 16000]
      * type: two_steps_with_warmup
      * warm_up_end: 200
   * type: AdamW
* pretrained_weight: None
* source_filename: projects/neuralangelo/configs/custom/lego.yaml
* speed_benchmark: False
* test_data:
   * name: dummy
   * num_workers: 0
   * test:
      * batch_size: 1
      * is_lmdb: False
      * roots: None
   * type: imaginaire.datasets.images
* timeout_period: 9999999
* trainer:
   * amp_config:
      * backoff_factor: 0.5
      * enabled: False
      * growth_factor: 2.0
      * growth_interval: 2000
      * init_scale: 65536.0
   * ddp_config:
      * find_unused_parameters: False
      * static_graph: True
   * depth_vis_scale: 0.5
   * ema_config:
      * beta: 0.9999
      * enabled: False
      * load_ema_checkpoint: False
      * start_iteration: 0
   * grad_accum_iter: 1
   * image_to_tensorboard: False
   * init:
      * gain: None
      * type: none
   * loss_weight:
      * curvature: 0.0005
      * eikonal: 0.1
      * render: 1.0
   * type: projects.neuralangelo.trainer
* validation_iter: 99999999
* wandb_image_iter: 10000
* wandb_scalar_iter: 100
cudnn benchmark: True
cudnn deterministic: False
Setup trainer.
Using random seed 0
[rank0]: Traceback (most recent call last):
[rank0]:   File "/content/neuralangelo/train.py", line 104, in <module>
[rank0]:     main()
[rank0]:   File "/content/neuralangelo/train.py", line 79, in main
[rank0]:     trainer = get_trainer(cfg, is_inference=False, seed=args.seed)
[rank0]:   File "/content/neuralangelo/imaginaire/trainers/utils/get_trainer.py", line 32, in get_trainer
[rank0]:     trainer = trainer_lib.Trainer(cfg, is_inference=is_inference, seed=seed)
[rank0]:   File "/content/neuralangelo/projects/neuralangelo/trainer.py", line 26, in __init__
[rank0]:     super().__init__(cfg, is_inference=is_inference, seed=seed)
[rank0]:   File "/content/neuralangelo/projects/nerf/trainers/base.py", line 28, in __init__
[rank0]:     super().__init__(cfg, is_inference=is_inference, seed=seed)
[rank0]:   File "/content/neuralangelo/imaginaire/trainers/base.py", line 50, in __init__
[rank0]:     self.model = self.setup_model(cfg, seed=seed)
[rank0]:   File "/content/neuralangelo/imaginaire/trainers/base.py", line 116, in setup_model
[rank0]:     lib_model = importlib.import_module(cfg.model.type)
[rank0]:   File "/usr/lib/python3.10/importlib/__init__.py", line 126, in import_module
[rank0]:     return _bootstrap._gcd_import(name[level:], package, level)
[rank0]:   File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
[rank0]:   File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
[rank0]:   File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
[rank0]:   File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
[rank0]:   File "<frozen importlib._bootstrap_external>", line 883, in exec_module
[rank0]:   File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
[rank0]:   File "/content/neuralangelo/projects/neuralangelo/model.py", line 21, in <module>
[rank0]:     from projects.neuralangelo.utils.modules import NeuralSDF, NeuralRGB, BackgroundNeRF
[rank0]:   File "/content/neuralangelo/projects/neuralangelo/utils/modules.py", line 16, in <module>
[rank0]:     import tinycudann as tcnn
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/tinycudann/__init__.py", line 9, in <module>
[rank0]:     from tinycudann.modules import free_temporary_memory, NetworkWithInputEncoding, Network, Encoding
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/tinycudann/modules.py", line 51, in <module>
[rank0]:     _C = importlib.import_module(f"tinycudann_bindings._{cc}_C")
[rank0]:   File "/usr/lib/python3.10/importlib/__init__.py", line 126, in import_module
[rank0]:     return _bootstrap._gcd_import(name[level:], package, level)
[rank0]: ImportError: /usr/local/lib/python3.10/dist-packages/tinycudann_bindings/_75_C.cpython-310-x86_64-linux-gnu.so: undefined symbol: _ZN2at4_ops5zeros4callEN3c108ArrayRefINS2_6SymIntEEENS2_8optionalINS2_10ScalarTypeEEENS6_INS2_6LayoutEEENS6_INS2_6DeviceEEENS6_IbEE
E0829 13:20:04.141000 139155706921600 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 31457) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-08-29_13:20:04
  host      : a8e8c22c1e57
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 31457)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
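The line that matters in the log above is the final `ImportError` from `tinycudann_bindings`. A missing symbol in the `at` namespace often points to a compiled extension that was built against a different PyTorch build than the one being imported, though nothing in this thread confirms that is the cause here. The sketch below only pulls the useful pieces out of such a message: the library path, the compute-capability tag from the `_{cc}_C` module name, and the leading name components of the mangled symbol. The function names are hypothetical, and the name decoder is a deliberately simplified reading of Itanium-style mangling that handles plain length-prefixed names only (`c++filt` does the full job).

```python
import re


def parse_undefined_symbol(message):
    """Split an 'undefined symbol' error message into (library path, symbol)."""
    match = re.search(r"(\S+\.so): undefined symbol: (\S+)", message)
    return (match.group(1), match.group(2)) if match else None


def compute_capability_tag(library_path):
    """Extract the digits from a '_<cc>_C.' module name, e.g. '75' from '_75_C.'."""
    match = re.search(r"_(\d+)_C\.", library_path)
    return match.group(1) if match else None


def nested_name_parts(symbol):
    """Decode the leading length-prefixed components of a '_ZN...' mangled name.

    Simplified: stops at the first character that is not a length digit, so
    templates, substitutions and qualifiers are not handled.
    """
    if not symbol.startswith("_ZN"):
        return []
    pos, parts = 3, []
    while pos < len(symbol) and symbol[pos].isdigit():
        end = pos
        while end < len(symbol) and symbol[end].isdigit():
            end += 1
        length = int(symbol[pos:end])
        parts.append(symbol[end:end + length])
        pos = end + length
    return parts


# Message copied from the traceback in this issue.
message = (
    "/usr/local/lib/python3.10/dist-packages/tinycudann_bindings/"
    "_75_C.cpython-310-x86_64-linux-gnu.so: undefined symbol: "
    "_ZN2at4_ops5zeros4callEN3c108ArrayRefINS2_6SymIntEEENS2_8optionalINS2_"
    "10ScalarTypeEEENS6_INS2_6LayoutEEENS6_INS2_6DeviceEEENS6_IbEE"
)
path, symbol = parse_undefined_symbol(message)
print(compute_capability_tag(path))
print("::".join(nested_name_parts(symbol)))
```

Comparing the decoded name with the output of `python -c "import torch; print(torch.__version__)"` is one way to start checking whether the installed PyTorch and the prebuilt binding agree.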

I tried switching the CUDA version using:

!sudo update-alternatives --config cuda 
There are 3 choices for the alternative cuda (providing /usr/local/cuda).

  Selection    Path                  Priority   Status
------------------------------------------------------------
  0            /usr/local/cuda-12.2   122       auto mode
  1            /usr/local/cuda-11.7   117       manual mode
* 2            /usr/local/cuda-11.8   118       manual mode
  3            /usr/local/cuda-12.2   122       manual mode

Press <enter> to keep the current choice[*], or type selection number: 2

But it still doesn't work.
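Note that `update-alternatives` only changes which toolkit `/usr/local/cuda` points to; it does not rebuild binaries that are already installed, such as a compiled Python extension. To record the active choice in a script instead of reading it off the interactive prompt, the table printed above can be parsed. This is a minimal sketch that assumes the column layout shown in this issue; `parse_alternatives_table` is a hypothetical helper.

```python
def parse_alternatives_table(text):
    """Parse the table printed by `update-alternatives --config <name>`.

    Returns one dict per data row; the row marked with '*' has current=True.
    """
    rows = []
    for line in text.splitlines():
        current = line.startswith("*")
        fields = line.lstrip("*").split()
        # Data rows look like: <selection> <path> <priority> <status words...>
        if len(fields) >= 4 and fields[0].isdigit() and fields[1].startswith("/"):
            rows.append({
                "selection": int(fields[0]),
                "path": fields[1],
                "priority": int(fields[2]),
                "status": " ".join(fields[3:]),
                "current": current,
            })
    return rows


# Table copied from the output in this issue.
table = """\
  Selection    Path                  Priority   Status
------------------------------------------------------------
  0            /usr/local/cuda-12.2   122       auto mode
  1            /usr/local/cuda-11.7   117       manual mode
* 2            /usr/local/cuda-11.8   118       manual mode
  3            /usr/local/cuda-12.2   122       manual mode
"""
active = [row for row in parse_alternatives_table(table) if row["current"]]
print(active[0]["path"])
```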

amrzv commented 2 months ago

The Colab notebook is not accessible; it requires access permission.

Rajat-Vishwa commented 2 months ago

The Colab notebook is not accessible; it requires access permission.

Seems like they took it down. You can find a copy of the original demo notebook here.