I'm trying to run Neuralangelo with the "lego" test set, but I can't get past the point where I invoke this command:

torchrun --nproc_per_node=${GPUS} train.py \
    --logdir=logs/${GROUP}/${NAME} \
    --config=${CONFIG} \
    --show_pbar

This command throws an error for which I haven't been able to find a solution. I've tried changing many of the parameters in the project's configuration files, but I still can't find a fix. Below is the full output, in case anyone has a solution.
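For reference, the shell variables in the command are set as follows. GROUP, NAME, and GPUS can be read back from the log itself ("Make folder logs/example_group/example_name", "Training with 1 GPUs."); the config path is specific to my setup, so I've left it as a placeholder here:

# Values matching the log output below.
GPUS=1
GROUP=example_group
NAME=example_name
CONFIG=<path/to/my/config.yaml>   # placeholder; actual path omitted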
Error:
torchrun --nproc_per_node=${GPUS} train.py --logdir=logs/${GROUP}/${NAME} --config=${CONFIG} --show_pbar
(Setting affinity with NVML failed, skipping...)
[W Utils.hpp:135] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function getCvarInt)
Training with 1 GPUs.
Using random seed 0
Make folder logs/example_group/example_name
wandb_scalar_iter: 100
cudnn benchmark: True
cudnn deterministic: False
Setup trainer.
Using random seed 0
/home/miguel12/miniconda3/envs/neuralangelo/lib/python3.9/site-packages/torch/nn/utils/weight_norm.py:28: UserWarning: torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.
warnings.warn("torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.")
model parameter count: 99,705,900
Initialize model weights using type: none, gain: None
Using random seed 0
[rank0]:[W Utils.hpp:108] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function getCvarString)
Allow TensorFloat32 operations on supported devices
Train dataset length: 100
Val dataset length: 4
Training from scratch.
Initialize wandb
[rank0]: Traceback (most recent call last):
[rank0]:   File "/mnt/d/Documents/neuralangelo/train.py", line 104, in <module>
[rank0]:   File "/mnt/d/Documents/neuralangelo/train.py", line 85, in main
[rank0]:   File "/mnt/d/Documents/neuralangelo/imaginaire/trainers/base.py", line 269, in init_wandb
[rank0]:   File "/home/miguel12/miniconda3/envs/neuralangelo/lib/python3.9/site-packages/wandb/sdk/wandb_watch.py", line 49, in watch
[rank0]:     tel.feature.watch = True
[rank0]:   File "/home/miguel12/miniconda3/envs/neuralangelo/lib/python3.9/site-packages/wandb/sdk/lib/telemetry.py", line 42, in __exit__
[rank0]:   File "/home/miguel12/miniconda3/envs/neuralangelo/lib/python3.9/site-packages/wandb/sdk/wandb_run.py", line 799, in _telemetry_callback
E0822 21:49:57.518840 139941045491520 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 25214) of binary: /home/miguel12/miniconda3/envs/neuralangelo/bin/python
Traceback (most recent call last):
  File "/home/miguel12/miniconda3/envs/neuralangelo/bin/torchrun", line 10, in <module>
    sys.exit(main())
  File "/home/miguel12/miniconda3/envs/neuralangelo/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/home/miguel12/miniconda3/envs/neuralangelo/lib/python3.9/site-packages/torch/distributed/run.py", line 879, in main
    run(args)
  File "/home/miguel12/miniconda3/envs/neuralangelo/lib/python3.9/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/home/miguel12/miniconda3/envs/neuralangelo/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/miguel12/miniconda3/envs/neuralangelo/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
train.py FAILED
Failures:
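Every frame below my own code is inside wandb (wandb_watch.py -> telemetry.py -> wandb_run.py), so as an isolation step I plan to rerun with wandb turned off via its documented WANDB_MODE environment variable. This is only a sketch, and I don't know whether Neuralangelo overrides the wandb mode internally:

# Isolation experiment (assumption: with wandb fully disabled, the
# watch/telemetry path in init_wandb is skipped, which would confirm
# the rest of training works).
WANDB_MODE=disabled torchrun --nproc_per_node=${GPUS} train.py \
    --logdir=logs/${GROUP}/${NAME} \
    --config=${CONFIG} \
    --show_pbar

Thank you.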