NVlabs / neuralangelo

Official implementation of "Neuralangelo: High-Fidelity Neural Surface Reconstruction" (CVPR 2023)
https://research.nvidia.com/labs/dir/neuralangelo/

Error at training epoch 1: ValueError: Training loss has gone to NaN!!! #102

Closed: XinyueZ closed this issue 1 year ago

XinyueZ commented 1 year ago
Initialize model weights using type: none, gain: None
Using random seed 0
Allow TensorFloat32 operations on supported devices
Train dataset length: 2428                                                                                                         
Val dataset length: 4                                                                                                              
Training from scratch.
Initialize wandb
Evaluating with 4 samples.                                                                                                         
Traceback (most recent call last):                                                                                                 
  File "/workspace/train.py", line 104, in <module>
    main()
  File "/workspace/train.py", line 93, in main
    trainer.train(cfg,
  File "/workspace/projects/neuralangelo/trainer.py", line 107, in train
    super().train(cfg, data_loader, single_gpu, profile, show_pbar)
  File "/workspace/projects/nerf/trainers/base.py", line 115, in train
    super().train(cfg, data_loader, single_gpu, profile, show_pbar)
  File "/workspace/imaginaire/trainers/base.py", line 512, in train
    self.end_of_iteration(data, current_epoch, current_iteration)
  File "/workspace/imaginaire/trainers/base.py", line 319, in end_of_iteration
    self._end_of_iteration(data, current_epoch, current_iteration)
  File "/workspace/projects/nerf/trainers/base.py", line 51, in _end_of_iteration
    raise ValueError("Training loss has gone to NaN!!!")
ValueError: Training loss has gone to NaN!!!
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 684345) of binary: /usr/bin/python3.10
Traceback (most recent call last):
  File "/home/user/.local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/user/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/user/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/home/user/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/user/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/user/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-09-02_09:26:18
  host      : fa8a3506d0d7
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 684345)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
mli0603 commented 1 year ago

Hi @XinyueZ

It looks like the training diverges right away. In our experience, this usually points to incorrect data preprocessing. Could you provide more details on how you performed this step?
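In the meantime, a quick way to rule out an obviously broken preprocessing output is to sanity-check the generated transforms.json before training. This is only a rough sketch, not part of the repo: it assumes the Instant-NGP-style layout (a top-level "frames" list with a 4x4 "transform_matrix" per image) that the preprocessing step is expected to emit, and the dataset path is a placeholder.

```python
import json
import numpy as np

# Rough sanity check of the preprocessed transforms.json (path is a placeholder).
with open("datasets/my_scene/transforms.json") as f:
    meta = json.load(f)

centers = []
for frame in meta["frames"]:
    pose = np.array(frame["transform_matrix"], dtype=np.float64)
    # Non-finite poses will make the loss blow up immediately.
    assert np.isfinite(pose).all(), f"non-finite pose for {frame.get('file_path')}"
    centers.append(pose[:3, 3])

centers = np.stack(centers)
print("num registered cameras:", len(centers))
print("camera-center extent per axis:")
print("  min:", centers.min(axis=0))
print("  max:", centers.max(axis=0))
```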

XinyueZ commented 1 year ago

@mli0603 Nothing special, I just followed the Colab.

mli0603 commented 1 year ago

Hi @XinyueZ

It is worth noting that the default configs of the toy example in the Colab may not be applicable to all scenarios. More details can be found in the data preprocessing documentation :)
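As a concrete illustration, when the auto-generated bounding region does not fit your scene, you can edit the per-scene config produced by preprocessing instead of reusing the toy settings. The snippet below is only a sketch: the config path is a placeholder, and the data.readjust.scale / data.readjust.center keys are assumptions based on the data preprocessing documentation, so double-check them against your generated config.

```python
import yaml  # pip install pyyaml

config_path = "projects/neuralangelo/configs/custom/my_scene.yaml"  # placeholder path

with open(config_path) as f:
    cfg = yaml.safe_load(f)

# Enlarge/shift the bounding region so it covers the full region of interest.
# The key names below are assumptions; verify them in your generated config.
readjust = cfg.setdefault("data", {}).setdefault("readjust", {})
readjust["scale"] = 1.5              # grow the bounding sphere, tune per scene
readjust["center"] = [0.0, 0.0, 0.0]

with open(config_path, "w") as f:
    yaml.safe_dump(cfg, f)
```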

Adolfhill commented 1 year ago

I also encountered this issue. When I ran visualize_colmap.ipynb, I found that the bounding sphere did not cover the entire camera trajectory. Could this be the cause of the error? Should the bounding sphere cover the full trajectory and/or enough of the sparse points?

mli0603 commented 1 year ago

Hi @Adolfhill

The bounding region does not need to cover all of the cameras, but it should cover the region of interest. You can also see the scene-type discussion in #110.
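If you want to check this outside the notebook, a rough sketch like the one below can quantify how the camera trajectory relates to the bounding sphere. It assumes the generated transforms.json carries "sphere_center" and "sphere_radius" entries alongside the per-frame "transform_matrix"; if your file uses different key names, adapt accordingly. Cameras lying outside the sphere is fine as long as the sphere encloses the surface you want to reconstruct.

```python
import json
import numpy as np

with open("datasets/my_scene/transforms.json") as f:  # placeholder path
    meta = json.load(f)

# "sphere_center" / "sphere_radius" are assumed key names; check your file.
center = np.array(meta["sphere_center"], dtype=np.float64)
radius = float(meta["sphere_radius"])

cam_centers = np.stack(
    [np.array(fr["transform_matrix"], dtype=np.float64)[:3, 3] for fr in meta["frames"]]
)
dist = np.linalg.norm(cam_centers - center, axis=1)

inside = int((dist <= radius).sum())
print(f"{inside}/{len(cam_centers)} camera centers inside the bounding sphere")
print(f"max camera distance / sphere radius = {dist.max() / radius:.2f}")
```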

chenhsuanlin commented 1 year ago

Closing due to inactivity, please feel free to reopen if there are further issues.