NVlabs / neuralangelo

Official implementation of "Neuralangelo: High-Fidelity Neural Surface Reconstruction" (CVPR 2023)
https://research.nvidia.com/labs/dir/neuralangelo/

The loss becomes 'nan' when resuming training from checkpoints. #21

BoMingZhao opened this issue 1 year ago (status: Open)

BoMingZhao commented 1 year ago

Hi, thank you for open-sourcing such great work. I was training on my own outdoor scene, and when I resumed training from the last epoch's checkpoint, the training loss turned into NaN. Here are some of my terminal outputs, as well as the training results from the last epoch.

cudnn benchmark: True
cudnn deterministic: False
Setup trainer.
Using random seed 0
model parameter count: 366,729,596
Initialize model weights using type: none, gain: None
Using random seed 0
Allow TensorFloat32 operations on supported devices
Train dataset length: 1487                                                                                                                                                               
Val dataset length: 4                                                                                                                                                                    
Loading checkpoint (local): logs/cambridge/StMarysChurch/epoch_00121_iteration_000090000_checkpoint.pt
- Loading the model...
- Loading the optimizer...
- Loading the scheduler...
Done with loading the checkpoint (epoch 121, iter 90000).
Initialize wandb
wandb: Currently logged in as: bmzhao (bmzhao99). Use `wandb login --relogin` to force relogin
cat: /sys/module/amdgpu/initstate: No such file or directory
ERROR:root:Driver not initialized (amdgpu not found in modules)
wandb: Tracking run with wandb version 0.15.8
wandb: Run data is saved locally in logs/cambridge/StMarysChurch/wandb/run-20230815_182850-7tfevxuo
wandb: Run `wandb offline` to turn off syncing.
wandb: Resuming run StMarysChurch
wandb: ⭐️ View project at https://wandb.ai/bmzhao99/StMarysChurch
wandb: 🚀 View run at https://wandb.ai/bmzhao99/StMarysChurch/runs/7tfevxuo
Evaluating with 4 samples.                                                                                                                                                               
Training epoch 122:  13%|███████████████▎               | 99/743 [00:20<02:14,  4.78it/s, iter=90100]
wandb: Waiting for W&B process to finish... (success).
wandb: - 2.931 MB of 2.931 MB uploaded (0.000 MB deduped)
wandb: Run summary:
wandb:                  epoch 134
wandb:              iteration 99600
wandb:               optim/lr 0.001
wandb:             time/epoch 221.29674
wandb:         time/iteration 0.07249
wandb:             train/PSNR 22.44942
wandb:    train/active_levels 16
wandb: train/curvature_weight 5e-05
wandb:   train/eikonal_weight 0.1
wandb:          train/epsilon 0.00049
wandb:   train/loss/curvature 103.02345
wandb:     train/loss/eikonal 0.02801
wandb:      train/loss/render 0.12252
wandb:       train/loss/total 0.13093
wandb:            train/s-var 5.56595
wandb:               val/PSNR 19.31314
wandb:      val/active_levels 16
wandb:   val/curvature_weight 5e-05
wandb:     val/eikonal_weight 0.1
wandb:     val/loss/curvature 97.7878
wandb:       val/loss/eikonal 0.03254
wandb:        val/loss/render 0.07198
wandb:         val/loss/total 0.08056
wandb:              val/s-var 5.54685
wandb: 
wandb: 🚀 View run StMarysChurch at: https://wandb.ai/bmzhao99/StMarysChurch/runs/7tfevxuo
wandb: Synced 3 W&B file(s), 6 media file(s), 0 artifact file(s) and 0 other file(s)
wandb: Find logs at: logs/cambridge/StMarysChurch/wandb/run-20230815_182850-7tfevxuo/logs
/home/zhaoboming/anaconda3/envs/neuralangelo/lib/python3.8/site-packages/wandb/sdk/wandb_run.py:2089: UserWarning: Run (7tfevxuo) is finished. The call to `_console_raw_callback` will be ignored. Please make sure that you are using an active run.
  lambda data: self._console_raw_callback("stderr", data),
Traceback (most recent call last):                                                                                                                                                       
  File "train.py", line 104, in <module>
    main()
  File "train.py", line 93, in main
    trainer.train(cfg,
  File "/mnt/data1/zhaoboming/neuralangelo/projects/neuralangelo/trainer.py", line 106, in train
    super().train(cfg, data_loader, single_gpu, profile, show_pbar)
  File "/mnt/data1/zhaoboming/neuralangelo/projects/nerf/trainers/base.py", line 115, in train
    super().train(cfg, data_loader, single_gpu, profile, show_pbar)
  File "/mnt/data1/zhaoboming/neuralangelo/imaginaire/trainers/base.py", line 511, in train
    self.end_of_iteration(data, current_epoch, current_iteration)
  File "/mnt/data1/zhaoboming/neuralangelo/imaginaire/trainers/base.py", line 319, in end_of_iteration
    self._end_of_iteration(data, current_epoch, current_iteration)
  File "/mnt/data1/zhaoboming/neuralangelo/projects/nerf/trainers/base.py", line 51, in _end_of_iteration
    raise ValueError("Training loss has gone to NaN!!!")
ValueError: Training loss has gone to NaN!!!

Here are my training results.

[screenshots of the training results attached]

AuthorityWang commented 1 year ago

Same problem here: when resuming training from the checkpoint, the loss went to NaN. 😢 It seems that only the outside network weights were loaded, while the center part was not.

[attached render: rgb_render_59901_121a089831c0861b592f]
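One quick way to verify whether every parameter was actually restored is to diff the model's parameter names against the keys stored in the checkpoint. A minimal sketch with plain PyTorch, assuming the weights sit under a "model" key (the key name and the paths are guesses, not necessarily the repo's exact layout):

```python
import torch


def check_loaded_keys(model: torch.nn.Module, ckpt_path: str) -> None:
    """Compare a model's parameter names against the keys stored in a checkpoint."""
    ckpt = torch.load(ckpt_path, map_location="cpu")
    # The weights may be nested under a "model" key (assumption about the layout).
    ckpt_state = ckpt["model"] if isinstance(ckpt, dict) and "model" in ckpt else ckpt

    model_keys, ckpt_keys = set(model.state_dict()), set(ckpt_state)
    print("missing from checkpoint:", sorted(model_keys - ckpt_keys))
    print("unused checkpoint keys:", sorted(ckpt_keys - model_keys))


# Usage (hypothetical): check_loaded_keys(trainer.model, "logs/.../epoch_..._checkpoint.pt")
```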

mli0603 commented 1 year ago

Hi @BoMingZhao @AuthorityWang

Thanks for reporting this. This is probably a bug; let me look into where the problem is.

mli0603 commented 1 year ago

Hi @BoMingZhao @AuthorityWang

I set the checkpoint.save_iter (the line below) to 2k so I can frequently inspect the results. I have not been able to reproduce the issue with early training iterations.

https://github.com/NVlabs/neuralangelo/blob/e398a3bdc841448b75ebbed64935ac7499fcd82d/projects/neuralangelo/configs/base.yaml#L21

Could you help me pin down which loss is NaN with your existing checkpoints?
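Something along these lines could narrow it down; this is only a sketch, assuming the per-term losses are available as a dict of tensors (the `losses` name is hypothetical and may differ in the trainer):

```python
import torch


def report_nonfinite_losses(losses: dict) -> None:
    """Print every individual loss term that is NaN or Inf."""
    for name, value in losses.items():
        if torch.is_tensor(value) and not torch.isfinite(value).all():
            print(f"non-finite loss term: {name} = {value}")


# e.g. call this right before the existing "Training loss has gone to NaN!!!" check.
```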

BoMingZhao commented 1 year ago

> Could you help me pin down which loss is NaN with your existing checkpoints?

Sorry for the late reply. I've been training on another dataset these past few days and didn't encounter the NaN loss issue. I'm now retraining the dataset where the bug previously appeared, hoping to reproduce the problem.

BoMingZhao commented 1 year ago

> Hi @BoMingZhao @AuthorityWang
>
> I set the checkpoint.save_iter (the line below) to 2k so I can frequently inspect the results. I have not been able to reproduce the issue with early training iterations.
>
> https://github.com/NVlabs/neuralangelo/blob/e398a3bdc841448b75ebbed64935ac7499fcd82d/projects/neuralangelo/configs/base.yaml#L21
>
> Could you help me pin down which loss is NaN with your existing checkpoints?

@mli0603 Hi, I found that the render loss is NaN, while both the eikonal loss and the curvature loss are 0.

[screenshot of the logged loss values attached]
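A sketch of how the source of the NaN could be traced further with standard PyTorch tools; the `output["rgb"]` key below is a guess, not necessarily the name used in the code:

```python
import torch

# Make backward() report the operation that first produced a NaN.
# Anomaly detection is slow, so only enable it for a short debugging run.
torch.autograd.set_detect_anomaly(True)


def assert_finite(tag: str, tensor: torch.Tensor) -> None:
    """Fail early in the forward pass instead of waiting for the total-loss check."""
    if not torch.isfinite(tensor).all():
        raise RuntimeError(f"{tag} contains NaN/Inf values")


# e.g. inside the render step (output key name is a guess):
# assert_finite("rendered rgb", output["rgb"])
```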

440981 commented 1 year ago

Have you solved it? I have the same problem.

mli0603 commented 1 year ago

@BoMingZhao @AuthorityWang

We have pushed a commit that potentially fixes the issue with resuming (https://github.com/NVlabs/neuralangelo/commit/c91af8d5098c858df8e8dfa35fba8666d314782b). Please let us know if you still run into the same problem.

If the issue does not appear anymore, feel free to close this.
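In the meantime, an existing checkpoint can be sanity-checked for non-finite weights with plain PyTorch; this is only a sketch, and the nested "model" key is an assumption about the checkpoint layout:

```python
import torch

ckpt_path = "logs/cambridge/StMarysChurch/epoch_00121_iteration_000090000_checkpoint.pt"
ckpt = torch.load(ckpt_path, map_location="cpu")

# The weights may be nested under a "model" key (assumption about the layout).
state = ckpt["model"] if isinstance(ckpt, dict) and "model" in ckpt else ckpt

bad = [name for name, tensor in state.items()
       if torch.is_tensor(tensor) and not torch.isfinite(tensor.float()).all()]
print("non-finite tensors in checkpoint:", bad or "none")
```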

joshpwrk commented 10 months ago

I got a similar issue, but at iteration 290,000.

/usr/local/lib/python3.8/dist-packages/wandb/sdk/wandb_run.py:2089: UserWarning: Run (pqmw9w2j) is finished. The call to `_console_raw_callback` will be ignored. Please make sure that you are using an active run.
  lambda data: self._console_raw_callback("stderr", data),
Traceback (most recent call last):                                                                                                                          
  File "train.py", line 104, in <module>
    main()
  File "train.py", line 93, in main
    trainer.train(cfg,
  File "/workspace/neuralangelo/projects/neuralangelo/trainer.py", line 110, in train
    super().train(cfg, data_loader, single_gpu, profile, show_pbar)
  File "/workspace/neuralangelo/projects/nerf/trainers/base.py", line 115, in train
    super().train(cfg, data_loader, single_gpu, profile, show_pbar)
  File "/workspace/neuralangelo/imaginaire/trainers/base.py", line 512, in train
    self.end_of_iteration(data, current_epoch, current_iteration)
  File "/workspace/neuralangelo/imaginaire/trainers/base.py", line 319, in end_of_iteration
    self._end_of_iteration(data, current_epoch, current_iteration)
  File "/workspace/neuralangelo/projects/nerf/trainers/base.py", line 51, in _end_of_iteration
    raise ValueError("Training loss has gone to NaN!!!")
ValueError: Training loss has gone to NaN!!!
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 10326) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.1.0a0+fe05266', 'console_scripts', 'torchrun')())
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>

I am running it from a Docker setup and used a simple git clone to get the latest main branch.