BoMingZhao opened this issue 1 year ago
The same problem here: when resuming training from the checkpoint, the loss went to NaN. 😢 It seems that only the outside network weights were loaded, while the center part was not.
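For reference, one way to check whether every parameter was actually restored is to compare the checkpoint's state dict against the model's. This is a hypothetical snippet: the checkpoint path and the `model` object are placeholders, not the repo's actual loading API.

```python
import torch

# Load the saved checkpoint (the path is a placeholder).
ckpt = torch.load("logs/latest_checkpoint.pt", map_location="cpu")
state = ckpt.get("model", ckpt)  # some trainers nest the weights under a "model" key

# `model` is assumed to be the already-constructed network.
result = model.load_state_dict(state, strict=False)

# Any keys reported here were NOT restored and keep their freshly
# initialized values, which could explain a sudden NaN loss after resuming.
print("missing keys:   ", result.missing_keys)
print("unexpected keys:", result.unexpected_keys)
```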
Hi @BoMingZhao @AuthorityWang
Thanks for reporting this. This is probably a bug; let me look into where the problem is.
Hi @BoMingZhao @AuthorityWang
I set checkpoint.save_iter (the line below) to 2k so I can frequently inspect the results. I have not been able to reproduce the issue with early training iterations.
Could you help me pin down which loss is NaN with your existing checkpoints?
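In case it helps with narrowing this down, each loss term can be checked individually right after it is computed. This is a hypothetical snippet; the dict keys below just mirror the losses mentioned in this thread, not the trainer's exact names.

```python
import torch

def report_nonfinite_losses(losses: dict):
    """Print which individual loss terms are NaN or Inf before they are summed."""
    for name, value in losses.items():
        if not torch.isfinite(value).all():
            print(f"loss '{name}' is non-finite: {value}")

# Example usage inside a training step:
# losses = {"render": render_loss, "eikonal": eikonal_loss, "curvature": curvature_loss}
# report_nonfinite_losses(losses)
```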
Sorry for replying late. I've been training another dataset these past few days and didn't encounter the 'loss nan' issue. I'm now trying to retrain the dataset where the bug previously appeared, hoping to reproduce the problem.
@mli0603 Hi, I find that the render loss is NaN, while both the eikonal loss and the curvature loss are 0.
Have you solved it? I have the same problem.
@BoMingZhao @AuthorityWang
We have pushed a commit that potentially fixes the issue of resuming (https://github.com/NVlabs/neuralangelo/commit/c91af8d5098c858df8e8dfa35fba8666d314782b). Please let us know if you still run into the same problem.
If the issue does not appear anymore, feel free to close this.
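For anyone hitting this before pulling the commit, the general idea behind a resume fix of this kind is to make sure all training state is restored, not just part of the network. Below is a minimal sketch of that pattern; it is my own illustration under that assumption, not the contents of the linked commit.

```python
import torch

def save_checkpoint(path, model, optimizer, scheduler, iteration):
    # Persist every piece of state that training depends on.
    torch.save({
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "scheduler": scheduler.state_dict(),
        "iteration": iteration,
    }, path)

def load_checkpoint(path, model, optimizer, scheduler):
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model"])          # restore all sub-modules, not a subset
    optimizer.load_state_dict(ckpt["optimizer"])  # optimizer moments matter for stability
    scheduler.load_state_dict(ckpt["scheduler"])
    return ckpt["iteration"]
```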
I got a similar issue, but at iteration 290,000.
/usr/local/lib/python3.8/dist-packages/wandb/sdk/wandb_run.py:2089: UserWarning: Run (pqmw9w2j) is finished. The call to `_console_raw_callback` will be ignored. Please make sure that you are using an active run.
lambda data: self._console_raw_callback("stderr", data),
Traceback (most recent call last):
File "train.py", line 104, in <module>
main()
File "train.py", line 93, in main
trainer.train(cfg,
File "/workspace/neuralangelo/projects/neuralangelo/trainer.py", line 110, in train
super().train(cfg, data_loader, single_gpu, profile, show_pbar)
File "/workspace/neuralangelo/projects/nerf/trainers/base.py", line 115, in train
super().train(cfg, data_loader, single_gpu, profile, show_pbar)
File "/workspace/neuralangelo/imaginaire/trainers/base.py", line 512, in train
self.end_of_iteration(data, current_epoch, current_iteration)
File "/workspace/neuralangelo/imaginaire/trainers/base.py", line 319, in end_of_iteration
self._end_of_iteration(data, current_epoch, current_iteration)
File "/workspace/neuralangelo/projects/nerf/trainers/base.py", line 51, in _end_of_iteration
raise ValueError("Training loss has gone to NaN!!!")
ValueError: Training loss has gone to NaN!!!
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 10326) of binary: /usr/bin/python
Traceback (most recent call last):
File "/usr/local/bin/torchrun", line 33, in <module>
sys.exit(load_entry_point('torch==2.1.0a0+fe05266', 'console_scripts', 'torchrun')())
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 794, in main
run(args)
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
I am running it from a Docker setup and used a simple git clone to get the latest main branch.
Hi, thank you for open-sourcing such great work. I was training on my own outdoor scene, and when I resumed training from the last epoch, the training loss turned into NaN. Here are some of my terminal outputs, as well as the training results from the last epoch.
Here are my training results.