Closed: loocy3 closed this issue 2 years ago
I'm also having this kind of issue. I'm training on the same MegaDepth dataset with different configurations of the U-Net (encoder pretrained on other data, frozen encoder, removed decoder, etc.). All of them lead to NaN at some point during the optimization. I haven't yet determined whether the NaNs come from the optimization or directly from the features.
Edit: I did not change the random seed either, and the error does not repeat at the same iteration; it seems to appear randomly in the middle of training.
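To tell the two apart, I've been checking the features for non-finite values right after extraction. A minimal sketch (my own helper, not pixloc code):

```python
import torch

def check_finite(feats: torch.Tensor, where: str = 'features') -> torch.Tensor:
    # Fail fast on the first non-finite value, which distinguishes
    # "the U-Net emits NaN features" from "the optimization diverges later".
    if not torch.isfinite(feats).all():
        raise RuntimeError(f'NaN/Inf detected in {where}')
    return feats

check_finite(torch.randn(2, 32, 64, 64))  # passes on finite input
```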
This is concerning; let me dig into it (this will likely take me a few days).
```
[11/02/2021 07:16:05 pixloc INFO] [E 7 | it 2450] loss {total 3.257E+00, reprojection_error/0 9.695E+00, reprojection_error/1 8.376E+00, reprojection_error/2 8.366E+00, reprojection_error 8.366E+00, reprojection_error/init 3.127E+01}
[11/02/2021 07:16:06 pixloc.pixlib.models.two_view_refiner WARNING] NaN detected ['error', tensor([ nan, 5.0000e+01, 1.4714e-01, 2.6252e-03, 2.5921e-02, 3.2593e-02], device='cuda:0', grad_fn=<...>), 'loss', tensor([ nan, 0.0000, 0.0490, 0.0009, 0.0086, 0.0109], device='cuda:0', grad_fn=<...>)]
[W python_anomaly_mode.cpp:104] Warning: Error detected in PowBackward1. Traceback of forward call that caused the error:
  File "/home/jmorlana/anaconda3/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/jmorlana/anaconda3/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/jmorlana/pixloc/pixloc/pixlib/train.py", line 391, in <module>
    main_worker(0, conf, output_dir, args)
  File "/home/jmorlana/pixloc/pixloc/pixlib/train.py", line 358, in main_worker
    training(rank, conf, output_dir, args)
  File "/home/jmorlana/pixloc/pixloc/pixlib/train.py", line 259, in training
    losses = loss_fn(pred, data)
  File "/home/jmorlana/pixloc/pixloc/pixlib/models/two_view_refiner.py", line 151, in loss
    err = reprojection_error(T_opt).clamp(max=self.conf.clamp_error)
  File "/home/jmorlana/pixloc/pixloc/pixlib/models/two_view_refiner.py", line 133, in reprojection_error
    err = scaled_barron(1., 2.)(err)[0]/4
  File "/home/jmorlana/pixloc/pixloc/pixlib/geometry/losses.py", line 81, in <lambda>
    return lambda x: scaled_loss(
  File "/home/jmorlana/pixloc/pixloc/pixlib/geometry/losses.py", line 18, in scaled_loss
    loss, loss_d1, loss_d2 = fn(x/a2)
  File "/home/jmorlana/pixloc/pixloc/pixlib/geometry/losses.py", line 82, in <lambda>
    x, lambda y: barron_loss(y, y.new_tensor(a)), c)
  File "/home/jmorlana/pixloc/pixloc/pixlib/geometry/losses.py", line 59, in barron_loss
    torch.pow(x / beta_safe + 1., 0.5 * alpha) - 1.)
 (function _print_stack)
```
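For what it's worth, the failure mode of `PowBackward` itself is easy to reproduce standalone: the derivative of a fractional power is unbounded at zero and undefined for negative bases. A minimal sketch, independent of pixloc (note that in the log above the error tensor is already NaN, so the actual source may be further upstream):

```python
import torch

# d/dx x**0.5 = 0.5 * x**(-0.5): inf at x = 0, NaN for x < 0.
x = torch.tensor([0.0, -1e-8, 1.0], requires_grad=True)
torch.pow(x, 0.5).sum().backward()
print(x.grad)  # tensor([inf, nan, 0.5000])
```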
Thank you!
Thank you for the analysis. I have reproduced the issue:
```
[W python_anomaly_mode.cpp:104] Warning: Error detected in MulBackward0. Traceback of forward call that caused the error:
  File "pixloc/pixlib/train.py", line 417, in <module>
RuntimeError: Function 'PowBackward1' returned nan values in its 0th output.
```
I believe that the issue has been addressed by https://github.com/cvg/pixloc/commit/8937e29baa49e62326e9b9a98766e48420a563fb and https://github.com/cvg/pixloc/commit/0ab0e795a443c67ccb948b6fa375393a5b98c093. Can you please confirm that this helps? I will continue to investigate other sources of instability.
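For later readers: the usual stabilization for this kind of failure is to keep the base of the `pow` strictly positive. A hedged sketch of the idea (the authoritative change is in the linked commits; the function name and `eps` below are illustrative assumptions, not pixloc's API):

```python
import torch

def stable_pow_term(x, alpha, beta_safe, eps=1e-6):
    # Clamping the base away from zero keeps the gradient
    # 0.5 * alpha * base**(0.5 * alpha - 1) finite during backward.
    base = (x / beta_safe + 1.0).clamp(min=eps)
    return torch.pow(base, 0.5 * alpha) - 1.0  # term from losses.py line 59
```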
I tested the changed code, but I get the same error.
@angiend What dataset are you training with? At which iteration does it crash? With which version of PyTorch?
@Skydes I retrained on the CMU dataset; it crashes at "E 65 | it 800" (3000 iterations per epoch), and my PyTorch version is 1.9.1.
Training has usually fully converged by epoch 20, so this should not prevent reproducing the results. Could you give PyTorch 1.7.1 a try? I have tried both 1.7.1 and 1.10.0 and both work fine.
Thanks, I have tested for 3 epochs and I think this issue has been fixed.
After 21850 training iterations, I got NaN in the features extracted by the U-Net. Could you give any advice on where in the source code I should look?
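In the meantime, one way I'm trying to narrow it down is to hook every module of the extractor and fail on the first non-finite activation, so the NaN can be traced to a specific block of the U-Net (or to its input). A debugging sketch of my own, not pixloc code:

```python
import torch

def add_nan_hooks(model: torch.nn.Module) -> None:
    # Raise as soon as any submodule produces a non-finite tensor.
    def hook(module, inputs, output):
        if isinstance(output, torch.Tensor) and not torch.isfinite(output).all():
            raise RuntimeError(f'non-finite output in {module.__class__.__name__}')
    for m in model.modules():
        m.register_forward_hook(hook)

# Example on a toy network; the real target would be the feature extractor.
add_nan_hooks(torch.nn.Sequential(torch.nn.Conv2d(3, 8, 3), torch.nn.ReLU()))
```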