cvg / pixloc

Back to the Feature: Learning Robust Camera Localization from Pixels to Pose (CVPR 2021)
Apache License 2.0

Got NaN values during training with the provided command #21

Closed VictorZoo closed 2 years ago

VictorZoo commented 2 years ago

Sorry to bother you. I downloaded the Undistorted_megadepth dataset and ran this command: "python -m pixloc.pixlib.train pixloc_megadepth_reproduce --conf pixloc/pixlib/configs/train_pixloc_megadepth.yaml". During training, the loss became NaN, as shown below:

Evaluation: 97%|#########7| 99/102 [00:21<00:00, 5.18it/s]

Evaluation: 98%|#########8| 100/102 [00:21<00:00, 5.18it/s]

Evaluation: 99%|#########9| 101/102 [00:22<00:00, 4.92it/s]

Evaluation: 100%|##########| 102/102 [00:22<00:00, 4.60it/s]

[11/26/2021 01:22:48 pixloc INFO] [Validation] {R_error/0 3.900E+00, t_error/0 1.549E-01, R_error/1 4.264E+00, t_error/1 1.560E-01, R_error/2 3.748E+00, t_error/2 1.468E-01, R_error 3.748E+00, R_error_median 4.480E+00, t_error 1.468E-01, t_error_median 1.756E-01, R_error/init 1.718E+00, t_error/init 1.331E-01, loss/total 1.939E+01, loss/reprojection_error/0 3.426E+01, loss/reprojection_error/1 3.524E+01, loss/reprojection_error/2 2.786E+01, loss/reprojection_error 2.786E+01, loss/reprojection_error_median 3.192E+01, loss/reprojection_error/init 1.634E+01, loss/reprojection_error/init_median 1.069E+01}

[11/26/2021 01:23:24 pixloc.pixlib.models.two_view_refiner WARNING] Few points in batch [('0035', 472, 741)]

[11/26/2021 01:23:31 pixloc INFO] [E 3 | it 550] loss {total NAN, reprojection_error/0 NAN, reprojection_error/1 NAN, reprojection_error/2 NAN, reprojection_error NAN, reprojection_error/init 1.343E+01}

[11/26/2021 01:24:15 pixloc INFO] [E 3 | it 600] loss {total NAN, reprojection_error/0 NAN, reprojection_error/1 NAN, reprojection_error/2 NAN, reprojection_error NAN, reprojection_error/init 1.996E+01}

[11/26/2021 01:24:28 pixloc.pixlib.models.two_view_refiner WARNING] Few points in batch [('1017', 962, 1173)]

[11/26/2021 01:24:58 pixloc INFO] [E 3 | it 650] loss {total NAN, reprojection_error/0 NAN, reprojection_error/1 NAN, reprojection_error/2 NAN, reprojection_error NAN, reprojection_error/init 2.845E+01}

[11/26/2021 01:25:41 pixloc INFO] [E 3 | it 700] loss {total NAN, reprojection_error/0 NAN, reprojection_error/1 NAN, reprojection_error/2 NAN, reprojection_error NAN, reprojection_error/init 1.092E+01}

[11/26/2021 01:26:13 pixloc.pixlib.models.two_view_refiner WARNING] Few points in batch [('0348', 167, 168)]

[11/26/2021 01:26:24 pixloc INFO] [E 3 | it 750] loss {total NAN, reprojection_error/0 NAN, reprojection_error/1 NAN, reprojection_error/2 NAN, reprojection_error NAN, reprojection_error/init 8.301E+00}

[11/26/2021 01:27:08 pixloc INFO] [E 3 | it 800] loss {total NAN, reprojection_error/0 NAN, reprojection_error/1 NAN, reprojection_error/2 NAN, reprojection_error NAN, reprojection_error/init 2.569E+01}

[11/26/2021 01:27:19 pixloc.pixlib.models.two_view_refiner WARNING] Few points in batch [('5009', 168, 165)]

[11/26/2021 01:27:51 pixloc INFO] [E 3 | it 850] loss {total NAN, reprojection_error/0 NAN, reprojection_error/1 NAN, reprojection_error/2 NAN, reprojection_error NAN, reprojection_error/init 2.809E+01}

[11/26/2021 01:28:23 pixloc.pixlib.models.two_view_refiner WARNING] Few points in batch [('5005', 140, 136)]

[11/26/2021 01:28:34 pixloc INFO] [E 3 | it 900] loss {total NAN, reprojection_error/0 NAN, reprojection_error/1 NAN, reprojection_error/2 NAN, reprojection_error NAN, reprojection_error/init 8.891E+00}

In this run the NaNs first appeared in epoch 3, but when I ran the same command again, they appeared in epoch 1, at iteration 1180.

I just want to reproduce your results and haven't changed any parameters; everything is the same as what you posted. Is there something I overlooked, or is something going wrong? My PyTorch version is 1.7.1 and my NumPy version is 1.21.2.

Thank you so much.

sarlinpe commented 2 years ago

That is certainly an issue.

  1. Are you training with the latest commit (https://github.com/cvg/pixloc/commit/0ab0e795a443c67ccb948b6fa375393a5b98c093)?
  2. Can you run with anomaly detection enabled to figure out where this comes from?
  3. Let's discuss in https://github.com/cvg/pixloc/issues/10 instead. So far, with the last fixes, I haven't managed to reproduce the NaNs.
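
For point 2, a minimal sketch of what running under PyTorch's autograd anomaly detection could look like. The context manager `torch.autograd.detect_anomaly()` is part of PyTorch; `toy_step` and `find_nan_source` are hypothetical names used only for illustration, and the thread does not say where this should be enabled in pixloc's training script. The toy step is chosen so that the forward loss is finite while the backward pass produces a NaN gradient, so that checking the loss value alone would miss the problem.

```python
import torch


def toy_step():
    """Hypothetical stand-in for one training step; returns a scalar loss.

    The forward value is finite (0.0), but the gradient of sqrt at 0 is
    infinite, and multiplying it by the zero upstream gradient gives NaN.
    """
    x = torch.zeros(1, requires_grad=True)
    return (torch.sqrt(x) * 0.0).sum()


def find_nan_source(step_fn):
    """Run one forward/backward pass under anomaly detection.

    Returns the error message naming the backward function that produced
    NaN values, or None if the backward pass completed without error.
    """
    try:
        with torch.autograd.detect_anomaly():
            loss = step_fn()
            loss.backward()
    except RuntimeError as err:
        return str(err)
    return None


message = find_nan_source(toy_step)
print(message)
```

Anomaly detection slows training down noticeably, so it is usually enabled only while debugging. In a real training loop the same effect can be obtained globally by calling `torch.autograd.set_detect_anomaly(True)` once before training starts.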