chrischoy / DeepGlobalRegistration

[CVPR 2020 Oral] A differentiable framework for 3D registration

got infinite loss when training modelnet40 dataset #20

Open lixiang-ucas opened 3 years ago

lixiang-ucas commented 3 years ago

Hello Christopher, I'm trying to evaluate your model on the synthetic ModelNet40 dataset. I wrote my own dataloader and trained your model with the default settings, but after a few steps the loss becomes infinite. Can you give some suggestions on how to train your model on the ModelNet40 dataset? (I've tried different voxel sizes from 0.01 to 0.05, but none of them worked.)
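One thing worth checking: ModelNet40 models are typically normalized to roughly unit scale, so a voxel size chosen for 3DMatch-style indoor scans may leave far too few points after downsampling. A quick sanity check is to count how many occupied voxels survive for a candidate voxel size before training. This is a minimal pure-Python sketch (the helper `voxel_downsample_count` is hypothetical, not part of the DGR codebase):

```python
import math
import random

def voxel_downsample_count(points, voxel_size):
    """Count occupied voxels after grid downsampling.

    Each point maps to an integer voxel index by flooring its coordinates
    divided by voxel_size; points sharing a voxel collapse to one sample.
    """
    voxels = {tuple(math.floor(c / voxel_size) for c in p) for p in points}
    return len(voxels)

# Points sampled on a unit sphere, roughly the scale of ModelNet40 models.
random.seed(0)
cloud = []
for _ in range(5000):
    x, y, z = (random.gauss(0, 1) for _ in range(3))
    n = math.sqrt(x * x + y * y + z * z)
    cloud.append((x / n, y / n, z / n))

for vs in (0.01, 0.05, 0.2):
    print(vs, voxel_downsample_count(cloud, vs))
```

If the count barely drops below the raw point count, the voxel size is effectively a no-op; if it collapses to a handful of voxels, almost all geometry is being quantized away. Either extreme can starve the correspondence search.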

A snippet of the training log can be found below. It seems the 'ws' values returned from weighted_procrustes all fall below the threshold (10 in your code), which means every input item has fewer than 10 valid correspondence pairs.

```
12/27 19:11:19 => loading weights for inlier model './FCGF_pretrained_3dmatch.pth'
12/27 19:11:19 => Loaded base model weights from './FCGF_pretrained_3dmatch.pth'
12/27 19:11:19 Inlier weight not found in './FCGF_pretrained_3dmatch.pth'
/home/mmvc/anaconda3/envs/XL_py3_cuda10/lib/python3.7/site-packages/torch/optim/lr_scheduler.py:449: UserWarning: To get the last learning rate computed by the scheduler, please use get_last_lr().
  "please use get_last_lr().", UserWarning)
12/27 19:11:19 Epoch: 1, LR: [0.1]
num valid_mask, weights.mean(), weights.max 6141 0.49184879660606384 0.8573861122131348
num_valid, ws 8 tensor([379.1353, 379.4299, 379.1117, 378.9486, 366.3976, 379.6558, 379.5316, 379.6907])
rot_error, trans_error, loss 1.0711798667907715 0.260419100522995 1.3315989971160889
12/27 19:11:23 Train Epoch: 1 [0/1230], Current Loss: 2.011e+00, Correspondence acc: 2.116e-02,
Precision: 0.0156, Recall: 0.0154, F1: 0.0155, TPR: 0.0154, TNR: 0.9790, BAcc: 0.4972
RTE: 2.604e-01, RRE: 6.137e+01, Succ rate: 2.500000e-01 Avg num valid: 8.000000e+00
Data time: 2.2289, Train time: 1.5519, NN search time: 1.772e-02, Total time: 3.7808
num valid_mask, weights.mean(), weights.max 6141 0.4818011224269867 0.8922387957572937
num_valid, ws 8 tensor([370.2599, 365.9893, 370.7393, 370.6736, 370.6388, 370.6003, 370.6378, 370.5956])
rot_error, trans_error, loss 0.7573173642158508 0.19806239008903503 0.9553797245025635
num valid_mask, weights.mean(), weights.max 6143 0.45103055238723755 0.9931040406227112
num_valid, ws 8 tensor([345.9790, 343.8760, 344.7417, 344.6624, 344.8690, 345.0934, 357.2584, 344.6322])
rot_error, trans_error, loss 0.7590370774269104 0.26072144508361816 1.0197584629058838
num valid_mask, weights.mean(), weights.max 6144 0.42308372259140015 0.9964439272880554
num_valid, ws 8 tensor([327.4283, 324.3058, 324.6727, 324.4388, 324.5334, 324.9588, 324.6275, 324.4608])
rot_error, trans_error, loss 1.3009799718856812 0.21060435473918915 1.5115842819213867
num valid_mask, weights.mean(), weights.max 6144 0.38447126746177673 0.9924339056015015
num_valid, ws 8 tensor([309.3771, 293.2216, 293.0517, 294.7009, 293.1519, 292.7238, 292.6430, 293.3217])
rot_error, trans_error, loss 0.544593334197998 0.2057737112045288 0.7503671646118164
num valid_mask, weights.mean(), weights.max 6143 0.34405946731567383 0.8312152028083801
num_valid, ws 8 tensor([265.8239, 263.1904, 265.2418, 266.2042, 263.8054, 263.6702, 263.0773, 262.8550])
rot_error, trans_error, loss 1.1339480876922607 0.24252642691135406 1.3764744997024536
num valid_mask, weights.mean(), weights.max 6144 0.2863656282424927 0.9999984502792358
num_valid, ws 8 tensor([222.1238, 218.1181, 215.6836, 226.8676, 217.5520, 217.8743, 220.7595, 220.4514])
rot_error, trans_error, loss 1.0707015991210938 0.27141687273979187 1.3421183824539185
num valid_mask, weights.mean(), weights.max 6144 0.21661357581615448 0.34146520495414734
num_valid, ws 8 tensor([173.3432, 162.5075, 164.1453, 168.3337, 164.7490, 168.9405, 158.8151, 170.0394])
rot_error, trans_error, loss 0.8970396518707275 0.30696189403533936 1.2040014266967773
num valid_mask, weights.mean(), weights.max 6138 0.22650477290153503 0.9914805889129639
num_valid, ws 8 tensor([171.5588, 173.3483, 179.2511, 173.2488, 172.3625, 172.5608, 172.9400, 176.2171])
rot_error, trans_error, loss 1.0415538549423218 0.34417128562927246 1.3857251405715942
num valid_mask, weights.mean(), weights.max 6116 0.17247292399406433 0.35712677240371704
num_valid, ws 8 tensor([132.9017, 130.7651, 128.1572, 133.2996, 131.4898, 134.0894, 132.3888, 135.9132])
rot_error, trans_error, loss 0.9503939151763916 0.22698664665222168 1.1773805618286133
num valid_mask, weights.mean(), weights.max 6064 0.1356555074453354 0.7415195107460022
num_valid, ws 8 tensor([104.9554, 104.2914, 102.3165, 103.3207, 103.8977, 103.7889, 103.3343, 105.0705])
rot_error, trans_error, loss 0.5307180881500244 0.16665557026863098 0.697373628616333
num valid_mask, weights.mean(), weights.max 6048 0.12273039668798447 0.5408609509468079
num_valid, ws 8 tensor([92.7914, 91.0353, 90.9816, 92.6906, 98.6065, 94.2565, 96.6829, 94.6045])
rot_error, trans_error, loss 0.7709248661994934 0.1336054652929306 0.9045303463935852
num valid_mask, weights.mean(), weights.max 5288 0.07958836853504181 0.17408770322799683
num_valid, ws 8 tensor([59.2375, 49.9505, 57.0326, 52.8035, 60.2806, 60.2326, 60.3720, 57.5639])
rot_error, trans_error, loss 1.1103489398956299 0.25814875960350037 1.3684978485107422
num valid_mask, weights.mean(), weights.max 3978 0.06213882565498352 0.18988150358200073
num_valid, ws 8 tensor([32.8604, 31.9178, 39.7851, 39.5970, 37.5519, 40.9863, 38.9679, 40.8111])
rot_error, trans_error, loss 1.3329100608825684 0.35916927456855774 1.6920795440673828
num valid_mask, weights.mean(), weights.max 3824 0.05560467392206192 0.16689369082450867
num_valid, ws 8 tensor([34.9690, 44.5694, 32.6451, 32.3816, 34.3029, 31.2937, 31.9045, 32.9062])
rot_error, trans_error, loss 1.1255593299865723 0.261192262172699 1.386751413345337
num valid_mask, weights.mean(), weights.max 2160 0.04612640663981438 0.19192813336849213
num_valid, ws 8 tensor([21.0208, 19.3035, 14.4086, 24.3450, 20.2616, 22.0708, 29.0271, 16.9002])
rot_error, trans_error, loss 0.9262062311172485 0.3681674003601074 1.2943737506866455
num valid_mask, weights.mean(), weights.max 4355 0.05530725419521332 0.10722033679485321
num_valid, ws 8 tensor([34.5013, 41.0727, 37.0651, 35.8570, 38.5333, 39.4196, 36.3496, 38.5415])
rot_error, trans_error, loss 0.977398157119751 0.1942673623561859 1.1716655492782593
num valid_mask, weights.mean(), weights.max 1208 0.03593476861715317 0.09210973232984543
num_valid, ws 5 tensor([11.2543, 8.6302, 11.9625, 9.4134, 14.0098, 13.0315, 11.4335, 5.5197])
rot_error, trans_error, loss 1.2372373342514038 0.310983270406723 1.81670343875885
num valid_mask, weights.mean(), weights.max 2057 0.040355708450078964 0.10537022352218628
num_valid, ws 8 tensor([17.9894, 16.3948, 18.6110, 18.1689, 11.5592, 15.3969, 13.3000, 15.5626])
rot_error, trans_error, loss 0.769038736820221 0.1769949048757553 0.9460336565971375
num valid_mask, weights.mean(), weights.max 1714 0.036536455154418945 0.0762948989868164
num_valid, ws 7 tensor([17.7795, 12.5434, 10.8226, 14.2316, 14.7396, 10.7642, 7.0441, 12.2532])
rot_error, trans_error, loss 0.6872734427452087 0.21808893978595734 0.9117898941040039
num valid_mask, weights.mean(), weights.max 572 0.028709784150123596 0.08586414158344269
num_valid, ws 0 tensor([2.6134, 7.0487, 4.7059, 3.5765, 4.7559, 5.8939, 6.2377, 0.6551])
rot_error, trans_error, loss 1.3239400386810303 0.4019933044910431 nan
12/27 19:12:02 Loss is infinite, abort
num valid_mask, weights.mean(), weights.max 680 0.03068387135863304 0.09823115170001984
num_valid, ws 0 tensor([4.6802, 3.1837, 6.2774, 6.7470, 5.4355, 6.1506, 3.3652, 6.2467])
rot_error, trans_error, loss 1.7035499811172485 0.46357956528663635 nan
12/27 19:12:04 Loss is infinite, abort
num valid_mask, weights.mean(), weights.max 1567 0.03716740012168884 0.09059126675128937
num_valid, ws 5 tensor([18.5755, 16.7832, 11.3389, 9.6333, 11.9852, 9.1712, 5.9884, 11.1698])
rot_error, trans_error, loss 0.9750125408172607 0.32420605421066284 1.2242933511734009
num valid_mask, weights.mean(), weights.max 1137 0.03218415006995201 0.09962588548660278
num_valid, ws 3 tensor([12.0679, 8.3144, 9.6743, 9.2133, 3.5999, 8.8996, 11.8518, 10.5215])
rot_error, trans_error, loss 0.5337458252906799 0.19239680469036102 0.6585435271263123
num valid_mask, weights.mean(), weights.max 966 0.029521774500608444 0.10332392156124115
num_valid, ws 0 tensor([8.4000, 8.9959, 6.1997, 8.6553, 9.4742, 7.0023, 8.5738, 4.6189])
rot_error, trans_error, loss 1.0506246089935303 0.3039046823978424 nan
12/27 19:12:10 Loss is infinite, abort
num valid_mask, weights.mean(), weights.max 1074 0.02992144785821438 0.10137491673231125
num_valid, ws 2 tensor([12.6404, 8.2678, 3.2425, 12.0476, 6.9744, 5.5812, 9.3145, 8.7726])
rot_error, trans_error, loss 1.611760139465332 0.4179706275463104 1.2622950077056885
num valid_mask, weights.mean(), weights.max 1543 0.034905172884464264 0.09911085665225983
num_valid, ws 5 tensor([10.5303, 9.5016, 17.0358, 9.5487, 14.0452, 16.5256, 8.4574, 10.9382])
rot_error, trans_error, loss 1.512508749961853 0.3802349269390106 2.0282459259033203
num valid_mask, weights.mean(), weights.max 1506 0.03240210562944412 0.1912095695734024
num_valid, ws 4 tensor([10.4049, 8.4145, 12.4113, 8.7133, 10.1371, 16.9338, 7.0592, 9.6483])
rot_error, trans_error, loss 0.7196844816207886 0.161781445145607 0.6110647916793823
num valid_mask, weights.mean(), weights.max 831 0.02881685458123684 0.10011401772499084
num_valid, ws 0 tensor([7.1511, 2.9459, 9.2325, 5.5985, 2.2618, 4.2966, 4.5117, 9.9505])
rot_error, trans_error, loss 1.474617600440979 0.3669683635234833 nan
12/27 19:12:18 Loss is infinite, abort
num valid_mask, weights.mean(), weights.max 449 0.026938147842884064 0.09409534186124802
num_valid, ws 0 tensor([5.2363, 2.1232, 1.9240, 1.4735, 2.0403, 1.9763, 6.4409, 3.1939])
rot_error, trans_error, loss 1.2611165046691895 0.29535406827926636 nan
12/27 19:12:20 Loss is infinite, abort
```
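The failure mode in the log above is visible in the `num_valid, ws 0` entries: once every item's total correspondence weight `ws` falls under the validity threshold, no item in the batch is valid, and averaging a loss over zero valid items produces NaN/inf. A guard that skips such batches can be sketched as follows (a minimal pure-Python illustration; the names `batch_loss`, `ws_list`, and `THRESHOLD` are assumptions, not DGR's actual code):

```python
THRESHOLD = 10.0  # matches the ws > 10 validity check mentioned above (assumed)

def batch_loss(per_item_losses, ws_list, threshold=THRESHOLD):
    """Average losses only over items whose total weight exceeds threshold.

    Returns None when no item is valid, so the caller can skip the batch
    instead of backpropagating a NaN produced by averaging over zero items.
    """
    valid = [l for l, w in zip(per_item_losses, ws_list) if w > threshold]
    if not valid:
        return None  # skip this batch
    return sum(valid) / len(valid)

# The final log entries have all ws < 10 -> no valid item -> skip, not NaN.
print(batch_loss([1.3, 0.4], [2.6134, 7.0487]))  # → None
print(batch_loss([1.0, 2.0], [379.1, 366.4]))    # → 1.5
```

Skipping the batch only masks the symptom, though; the steadily shrinking `ws` values suggest the real issue is that the inlier weights themselves are collapsing as training proceeds.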

tatsy commented 2 years ago

Sorry to bother you, but I ran into the same error. Could you share a set of working parameters for ModelNet40, please?

gitouni commented 2 years ago

Maybe you should train the FCGF model on ModelNet40 first?
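This suggestion is consistent with the original log: the line `Inlier weight not found in './FCGF_pretrained_3dmatch.pth'` shows that only the FCGF feature weights are loaded from the checkpoint, while the inlier network starts from scratch. If those features come from a mismatched domain (3DMatch indoor scans vs. unit-scale ModelNet40 objects), the inlier network trains on near-meaningless correspondences. A FCGF checkpoint retrained on ModelNet40 would be partially loaded the same way; the mechanism can be sketched with a plain dict standing in for a PyTorch state_dict (the key prefixes here are illustrative, not DGR's real parameter names):

```python
def split_checkpoint(state_dict, feat_prefix="feat_", inlier_prefix="inlier_"):
    """Split a checkpoint into feature-extractor and inlier-model weights.

    Mirrors the partial loading seen in the log: feature (FCGF) weights are
    taken from the checkpoint, and any missing inlier weights simply stay
    randomly initialized. Prefixes are hypothetical, not DGR's key names.
    """
    feat = {k: v for k, v in state_dict.items() if k.startswith(feat_prefix)}
    inlier = {k: v for k, v in state_dict.items() if k.startswith(inlier_prefix)}
    return feat, inlier

# A feature-only checkpoint, like './FCGF_pretrained_3dmatch.pth' in the log.
ckpt = {"feat_conv1.weight": [0.1], "feat_conv2.weight": [0.2]}
feat, inlier = split_checkpoint(ckpt)
print(sorted(feat), sorted(inlier))  # inlier side is empty, as in the log
```

With real PyTorch weights, the same effect is what `load_state_dict(..., strict=False)` gives you: matching keys are loaded and the rest are reported as missing.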