Closed: An-BN closed this issue 3 years ago.
@nithinraok
Thanks for creating this issue. Can you provide the following info?
Describe the bug
I trained SpeakerNet, but after training for some epochs I got an INF loss, and the following iteration produced a NaN loss.
Steps/Code to reproduce bug
I used transfer learning with the speaker_reco_finetune script.
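Roughly, the setup looks like the sketch below (not my exact script: the pretrained model name, manifest path, batch size, and trainer settings are placeholders, and it skips the decoder re-initialization for the new number of speakers that the fine-tuning script performs):

```python
# Minimal sketch of the transfer-learning setup (placeholder names/paths,
# not the exact values from my run).
import pytorch_lightning as pl
from omegaconf import OmegaConf
from nemo.collections.asr.models import EncDecSpeakerLabelModel

trainer = pl.Trainer(gpus=1, max_epochs=20)

# Start from a pretrained SpeakerNet checkpoint (the model name here is an assumption).
model = EncDecSpeakerLabelModel.from_pretrained(model_name="speakerverification_speakernet")
model.set_trainer(trainer)

# Point the model at the fine-tuning data (placeholder manifest/config values).
train_ds = OmegaConf.create({
    "manifest_filepath": "train_manifest.json",
    "sample_rate": 16000,
    "labels": None,   # in my setup the labels come from the manifest (assumption)
    "batch_size": 64,
    "shuffle": True,
})
model.setup_training_data(train_ds)

trainer.fit(model)
```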
I tried logging the values at each stage of the model; the output is below:
[NeMo I 2021-08-05 03:12:47 label_models:200] Input_signal: tensor([-7.2640e+00, -5.2837e+00, 5.3772e-01, -8.6496e+00, 5.1234e+00, -9.9597e+00, -3.7034e+00, -1.6893e-01, 1.1892e+00, -3.2955e-01, -5.0978e+01, 2.5176e+00, 3.3735e+00, -8.2932e+00, -5.8100e-02, 8.9195e-01, -7.8204e+00, 2.8190e+00, 2.1965e+00, -4.5658e+00, 4.6412e+00, 7.7891e-01, -8.7750e+00, -3.3856e-01, -1.5517e+00, 1.3796e+00, -3.7535e+01, -1.6224e-01, -5.6561e+00, 4.7903e-01, 2.5053e+00, -1.1971e+01, 4.5067e+00, -7.0030e+00, -8.0738e+00, 5.4799e+00, -5.5992e+01, 1.3271e+00, -5.6283e+00, -7.8360e+00, -3.5651e+00, 4.0000e+00, -7.0020e+00, 1.6409e+00, -3.5048e-01, 2.2463e+00, -5.0571e+00, -1.6083e+00, -6.0765e+01, 4.7173e-02, 5.3641e+00, 5.8133e+00, 3.2504e+01, 9.3649e-01, 1.5709e+01, 1.8597e+00, -2.1429e+01, -2.7246e-01, -1.8408e+00, 5.6009e+00, 5.2642e+00, 1.4973e+00, 4.6505e+00, 1.5815e+00], device='cuda:0')
[NeMo I 2021-08-05 03:12:47 label_models:201] Processed signal: tensor([-0.0238, -0.1374, -0.0758, -0.0362, -0.0437, -0.0189, -0.0039, -0.0015, -0.1398, -0.0183, -0.0130, -0.1774, -0.0230, -0.0103, -0.0147, -0.0155, -0.0143, -0.0106, -0.0170, -0.0085, -0.0119, -0.0261, -0.0108, -0.0110, -0.0078, -0.0209, -0.0518, -0.0134, -0.0800, -0.0375, -0.0252, -0.0052, -0.0038, -0.0338, -0.1895, -0.0297, 0.0006, -0.0195, -0.0073, -0.0771, -0.0071, -0.0300, -0.0399, -0.0247, -0.2608, -0.0256, -0.0236, -0.0480, -0.0120, -0.0285, -0.0789, -0.0340, -0.0088, 0.0015, -0.0135, -0.0242, -0.0280, -0.0178, -0.0084, -0.0225, -0.0489, -0.0181, -0.0512, -0.0230], device='cuda:0')
[NeMo I 2021-08-05 03:12:47 label_models:202] Encoded: tensor([775.9235, 761.7243, 814.4641, 779.1350, 728.0189, 749.5628, 742.4701, 738.1171, 804.6092, 752.2346, 722.1220, 811.1537, 756.2740, 741.3098, 787.4878, 747.1573, 753.1722, 791.8665, 829.0006, 767.3861, 721.6051, 744.1120, 740.8286, 764.8818, 770.6669, 769.8176, 759.4538, 747.7944, 821.2345, 730.4551, 735.5352, 760.4753, 746.2045, 791.3550, 773.8812, 825.3999, 725.9216, 758.0412, 770.6284, 784.1265, 753.4616, 770.6104, 762.1423, 768.8573, 832.8608, 805.1642, 817.6407, 758.5172, 747.1414, 771.8086, 759.2638, 774.9775, 789.0172, 748.1165, 811.1917, 751.2094, 751.8005, 750.8981, 745.0760, 742.4553, 734.7489, 785.9377, 757.0728, 758.7659], device='cuda:0')
[NeMo I 2021-08-05 03:12:47 label_models:203] logits: tensor([ 837.9282, 933.0463, 904.5047, 1146.3391, 1270.3425, 829.1025, 939.2279, 859.5983, 1080.9069, 1055.0779, 1325.0914, 895.8984, 928.6825, 970.0303, 1194.4274, 1291.1360, 892.8967, 919.5685, 620.6583, 847.6467, 1311.5002, 911.1189, 973.6673, 891.4790, 1165.7302, 800.3695, 1237.9429, 1332.2549, 925.8285, 1257.8645, 913.4201, 1279.2789, 915.7543, 866.2889, 1012.7273, 1171.7438, 831.6862, 872.0796, 901.1927, 853.4565, 803.9324, 842.8481, 868.1702, 890.2934, 1128.2134, 783.4645, 770.2278, 903.2722, 1225.4275, 1173.8964, 966.3867, 953.6735, 1124.6262, 1163.0758, 1162.4150, 924.2391, 885.0515, 849.4289, 1309.8447, 884.9170, 856.6614, 1206.6677, 880.4076, 914.4090], device='cuda:0', grad_fn=<SumBackward1>) --------- embs: tensor([ -15.6109, -6.5232, -3.1432, 28.2739, 26.4231, -22.0438, -8.6381, -13.7220, 12.4318, 12.8998, 26.3078, -14.2675, -7.4201, -8.2558, 44.7105, 34.8373, -16.3274, -14.7989, -192.9999, -15.0163, 41.0528, -13.1249, -7.0137, -14.6647, 28.7706, -15.4204, 32.1244, 32.0851, -3.4967, 26.7447, -10.4239, 40.4478, -13.6846, -9.6578, -4.6466, 70.1545, -15.0264, -10.8492, -12.8593, -19.7867, -18.7421, -17.2652, -15.7453, -11.6013, 37.4433, -13.1238, -16.3968, -15.6007, 21.0131, 44.6906, -4.6221, -9.0381, 37.1265, 17.0363, 40.0020, -12.1931, -18.4307, -18.2255, 26.2079, -10.5125, -12.6007, 36.1346, -18.8209, -14.5773], device='cuda:0', grad_fn=<SumBackward1>)
Epoch 10: 13%|█▎ | 809/6357 [08:56<1:01:20, 1.51it/s, loss=inf, v_num=4-35]
[NeMo I 2021-08-05 03:12:47 audio_to_label:350] Features type: 1.312255859375 ----- torch.Size([60799])
[NeMo I 2021-08-05 03:12:47 label_models:200] Input_signal: tensor([-7.3667e+00, 4.3621e+00, -7.8351e+00, -6.1382e+00, 1.1234e+00, -7.8352e+00, 3.8215e+01, -6.5747e+00, 1.4901e+00, -6.3032e+00, 1.6229e+00, 4.3975e+00, 6.8592e-02, 4.0246e-01, -7.6963e-01, 6.6508e-01, 1.0280e+00, 7.9978e+00, 8.3737e-01, -9.2365e-01, 5.5330e+00, 1.1063e+01, 1.4176e+00, -8.3671e-01, 5.8054e-01, -7.1677e+00, -2.1789e-01, -8.1073e-01, 7.7098e-01, -7.3476e+00, 4.6917e-01, 2.5451e+00, 3.1671e-01, 2.3164e-01, -8.6706e+00, 2.4546e+00, -6.9703e+00, -6.1533e+00, -4.2260e-01, 1.6372e-01, 1.8021e-01, -3.5878e-01, 1.6875e+00, 2.2597e+00, 5.2333e-01, -7.4447e+00, -1.8976e-01, 1.5130e+00, 2.7506e+00, 1.2589e+01, -3.2763e-01, -3.1738e-01, -6.3092e+00, -1.4495e+01, 2.2881e+00, -1.5554e+00, 1.6894e+00, -7.2745e+00, -2.5398e-02, -7.3677e+00, 2.1140e+00, 2.3412e+00, 1.7505e+00, 7.2517e+01], device='cuda:0')
[NeMo I 2021-08-05 03:12:47 label_models:201] Processed signal: tensor([-0.0110, -0.0172, -0.0152, -0.0177, -0.0064, -0.0467, -0.0036, -0.0183, -0.0460, -0.0193, -0.0357, -0.0074, -0.0592, -0.0272, -0.0161, -0.0178, -0.0515, -0.0169, -0.0309, -0.0081, -0.0219, -0.0155, -0.0187, -0.0715, -0.0151, -0.0398, -0.0326, -0.0446, -0.0133, -0.0278, -0.1472, -0.0245, -0.0451, -0.0238, -0.0154, -0.0143, -0.0107, -0.0138, -0.0198, -0.1016, -0.0357, -0.0269, 0.0035, -0.0332, -0.0304, -0.0137, -0.0897, -0.1263, -0.0245, -0.0462, -0.0238, -0.0120, -0.0258, -0.0222, -0.0277, -0.0139, -0.0184, -0.0076, -0.0403, -0.0533, -0.0268, -0.0101, -0.1688, -0.0063], device='cuda:0')
[NeMo I 2021-08-05 03:12:47 label_models:202] Encoded: tensor([688.4766, 715.8893, 689.5655, 708.5850, 704.4224, 773.0853, 721.6960, 727.1276, 711.0944, 678.0993, 743.5980, 694.5203, 724.0637, 684.3538, 727.6652, 710.1851, 701.5847, 704.3207, 721.2671, 691.4686, 685.7607, 713.0606, 707.0889, 704.0615, 708.6085, 682.6095, 693.8227, 732.7479, 745.0822, 692.2618, 787.9812, 716.8567, 710.3370, 706.4578, 692.3884, 706.5530, 676.1014, 687.0576, 779.3427, 700.9290, 758.4126, 697.1790, 745.2277, 684.4161, 705.5687, 674.8506, 760.0540, 811.5912, 685.9936, 705.9686, 707.3126, 689.3103, 687.7748, 761.4349, 718.7142, 698.1862, 697.4583, 693.9688, 697.5015, 709.8937, 739.5364, 727.1003, 721.4929, 701.9783], device='cuda:0')
[NeMo I 2021-08-05 03:12:47 label_models:203] logits: tensor([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan], device='cuda:0', grad_fn=<SumBackward1>) --------- embs: tensor([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan], device='cuda:0', grad_fn=<SumBackward1>)
Epoch 10: 13%|█▎ | 810/6357 [08:56<1:01:16, 1.51it/s, loss=inf, v_num=4-35]
Epoch 10: 13%|█▎ | 810/6357 [08:56<1:01:16, 1.51it/s, loss=nan, v_num=4-35]
[NeMo I 2021-08-05 03:12:48 audio_to_label:350] Features type: -1.804046630859375 ----- torch.Size([28479])
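For context, the Input_signal / Processed signal / Encoded / logits / embs lines above come from extra logging I added inside EncDecSpeakerLabelModel.forward() (a local edit of label_models.py, which is why line numbers 200-203 appear in the logs). It looks roughly like this sketch; the exact reductions I used to print per-example values may differ slightly:

```python
# Sketch of the local logging edit in label_models.py (forward() of
# EncDecSpeakerLabelModel). The sums are only there to print one value
# per batch element instead of the full tensors.
from nemo.utils import logging

def forward(self, input_signal, input_signal_length):
    logging.info(f"Input_signal: {input_signal.sum(-1)}")

    processed_signal, processed_signal_len = self.preprocessor(
        input_signal=input_signal, length=input_signal_length,
    )
    logging.info(f"Processed signal: {processed_signal.sum(-1).sum(-1)}")

    encoded, encoded_len = self.encoder(audio_signal=processed_signal, length=processed_signal_len)
    logging.info(f"Encoded: {encoded.sum(-1).sum(-1)}")

    logits, embs = self.decoder(encoder_output=encoded, length=encoded_len)
    logging.info(f"logits: {logits.sum(-1)} --------- embs: {embs.sum(-1)}")
    return logits, embs
```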
Environment overview (please complete the following information)
- I used the NeMo Docker container
Additional context
GPU: DGX-1
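Since the loss flips from INF to NaN between two consecutive iterations, I want to find the first tensor that becomes non-finite. Something along these lines is what I have in mind (just a sketch; check_finite is a helper name I made up, not a NeMo function):

```python
import torch

def check_finite(name, tensor):
    # Raise as soon as any inf/nan shows up so the failing batch/stage is identified.
    if not torch.isfinite(tensor).all():
        raise RuntimeError(f"{name} has inf/nan (min={tensor.min()}, max={tensor.max()})")

# Example usage inside the forward pass / training step:
# check_finite("encoded", encoded)
# check_finite("logits", logits)
# check_finite("loss", loss)

# PyTorch's anomaly detection can also point at the op that first produces nan
# gradients in the backward pass (it slows training down, so debugging only):
torch.autograd.set_detect_anomaly(True)
```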
Hello, has this problem been fixed yet? I am getting INF loss too.