NVIDIA / NeMo

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)
https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html
Apache License 2.0
12.06k stars 2.51k forks

INF loss when train SpeakerNet? #2623

Closed An-BN closed 3 years ago

An-BN commented 3 years ago

Describe the bug

I trained SpeakerNet, but after a few epochs I got an INF loss, and the following iteration got a NaN loss.

Steps/Code to reproduce bug

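I used transfer learning with speaker_reco_finetune.

I logged the tensor values at each stage of the forward pass (input signal, processed signal, encoder output, logits, and embeddings). Here is the output: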

[NeMo I 2021-08-05 03:12:47 label_models:200] Input_signal: tensor([-7.2640e+00, -5.2837e+00,  5.3772e-01, -8.6496e+00,  5.1234e+00,
            -9.9597e+00, -3.7034e+00, -1.6893e-01,  1.1892e+00, -3.2955e-01,
            -5.0978e+01,  2.5176e+00,  3.3735e+00, -8.2932e+00, -5.8100e-02,
             8.9195e-01, -7.8204e+00,  2.8190e+00,  2.1965e+00, -4.5658e+00,
             4.6412e+00,  7.7891e-01, -8.7750e+00, -3.3856e-01, -1.5517e+00,
             1.3796e+00, -3.7535e+01, -1.6224e-01, -5.6561e+00,  4.7903e-01,
             2.5053e+00, -1.1971e+01,  4.5067e+00, -7.0030e+00, -8.0738e+00,
             5.4799e+00, -5.5992e+01,  1.3271e+00, -5.6283e+00, -7.8360e+00,
            -3.5651e+00,  4.0000e+00, -7.0020e+00,  1.6409e+00, -3.5048e-01,
             2.2463e+00, -5.0571e+00, -1.6083e+00, -6.0765e+01,  4.7173e-02,
             5.3641e+00,  5.8133e+00,  3.2504e+01,  9.3649e-01,  1.5709e+01,
             1.8597e+00, -2.1429e+01, -2.7246e-01, -1.8408e+00,  5.6009e+00,
             5.2642e+00,  1.4973e+00,  4.6505e+00,  1.5815e+00], device='cuda:0')
[NeMo I 2021-08-05 03:12:47 label_models:201] Processed signal: tensor([-0.0238, -0.1374, -0.0758, -0.0362, -0.0437, -0.0189, -0.0039, -0.0015,
            -0.1398, -0.0183, -0.0130, -0.1774, -0.0230, -0.0103, -0.0147, -0.0155,
            -0.0143, -0.0106, -0.0170, -0.0085, -0.0119, -0.0261, -0.0108, -0.0110,
            -0.0078, -0.0209, -0.0518, -0.0134, -0.0800, -0.0375, -0.0252, -0.0052,
            -0.0038, -0.0338, -0.1895, -0.0297,  0.0006, -0.0195, -0.0073, -0.0771,
            -0.0071, -0.0300, -0.0399, -0.0247, -0.2608, -0.0256, -0.0236, -0.0480,
            -0.0120, -0.0285, -0.0789, -0.0340, -0.0088,  0.0015, -0.0135, -0.0242,
            -0.0280, -0.0178, -0.0084, -0.0225, -0.0489, -0.0181, -0.0512, -0.0230],
           device='cuda:0')
[NeMo I 2021-08-05 03:12:47 label_models:202] Encoded: tensor([775.9235, 761.7243, 814.4641, 779.1350, 728.0189, 749.5628, 742.4701,
            738.1171, 804.6092, 752.2346, 722.1220, 811.1537, 756.2740, 741.3098,
            787.4878, 747.1573, 753.1722, 791.8665, 829.0006, 767.3861, 721.6051,
            744.1120, 740.8286, 764.8818, 770.6669, 769.8176, 759.4538, 747.7944,
            821.2345, 730.4551, 735.5352, 760.4753, 746.2045, 791.3550, 773.8812,
            825.3999, 725.9216, 758.0412, 770.6284, 784.1265, 753.4616, 770.6104,
            762.1423, 768.8573, 832.8608, 805.1642, 817.6407, 758.5172, 747.1414,
            771.8086, 759.2638, 774.9775, 789.0172, 748.1165, 811.1917, 751.2094,
            751.8005, 750.8981, 745.0760, 742.4553, 734.7489, 785.9377, 757.0728,
            758.7659], device='cuda:0')
[NeMo I 2021-08-05 03:12:47 label_models:203] logits: tensor([ 837.9282,  933.0463,  904.5047, 1146.3391, 1270.3425,  829.1025,
             939.2279,  859.5983, 1080.9069, 1055.0779, 1325.0914,  895.8984,
             928.6825,  970.0303, 1194.4274, 1291.1360,  892.8967,  919.5685,
             620.6583,  847.6467, 1311.5002,  911.1189,  973.6673,  891.4790,
            1165.7302,  800.3695, 1237.9429, 1332.2549,  925.8285, 1257.8645,
             913.4201, 1279.2789,  915.7543,  866.2889, 1012.7273, 1171.7438,
             831.6862,  872.0796,  901.1927,  853.4565,  803.9324,  842.8481,
             868.1702,  890.2934, 1128.2134,  783.4645,  770.2278,  903.2722,
            1225.4275, 1173.8964,  966.3867,  953.6735, 1124.6262, 1163.0758,
            1162.4150,  924.2391,  885.0515,  849.4289, 1309.8447,  884.9170,
             856.6614, 1206.6677,  880.4076,  914.4090], device='cuda:0',
           grad_fn=<SumBackward1>) --------- embs: tensor([ -15.6109,   -6.5232,   -3.1432,   28.2739,   26.4231,  -22.0438,
              -8.6381,  -13.7220,   12.4318,   12.8998,   26.3078,  -14.2675,
              -7.4201,   -8.2558,   44.7105,   34.8373,  -16.3274,  -14.7989,
            -192.9999,  -15.0163,   41.0528,  -13.1249,   -7.0137,  -14.6647,
              28.7706,  -15.4204,   32.1244,   32.0851,   -3.4967,   26.7447,
             -10.4239,   40.4478,  -13.6846,   -9.6578,   -4.6466,   70.1545,
             -15.0264,  -10.8492,  -12.8593,  -19.7867,  -18.7421,  -17.2652,
             -15.7453,  -11.6013,   37.4433,  -13.1238,  -16.3968,  -15.6007,
              21.0131,   44.6906,   -4.6221,   -9.0381,   37.1265,   17.0363,
              40.0020,  -12.1931,  -18.4307,  -18.2255,   26.2079,  -10.5125,
             -12.6007,   36.1346,  -18.8209,  -14.5773], device='cuda:0',
           grad_fn=<SumBackward1>)

Epoch 10:  13%|█▎        | 809/6357 [08:56<1:01:20,  1.51it/s, loss=inf, v_num=4-35][NeMo I 2021-08-05 03:12:47 audio_to_label:350] Features type: 1.312255859375 ----- torch.Size([60799])

[NeMo I 2021-08-05 03:12:47 label_models:200] Input_signal: tensor([-7.3667e+00,  4.3621e+00, -7.8351e+00, -6.1382e+00,  1.1234e+00,
            -7.8352e+00,  3.8215e+01, -6.5747e+00,  1.4901e+00, -6.3032e+00,
             1.6229e+00,  4.3975e+00,  6.8592e-02,  4.0246e-01, -7.6963e-01,
             6.6508e-01,  1.0280e+00,  7.9978e+00,  8.3737e-01, -9.2365e-01,
             5.5330e+00,  1.1063e+01,  1.4176e+00, -8.3671e-01,  5.8054e-01,
            -7.1677e+00, -2.1789e-01, -8.1073e-01,  7.7098e-01, -7.3476e+00,
             4.6917e-01,  2.5451e+00,  3.1671e-01,  2.3164e-01, -8.6706e+00,
             2.4546e+00, -6.9703e+00, -6.1533e+00, -4.2260e-01,  1.6372e-01,
             1.8021e-01, -3.5878e-01,  1.6875e+00,  2.2597e+00,  5.2333e-01,
            -7.4447e+00, -1.8976e-01,  1.5130e+00,  2.7506e+00,  1.2589e+01,
            -3.2763e-01, -3.1738e-01, -6.3092e+00, -1.4495e+01,  2.2881e+00,
            -1.5554e+00,  1.6894e+00, -7.2745e+00, -2.5398e-02, -7.3677e+00,
             2.1140e+00,  2.3412e+00,  1.7505e+00,  7.2517e+01], device='cuda:0')
[NeMo I 2021-08-05 03:12:47 label_models:201] Processed signal: tensor([-0.0110, -0.0172, -0.0152, -0.0177, -0.0064, -0.0467, -0.0036, -0.0183,
            -0.0460, -0.0193, -0.0357, -0.0074, -0.0592, -0.0272, -0.0161, -0.0178,
            -0.0515, -0.0169, -0.0309, -0.0081, -0.0219, -0.0155, -0.0187, -0.0715,
            -0.0151, -0.0398, -0.0326, -0.0446, -0.0133, -0.0278, -0.1472, -0.0245,
            -0.0451, -0.0238, -0.0154, -0.0143, -0.0107, -0.0138, -0.0198, -0.1016,
            -0.0357, -0.0269,  0.0035, -0.0332, -0.0304, -0.0137, -0.0897, -0.1263,
            -0.0245, -0.0462, -0.0238, -0.0120, -0.0258, -0.0222, -0.0277, -0.0139,
            -0.0184, -0.0076, -0.0403, -0.0533, -0.0268, -0.0101, -0.1688, -0.0063],
           device='cuda:0')
[NeMo I 2021-08-05 03:12:47 label_models:202] Encoded: tensor([688.4766, 715.8893, 689.5655, 708.5850, 704.4224, 773.0853, 721.6960,
            727.1276, 711.0944, 678.0993, 743.5980, 694.5203, 724.0637, 684.3538,
            727.6652, 710.1851, 701.5847, 704.3207, 721.2671, 691.4686, 685.7607,
            713.0606, 707.0889, 704.0615, 708.6085, 682.6095, 693.8227, 732.7479,
            745.0822, 692.2618, 787.9812, 716.8567, 710.3370, 706.4578, 692.3884,
            706.5530, 676.1014, 687.0576, 779.3427, 700.9290, 758.4126, 697.1790,
            745.2277, 684.4161, 705.5687, 674.8506, 760.0540, 811.5912, 685.9936,
            705.9686, 707.3126, 689.3103, 687.7748, 761.4349, 718.7142, 698.1862,
            697.4583, 693.9688, 697.5015, 709.8937, 739.5364, 727.1003, 721.4929,
            701.9783], device='cuda:0')
[NeMo I 2021-08-05 03:12:47 label_models:203] logits: tensor([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
            nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
            nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan],
           device='cuda:0', grad_fn=<SumBackward1>) --------- embs: tensor([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
            nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
            nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan],
           device='cuda:0', grad_fn=<SumBackward1>)

Epoch 10:  13%|█▎        | 810/6357 [08:56<1:01:16,  1.51it/s, loss=inf, v_num=4-35]
Epoch 10:  13%|█▎        | 810/6357 [08:56<1:01:16,  1.51it/s, loss=nan, v_num=4-35][NeMo I 2021-08-05 03:12:48 audio_to_label:350] Features type: -1.804046630859375 ----- torch.Size([28479])
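
The logged values just before the INF step are large (on the order of 10^3). As a minimal sketch of why scores of that size can be numerically fragile, the snippet below compares a naive cross-entropy with one that uses the log-sum-exp trick. It is an illustration only, not a diagnosis of this issue: it assumes a handful of the logged magnitudes can be treated as class scores for a single sample (the log does not say this), the function names are hypothetical, and PyTorch's built-in cross-entropy already uses a stable formulation.

```python
import math

def naive_cross_entropy(logits, target):
    """Cross-entropy computed with exp() on the raw logits (illustrative only)."""
    try:
        exps = [math.exp(z) for z in logits]
        return -math.log(exps[target] / sum(exps))
    except (OverflowError, ValueError):
        # exp() overflows for large inputs, and log(0) is undefined
        # when the probability underflows to zero.
        return math.inf

def stable_cross_entropy(logits, target):
    """Cross-entropy using the log-sum-exp trick: subtract the max first."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target]

# A few magnitudes copied from the first logged batch above, treated here
# as if they were class scores for one sample (an assumption).
logits = [837.9282, 933.0463, 904.5047, 1146.3391, 1270.3425, 1332.2549]

naive = naive_cross_entropy(logits, target=0)
stable = stable_cross_entropy(logits, target=0)
print("naive: ", naive)
print("stable:", stable)
```

The naive version returns inf on these inputs, while the stable version stays finite. A finite but very large loss like this can still produce exploding gradients, so the learning rate and loss configuration are worth checking as well.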

Environment overview (please complete the following information)

  • I used the NeMo Docker image

Additional context

GPU: DGX-1

titu1994 commented 3 years ago

@nithinraok

nithinraok commented 3 years ago

Thanks for creating this issue.

Can you provide the following info:

  1. Are you using angular softmax loss? (Specifically, which config are you basing your setup on?)
  2. Learning rate and number of epochs.
  3. Number of speakers in your fine-tuning dataset.
  4. Which pretrained model are you using?

ACheun9 commented 2 years ago


Hello, have you fixed this problem yet? I am getting an inf loss too.