FunAudioLLM / CosyVoice

Multi-lingual large voice generation model, providing full-stack capabilities for inference, training, and deployment.
https://funaudiollm.github.io/
Apache License 2.0

Training the model from scratch: loss suddenly becomes nan #652

Open lilinqin opened 2 weeks ago

lilinqin commented 2 weeks ago

When training the model from scratch, the loss always suddenly becomes nan.

2024-11-15 10:56:12,203 DEBUG TRAIN Batch 0/16500 loss 1.591806 acc 0.290056 lr 0.00082510 grad_norm 0.199882 rank 2
2024-11-15 10:56:12,203 DEBUG TRAIN Batch 0/16500 loss 1.591806 acc 0.290056 lr 0.00082510 grad_norm 0.199882 rank 7
2024-11-15 10:56:12,203 DEBUG TRAIN Batch 0/16500 loss 1.591806 acc 0.290056 lr 0.00082510 grad_norm 0.199882 rank 5
2024-11-15 10:56:12,203 DEBUG TRAIN Batch 0/16500 loss 1.591806 acc 0.290056 lr 0.00082510 grad_norm 0.199882 rank 1
2024-11-15 10:56:12,203 DEBUG TRAIN Batch 0/16500 loss 1.591806 acc 0.290056 lr 0.00082510 grad_norm 0.199882 rank 3
2024-11-15 10:56:12,203 DEBUG TRAIN Batch 0/16500 loss 1.591806 acc 0.290056 lr 0.00082510 grad_norm 0.199882 rank 4
2024-11-15 10:56:12,204 DEBUG TRAIN Batch 0/16500 loss 1.591806 acc 0.290056 lr 0.00082510 grad_norm 0.199882 rank 6
2024-11-15 10:56:12,204 DEBUG TRAIN Batch 0/16500 loss 1.591806 acc 0.290056 lr 0.00082510 grad_norm 0.199882 rank 0
2024-11-15 10:56:35,661 DEBUG TRAIN Batch 0/16600 loss nan acc 0.259458 lr 0.00083010 grad_norm nan rank 4
2024-11-15 10:56:35,661 DEBUG TRAIN Batch 0/16600 loss nan acc 0.259458 lr 0.00083010 grad_norm nan rank 2
2024-11-15 10:56:35,661 DEBUG TRAIN Batch 0/16600 loss nan acc 0.259458 lr 0.00083010 grad_norm nan rank 5
2024-11-15 10:56:35,661 DEBUG TRAIN Batch 0/16600 loss nan acc 0.259458 lr 0.00083010 grad_norm nan rank 3
2024-11-15 10:56:35,661 DEBUG TRAIN Batch 0/16600 loss nan acc 0.259458 lr 0.00083010 grad_norm nan rank 6
2024-11-15 10:56:35,662 DEBUG TRAIN Batch 0/16600 loss nan acc 0.259458 lr 0.00083010 grad_norm nan rank 1
2024-11-15 10:56:35,662 DEBUG TRAIN Batch 0/16600 loss nan acc 0.259458 lr 0.00083010 grad_norm nan rank 7
2024-11-15 10:56:35,663 DEBUG TRAIN Batch 0/16600 loss nan acc 0.259458 lr 0.00083010 grad_norm nan rank 0
2024-11-15 10:56:59,788 DEBUG TRAIN Batch 0/16700 loss 1.692054 acc 0.256676 lr 0.00083510 grad_norm nan rank 1
2024-11-15 10:56:59,788 DEBUG TRAIN Batch 0/16700 loss 1.692054 acc 0.256676 lr 0.00083510 grad_norm nan rank 5
2024-11-15 10:56:59,788 DEBUG TRAIN Batch 0/16700 loss 1.692054 acc 0.256676 lr 0.00083510 grad_norm nan rank 2
2024-11-15 10:56:59,788 DEBUG TRAIN Batch 0/16700 loss 1.692054 acc 0.256676 lr 0.00083510 grad_norm nan rank 4
2024-11-15 10:56:59,788 DEBUG TRAIN Batch 0/16700 loss 1.692054 acc 0.256676 lr 0.00083510 grad_norm nan rank 3
2024-11-15 10:56:59,788 DEBUG TRAIN Batch 0/16700 loss 1.692054 acc 0.256676 lr 0.00083510 grad_norm nan rank 6
2024-11-15 10:56:59,788 DEBUG TRAIN Batch 0/16700 loss 1.692054 acc 0.256676 lr 0.00083510 grad_norm nan rank 7
2024-11-15 10:56:59,790 DEBUG TRAIN Batch 0/16700 loss 1.692054 acc 0.256676 lr 0.00083510 grad_norm nan rank 0
2024-11-15 10:57:23,045 DEBUG TRAIN Batch 0/16800 loss 1.640502 acc 0.279475 lr 0.00084010 grad_norm nan rank 2
2024-11-15 10:57:23,045 DEBUG TRAIN Batch 0/16800 loss 1.640502 acc 0.279475 lr 0.00084010 grad_norm nan rank 4
2024-11-15 10:57:23,045 DEBUG TRAIN Batch 0/16800 loss 1.640502 acc 0.279475 lr 0.00084010 grad_norm nan rank 7
2024-11-15 10:57:23,045 DEBUG TRAIN Batch 0/16800 loss 1.640502 acc 0.279475 lr 0.00084010 grad_norm nan rank 1
2024-11-15 10:57:23,045 DEBUG TRAIN Batch 0/16800 loss 1.640502 acc 0.279475 lr 0.00084010 grad_norm nan rank 5
2024-11-15 10:57:23,045 DEBUG TRAIN Batch 0/16800 loss 1.640502 acc 0.279475 lr 0.00084010 grad_norm nan rank 3
2024-11-15 10:57:23,046 DEBUG TRAIN Batch 0/16800 loss 1.640502 acc 0.279475 lr 0.00084010 grad_norm nan rank 6
2024-11-15 10:57:23,048 DEBUG TRAIN Batch 0/16800 loss 1.640502 acc 0.279475 lr 0.00084010 grad_norm nan rank 0
2024-11-15 10:57:48,112 DEBUG TRAIN Batch 0/16900 loss nan acc 0.251761 lr 0.00084510 grad_norm nan rank 7
2024-11-15 10:57:48,112 DEBUG TRAIN Batch 0/16900 loss nan acc 0.251761 lr 0.00084510 grad_norm nan rank 5
2024-11-15 10:57:48,112 DEBUG TRAIN Batch 0/16900 loss nan acc 0.251761 lr 0.00084510 grad_norm nan rank 4
2024-11-15 10:57:48,112 DEBUG TRAIN Batch 0/16900 loss nan acc 0.251761 lr 0.00084510 grad_norm nan rank 3
2024-11-15 10:57:48,112 DEBUG TRAIN Batch 0/16900 loss nan acc 0.251761 lr 0.00084510 grad_norm nan rank 1
2024-11-15 10:57:48,112 DEBUG TRAIN Batch 0/16900 loss nan acc 0.251761 lr 0.00084510 grad_norm nan rank 6
2024-11-15 10:57:48,113 DEBUG TRAIN Batch 0/16900 loss nan acc 0.251761 lr 0.00084510 grad_norm nan rank 2
2024-11-15 10:57:48,114 DEBUG TRAIN Batch 0/16900 loss nan acc 0.251761 lr 0.00084510 grad_norm nan rank 0
2024-11-15 10:58:11,407 DEBUG TRAIN Batch 0/17000 loss nan acc 0.000000 lr 0.00085010 grad_norm nan rank 7
2024-11-15 10:58:11,407 DEBUG TRAIN Batch 0/17000 loss nan acc 0.000000 lr 0.00085010 grad_norm nan rank 5
2024-11-15 10:58:11,407 DEBUG TRAIN Batch 0/17000 loss nan acc 0.000000 lr 0.00085010 grad_norm nan rank 4
2024-11-15 10:58:11,407 DEBUG TRAIN Batch 0/17000 loss nan acc 0.000000 lr 0.00085010 grad_norm nan rank 6
2024-11-15 10:58:11,407 DEBUG TRAIN Batch 0/17000 loss nan acc 0.000000 lr 0.00085010 grad_norm nan rank
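The log shows grad_norm going non-finite around batch 16600 and the loss collapsing permanently by batch 17000 (acc drops to 0). Before blaming the optimizer or mixed precision, it can help to confirm the prepared data itself contains no non-finite values. A minimal, hypothetical diagnostic sketch (the dict-style `batch` and the `dataloader` name are assumptions, not CosyVoice's actual data pipeline):

```python
# Hypothetical diagnostic, not part of CosyVoice: scan prepared batches for
# NaN/Inf values to rule out corrupt features.
import torch

def find_bad_batches(dataloader, max_batches=None):
    for i, batch in enumerate(dataloader):
        if max_batches is not None and i >= max_batches:
            break
        # Assumes each batch is a dict of tensors (e.g. speech features, embeddings).
        for key, value in batch.items():
            if torch.is_tensor(value) and value.is_floating_point():
                if not torch.isfinite(value).all():
                    print(f"batch {i}: non-finite values in field '{key}'")
```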

aluminumbox commented 2 weeks ago

Try commenting out --use-amp. You prepared the data yourself, right?
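A minimal sketch of this suggestion, not CosyVoice's actual training loop: run the step in plain fp32 (no `torch.cuda.amp.autocast` / `GradScaler`, i.e. with --use-amp removed) and skip any update whose loss or gradient norm is non-finite, so a single bad batch cannot poison the optimizer state. The `model`, `optimizer`, and `batch` names and the `"loss"` key are placeholders.

```python
import torch

def train_step(model, optimizer, batch, max_grad_norm=5.0):
    optimizer.zero_grad(set_to_none=True)

    # Placeholder fp32 forward pass; no autocast context since AMP is disabled.
    loss = model(batch)["loss"]
    if not torch.isfinite(loss):
        return None  # skip: loss already blew up on this batch

    loss.backward()

    # clip_grad_norm_ returns the pre-clipping total norm; check it before stepping.
    grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
    if torch.isfinite(grad_norm):
        optimizer.step()
    return loss.item()
```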