flashlight / flashlight

A C++ standalone library for machine learning
https://fl.readthedocs.io/en/latest/
MIT License

Loss has NaN values #686

Open arthur-s opened 3 years ago

arthur-s commented 3 years ago

Hello, I'm trying to create a model using public_series_1 from the Russian dataset open_stt. I'm following this recipe (transformer-ctc) and training on CPU. My data uses wordpieces:

head train.lst:
4c877e7eb451 /home/s/dataset/converted/public_series_1/16000/4c877e7eb451.wav 2.64 доброе утро не уверена
eeee5f202f2b /home/s/dataset/converted/public_series_1/16000/eeee5f202f2b.wav 3.88 мне надо в мэрию отвезёшь
...

head tokens:
_органи
_язы
пор
_исто
_александ
...

head lexicon:
так     _так
внеочередное    _в не о че ре д ное
бояться _бо я ться
налёта  _на л ё та
получателя      _полу ча теля
закручивают     _за к ру чи ва ю т
...

Audio sample info:
Channels       : 1
Sample Rate    : 16000
Precision      : 16-bit
Duration       : 00:00:02.64 = 42240 samples ~ 198 CDDA sectors
File Size      : 84.5k
Bit Rate       : 256k
Sample Encoding: 16-bit Signed Integer PCM

Arch file is https://github.com/flashlight/wav2letter/blob/master/recipes/sota/2019/am_arch/am_transformer_ctc.arch

Train file:

s@ml4:~/dataset/converted/public_series_1$ cat train.cfg
--datadir=/home/s/dataset/converted/
--train=public_series_1/train.lst
# --valid=public_youtube700_val/train.lst
--tokens=/home/s/dataset/converted/public_series_1/tokens_wp.txt
--arch=/home/s/dataset/converted/public_series_1/am.arch
--rundir=/home/s/dataset/converted/public_series_1/saved
--lexicon=/home/s/dataset/converted/public_series_1/lexicon_wp.lst
--iter=10
--alsologtostderr=true

--criterion=ctc
--mfsc
--usewordpiece=true
--wordseparator=_
--lr=0.01
--lrcrit=0.01
--nthread=8
--batchsize=5
--linseg=0
--momentum=0.0
--maxgradnorm=1.0
--onorm=target
--sqnorm
--filterbanks=80
--minisz=200
--mintsz=2
--warmup=32000
--saug_start_update=32000
--lr_decay=180
--lr_decay_step=40

Train log:

(py38) s@ml4:~/dataset/converted/public_series_1$ time fl_asr_train train --flagsfile=train.cfg
I0720 06:55:39.160804 84077 Train.cpp:195] Gflags after parsing
--flagfile=; --fromenv=; --tryfromenv=; --undefok=; --tab_completion_columns=80; --tab_completion_word=; --help=false; --helpfull=false; --helpmatch=; --helpon=; --helppackage=false; --helpshort=false; --helpxml=false; --version=false; --ipl_maxisz=1.7976931348623157e+308; --ipl_maxtsz=9223372036854775807; --ipl_minisz=0; --ipl_mintsz=0; --ipl_relabel_epoch=10000000; --ipl_relabel_ratio=1; --ipl_seed_model_wer=-1; --ipl_use_existing_pl=false; --unsup_datadir=; --unsup_train=; --adambeta1=0.90000000000000002; --adambeta2=0.999; --am=; --am_decoder_tr_dropout=0; --am_decoder_tr_layerdrop=0; --am_decoder_tr_layers=1; --arch=/home/s/dataset/converted/public_series_1/am.arch; --attention=content; --attentionthreshold=2147483647; --attnWindow=no; --attnconvchannel=0; --attnconvkernel=0; --attndim=0; --batching_max_duration=0; --batching_strategy=none; --batchsize=5; --beamsize=2500; --beamsizetoken=250000; --beamthreshold=25; --channels=1; --criterion=ctc; --critoptim=sgd; --datadir=/home/s/dataset/converted/; --decoderattnround=1; --decoderdropout=0; --decoderrnnlayer=1; --decodertype=wrd; --devwin=0; --emission_dir=; --emission_queue_size=3000; --enable_distributed=false; --encoderdim=0; --eosscore=0; --everstoredb=false; --features_type=mfsc; --fftcachesize=1; --filterbanks=80; --fl_amp_max_scale_factor=32000; --fl_amp_scale_factor=4096; --fl_amp_scale_factor_update_interval=2000; --fl_amp_use_mixed_precision=false; --fl_benchmark_mode=true; --fl_log_level=; --fl_log_mem_ops_interval=0; --fl_optim_mode=; --fl_vlog_level=0; --flagsfile=train.cfg; --framesizems=25; --framestridems=10; --gamma=1; --gumbeltemperature=1; --highfreqfilterbank=-1; --inputfeeding=false; --isbeamdump=false; --iter=10; --itersave=false; --labelsmooth=0; --leftWindowSize=50; --lexicon=/home/s/dataset/converted/public_series_1/lexicon_wp.lst; --linlr=-1; --linlrcrit=-1; --linseg=0; --lm=; --lm_memory=5000; --lm_vocab=; --lmtype=kenlm; --lmweight=0; --lmweight_high=4; --lmweight_low=0; --lmweight_step=0.20000000000000001; --localnrmlleftctx=0; --localnrmlrightctx=0; --logadd=false; --lowfreqfilterbank=0; --lr=0.01; --lr_decay=180; --lr_decay_step=40; --lrcosine=false; --lrcrit=0.01; --max_devices_per_node=8; --maxdecoderoutputlen=200; --maxgradnorm=1; --maxload=-1; --maxrate=10; --maxsil=50; --maxword=-1; --melfloor=1; --mfcccoeffs=13; --minrate=3; --minsil=0; --momentum=0; --netoptim=sgd; --nthread=8; --nthread_decoder=1; --nthread_decoder_am_forward=1; --numattnhead=8; --onorm=target; --optimepsilon=1e-08; --optimrho=0.90000000000000002; --pctteacherforcing=100; --pcttraineval=100; --pretrainWindow=0; --replabel=0; --reportiters=0; --rightWindowSize=50; --rndv_filepath=; --rundir=/home/s/dataset/converted/public_series_1/saved; --samplerate=16000; --sampletarget=0; --samplingstrategy=rand; --saug_fmaskf=27; --saug_fmaskn=2; --saug_start_update=32000; --saug_tmaskn=2; --saug_tmaskp=1; --saug_tmaskt=100; --sclite=; --seed=0; --sfx_config=; --sfx_start_update=2147483647; --show=false; --showletters=false; --silscore=0; --smearing=none; --smoothingtemperature=1; --softwoffset=10; --softwrate=5; --softwstd=5; --sqnorm=true; --stepsize=9223372036854775807; --surround=; --test=; --tokens=/home/s/dataset/converted/public_series_1/tokens_wp.txt; --train=public_series_1/train.lst; --trainWithWindow=false; --transdiag=0; --unkscore=-inf; --use_memcache=false; --uselexicon=true; --usewordpiece=true; --valid=public_youtube700_val/train.lst; --validbatchsize=-1; --warmup=32000; --weightdecay=0; --wordscore=0; --wordseparator=_; 
--world_rank=0; --world_size=1; --alsologtoemail=; --alsologtostderr=true; --colorlogtostderr=false; --drop_log_memory=true; --log_backtrace_at=; --log_dir=; --log_link=; --log_prefix=true; --logbuflevel=0; --logbufsecs=30; --logemaillevel=999; --logfile_mode=436; --logmailer=/bin/mail; --logtostderr=false; --max_log_size=1800; --minloglevel=0; --stderrthreshold=2; --stop_logging_if_full_disk=false; --symbolize_stacktrace=true; --v=0; --vmodule=;
I0720 06:55:39.161195 84077 Train.cpp:196] Experiment path: /home/s/dataset/converted/public_series_1/saved
I0720 06:55:39.161202 84077 Train.cpp:197] Experiment runidx: 1
I0720 06:55:39.162060 84077 Train.cpp:276] Number of classes (network): 935
I0720 06:55:39.195294 84077 Train.cpp:283] Number of words: 19941
I0720 06:55:39.283974 84077 Train.cpp:385] Loading architecture file from /home/s/dataset/converted/public_series_1/am.arch
I0720 06:55:42.778986 84077 Train.cpp:464] [Network] Sequential [input -> (0) -> (1) -> (2) -> (3) -> (4) -> (5) -> (6) -> (7) -> (8) -> (9) -> (10) -> (11) -> (12) -> (13) -> (14) -> (15) -> (16) -> (17) -> (18) -> (19) -> (20) -> (21) -> (22) -> (23) -> (24) -> (25) -> (26) -> (27) -> (28) -> (29) -> (30) -> (31) -> (32) -> (33) -> (34) -> (35) -> (36) -> (37) -> (38) -> (39) -> output]
        (0): View (-1 1 80 0)
        (1): WeightNorm (Conv2D (80->1024, 3x1, 1,1, SAME,0, 1, 1) (with bias), 3)
        (2): GatedLinearUnit (2)
        (3): Dropout (0.200000)
        (4): Pool2D-max (1x1, 2,1, 0,0)
        (5): WeightNorm (Conv2D (512->1024, 3x1, 1,1, SAME,0, 1, 1) (with bias), 3)
        (6): GatedLinearUnit (2)
        (7): Dropout (0.200000)
        (8): Pool2D-max (1x1, 2,1, 0,0)
        (9): WeightNorm (Conv2D (512->2048, 3x1, 1,1, SAME,0, 1, 1) (with bias), 3)
        (10): GatedLinearUnit (2)
        (11): Dropout (0.200000)
        (12): Pool2D-max (1x1, 2,1, 0,0)
        (13): Reorder (2,0,3,1)
        (14): Transformer (nHeads: 4), (pDropout: 0.2), (pLayerdrop: 0.2), (bptt: 460), (useMask: 0), (preLayerNorm: 0)
        (15): Transformer (nHeads: 4), (pDropout: 0.2), (pLayerdrop: 0.2), (bptt: 460), (useMask: 0), (preLayerNorm: 0)
        (16): Transformer (nHeads: 4), (pDropout: 0.2), (pLayerdrop: 0.2), (bptt: 460), (useMask: 0), (preLayerNorm: 0)
        (17): Transformer (nHeads: 4), (pDropout: 0.2), (pLayerdrop: 0.2), (bptt: 460), (useMask: 0), (preLayerNorm: 0)
        (18): Transformer (nHeads: 4), (pDropout: 0.2), (pLayerdrop: 0.2), (bptt: 460), (useMask: 0), (preLayerNorm: 0)
        (19): Transformer (nHeads: 4), (pDropout: 0.2), (pLayerdrop: 0.2), (bptt: 460), (useMask: 0), (preLayerNorm: 0)
        (20): Transformer (nHeads: 4), (pDropout: 0.2), (pLayerdrop: 0.2), (bptt: 460), (useMask: 0), (preLayerNorm: 0)
        (21): Transformer (nHeads: 4), (pDropout: 0.2), (pLayerdrop: 0.2), (bptt: 460), (useMask: 0), (preLayerNorm: 0)
        (22): Transformer (nHeads: 4), (pDropout: 0.2), (pLayerdrop: 0.2), (bptt: 460), (useMask: 0), (preLayerNorm: 0)
        (23): Transformer (nHeads: 4), (pDropout: 0.2), (pLayerdrop: 0.2), (bptt: 460), (useMask: 0), (preLayerNorm: 0)
        (24): Transformer (nHeads: 4), (pDropout: 0.2), (pLayerdrop: 0.2), (bptt: 460), (useMask: 0), (preLayerNorm: 0)
        (25): Transformer (nHeads: 4), (pDropout: 0.2), (pLayerdrop: 0.2), (bptt: 460), (useMask: 0), (preLayerNorm: 0)
        (26): Transformer (nHeads: 4), (pDropout: 0.2), (pLayerdrop: 0.2), (bptt: 460), (useMask: 0), (preLayerNorm: 0)
        (27): Transformer (nHeads: 4), (pDropout: 0.2), (pLayerdrop: 0.2), (bptt: 460), (useMask: 0), (preLayerNorm: 0)
        (28): Transformer (nHeads: 4), (pDropout: 0.2), (pLayerdrop: 0.2), (bptt: 460), (useMask: 0), (preLayerNorm: 0)
        (29): Transformer (nHeads: 4), (pDropout: 0.2), (pLayerdrop: 0.2), (bptt: 460), (useMask: 0), (preLayerNorm: 0)
        (30): Transformer (nHeads: 4), (pDropout: 0.2), (pLayerdrop: 0.2), (bptt: 460), (useMask: 0), (preLayerNorm: 0)
        (31): Transformer (nHeads: 4), (pDropout: 0.2), (pLayerdrop: 0.2), (bptt: 460), (useMask: 0), (preLayerNorm: 0)
        (32): Transformer (nHeads: 4), (pDropout: 0.2), (pLayerdrop: 0.2), (bptt: 460), (useMask: 0), (preLayerNorm: 0)
        (33): Transformer (nHeads: 4), (pDropout: 0.2), (pLayerdrop: 0.2), (bptt: 460), (useMask: 0), (preLayerNorm: 0)
        (34): Transformer (nHeads: 4), (pDropout: 0.2), (pLayerdrop: 0.2), (bptt: 460), (useMask: 0), (preLayerNorm: 0)
        (35): Transformer (nHeads: 4), (pDropout: 0.2), (pLayerdrop: 0.2), (bptt: 460), (useMask: 0), (preLayerNorm: 0)
        (36): Transformer (nHeads: 4), (pDropout: 0.2), (pLayerdrop: 0.2), (bptt: 460), (useMask: 0), (preLayerNorm: 0)
        (37): Transformer (nHeads: 4), (pDropout: 0.2), (pLayerdrop: 0.2), (bptt: 460), (useMask: 0), (preLayerNorm: 0)
        (38): Dropout (0.200000)
        (39): Linear (1024->935) (with bias)
I0720 06:55:42.779211 84077 Train.cpp:465] [Network Params: 313567239]
I0720 06:55:42.779273 84077 Train.cpp:466] [Criterion] ConnectionistTemporalClassificationCriterion
I0720 06:55:42.779314 84077 Train.cpp:500] [Network Optimizer] SGD
I0720 06:55:42.779321 84077 Train.cpp:501] [Criterion Optimizer] SGD
I0720 06:55:42.809473 84077 Train.cpp:1034] Shuffling trainset
I0720 06:55:43.071627 84077 Train.cpp:1041] Epoch 1 started!
F0720 06:56:04.609699 84077 Train.cpp:1109] Loss has NaN values. Samples - e94f873debc7,035c4c54df26,1d048b7a6d90,71405b767d9c,5270b9b73fea
*** Check failure stack trace: ***
    @     0x7fa7309371c3  google::LogMessage::Fail()
    @     0x7fa73093c25b  google::LogMessage::SendToLog()
    @     0x7fa730936ebf  google::LogMessage::Flush()
    @     0x7fa7309376ef  google::LogMessageFatal::~LogMessageFatal()
    @     0x55e7548c7929  _ZZ4mainENKUlSt10shared_ptrIN2fl6ModuleEES_INS0_3app3asr17SequenceCriterionEES_INS0_7DatasetEES_INS0_19FirstOrderOptimizerEESA_ddblE4_clES2_S6_S8_SA_SA_ddbl
    @     0x55e7548189a5  main
    @     0x7fa71d8c80b3  __libc_start_main
    @     0x55e7548bef7e  _start
Aborted (core dumped)

Please help me with this. I tried setting lr to 0.000001; it didn't help. cc: @tlikhomanenko

tlikhomanenko commented 3 years ago

Could you try another optimizer? adagrad instead of sgd (SGD can be quite tricky for transformer models).
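
In case it helps, a minimal sketch of that change against the train.cfg posted above (both flag names appear in the gflags dump earlier in this thread; everything else in the config stays the same):

# switch the network and criterion optimizers from sgd to adagrad
--netoptim=adagrad
--critoptim=adagrad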

arthur-s commented 3 years ago

No success. I tried adagrad, and I tried 4 different datasets, including Russian LibriSpeech, and tried other methods from sota/2019: TDS, ResNet. In all cases I get "Loss has NaN values". I think I'm doing something wrong in the training data preparation (lexicon, tokens). I'll try to reproduce your results with English LibriSpeech and share my results.
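
One quick way to sanity-check the lexicon/tokens pairing (a rough sketch, assuming the formats shown above: tokens_wp.txt has one wordpiece per line, and each lexicon_wp.lst line is a word followed by whitespace-separated wordpieces):

# collect every wordpiece used in the lexicon
awk '{for (i = 2; i <= NF; i++) print $i}' lexicon_wp.lst | sort -u > /tmp/lexicon_pieces.txt
# collect the wordpieces the model actually knows about
sort -u tokens_wp.txt > /tmp/token_set.txt
# anything printed here is a wordpiece the lexicon uses but tokens_wp.txt does not define
comm -23 /tmp/lexicon_pieces.txt /tmp/token_set.txt

Empty output means every lexicon entry can be mapped onto the token set; any output points at a mismatch between the two files.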

arthur-s commented 3 years ago

@tlikhomanenko, I can see that the am.arch files for Librispeech and Librivox are quite different, e.g. this and this. Are there any guides on how to choose the values inside a layer (e.g. in TR 768 3072 4 460 0.2 0.2, the values 768, 3072, etc.) and how to choose the number of layers?

tlikhomanenko commented 3 years ago

The Librivox arch should work for Librispeech too. Mainly, you either have a deeper NN that is narrower, or a shallower NN with a larger width.
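
For what it's worth, lining up a TR line with what Train.cpp prints for each Transformer layer in the network dump above (nHeads, bptt, pDropout, pLayerdrop), the fields seem to map roughly like this; an informal reading, not official documentation:

TR 768 3072 4 460 0.2 0.2
# 768  - model / embedding dimension of the Transformer block
# 3072 - feed-forward (MLP) dimension inside the block
# 4    - number of attention heads (shown as "(nHeads: 4)" in the dump)
# 460  - bptt (shown as "(bptt: 460)" in the dump)
# 0.2  - dropout
# 0.2  - layerdrop
# the number of Transformer layers is presumably just the number of TR lines in the arch file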

joazoa commented 3 years ago

I am experiencing NaN only when using mixed precision. The same datasets work well with all other architectures, and I do not have NaN problems with mixed precision disabled.

I was able to use mixed precision before the move from w2l to flashlight. The NaN issues happen suddenly, when the model is almost fully trained.

I have tried different servers, different LR settings, and different maxgradnorm values, without success. Do you have any other suggestions I could try? (I think it's a bug, though, as it didn't happen with earlier versions.)

I am trying the slimIPL recipe now, which has an identical architecture other than the dropout; I do not have NaN values with mixed precision there so far.

Could dynamic or randdynamic be related?

arthur-s commented 3 years ago

I'm able to run training (using a GPU) and no longer get "Loss has NaN values", but my WER doesn't go down; it stays around 99-101%.

tlikhomanenko commented 3 years ago

@joazoa

I am experiencing NaN only when using mixed precision. The same datasets work well with all other architectures, and I do not have NaN problems with mixed precision disabled.

I was able to use mixed precision before the move from w2l to flashlight. The NaN issues happen suddenly, when the model is almost fully trained.

I have tried different servers, different LR settings, and different maxgradnorm values, without success. Do you have any other suggestions I could try? (I think it's a bug, though, as it didn't happen with earlier versions.)

I am trying the slimIPL recipe now, which has an identical architecture other than the dropout; I do not have NaN values with mixed precision there so far.

Could dynamic or randdynamic be related?

I also faced the issue at the end of training, but only for some models. Could be a bug in mixed precision, yep. I debugged the way we do scaling a lot - no success. What I do: run several updates with fp32 and then go back to fp16.
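
A rough sketch of that workaround in terms of the flags in this thread, assuming the usual continue mode of fl_asr_train and the --fl_amp_use_mixed_precision flag from the gflags dump above (<rundir> and the flags file are placeholders for your own experiment):

# continue from the latest checkpoint with mixed precision off for a while
fl_asr_train continue <rundir> --flagsfile=<your_flags.cfg> --fl_amp_use_mixed_precision=false
# once the loss is stable again, resume with mixed precision re-enabled
fl_asr_train continue <rundir> --flagsfile=<your_flags.cfg> --fl_amp_use_mixed_precision=true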

Thanks for trying slimIPL =) Let me know if you have any troubles with it.

@arthur-s could you post the log?

arthur-s commented 3 years ago

Sorry, the logs are already lost, @tlikhomanenko. I'll continue my experiments later (I think one of the reasons may be that the dataset I used is not accurate, and I need to find a better one).

joazoa commented 3 years ago

@mtmd Do you have any suggestions on how we could troubleshoot this? It only happens with transformer (for me at least) and happens sporadically (I have the impression mostly at low LER/WER); the same datasets work fine with conformer + mixed precision and used to work fine with wav2letter 0.2. As @tlikhomanenko says, it works to continue training without mixed precision for a while and then re-enable it. I also think that in my case it depends on the previously saved model, e.g. once it gives a NaN, no amount of retries in mixed precision will resolve the issue.

mtmd commented 3 years ago

@joazoa I might be able to help. However, I need to reproduce it first. Can you please provide detailed instructions (+ the corresponding recipe) for reproducing the issue?

joazoa commented 3 years ago

@mtmd I think I may be able to send you a saved model that will cause a NaN on the next iteration (I will need a couple of days to make one, though). Can I PM you with a download link?

joazoa commented 3 years ago

So far I have not been able to reproduce it again, but I am running the branch that has the decoder batching. I will revert and try again.

joazoa commented 3 years ago

I think I may have reproduced it with streaming convnets (not 100% sure as it is still in warmup):

I0922 14:46:53.018287 75496 Train.cpp:704] epoch: 3 | nupdates: 33000 | lr: 0.330000 | lrcriterion: 0.000000 | scale-factor: 6296.817665 | runtime: 00:10:13 | bch(ms): 613.91 | smp(ms): 0.00 | fwd(ms): 143.35 | crit-fwd(ms): 3.98 | bwd(ms): 426.95 | optim(ms): 10.01 | loss: 23.92922 | train-TER: 69.89 | train-WER: 88.73 | DS1-loss: 13.97142 | DS1-TER: 55.34 | DS1-WER: 78.76 | DS2.lst-loss: 16.55996 | DS2-TER: 57.94 | DS2-WER: 82.89 | DS3-loss: 15.30655 | DS3-TER: 45.79 | DS3-WER: 86.61 | avg-isz: 1165 | avg-tsz: 161 | max-tsz: 200331204 | avr-batchsz: 59.46 | hrs: 1540.70 | thrpt(sec/sec): 9034.81 | timestamp: 2021-09-22 14:46:53 | avgTER: 53.02
Memory Manager Stats
Type: CachingMemoryManager
Device: 0, Capacity: 23.69 GiB, Allocated: 19.36 GiB, Cached: 18.33 GiB
Total native calls: 395(mallocs), 0(frees)
I0922 14:57:21.641367 75496 Train.cpp:704] epoch: 3 | nupdates: 34000 | lr: 0.340000 | lrcriterion: 0.000000 | scale-factor: 8296.817665 | runtime: 00:10:16 | bch(ms): 616.16 | smp(ms): 0.00 | fwd(ms): 143.45 | crit-fwd(ms): 4.21 | bwd(ms): 427.42 | optim(ms): 9.89 | loss: 24.60810 | train-TER: 69.04 | train-WER: 88.41 | DS1-loss: 23.01042 | DS1-TER: 98.13 | DS1-WER: 100.00 | DS2-loss: 22.02237 | DS2-TER: 97.91 | DS2-WER: 100.00 | DS3-loss: 23.12349 | DS3-TER: 96.79 | DS3-WER: 100.00 | avg-isz: 1230 | avg-tsz: 171 | max-tsz: 189995864 | avr-batchsz: 56.33 | hrs: 1539.93 | thrpt(sec/sec): 8997.16 | timestamp: 2021-09-22 14:57:21 | avgTER: 97.61
Memory Manager Stats
Type: CachingMemoryManager
Device: 0, Capacity: 23.69 GiB, Allocated: 19.36 GiB, Cached: 18.33 GiB
Total native calls: 395(mallocs), 0(frees)

When I continue with mixed precision enabled, I get NaN before the next reportiters. When I continue with mixed precision disabled, the TER/WER recovers gradually. Continuing the model that was saved before it hit a TER of 100, with mixed precision enabled, works as well.

joazoa commented 3 years ago

@mtmd Last night it happened for me on the rasr large transformer. The dataset should not matter.

I did one run with reportiters=1 while on mixedprecision=false and then reverted back to mixed precision without issues (12 hours since). I could give you the model that was saved just before that (I think the bad state is saved somehow). Update: it does not always fix it with one reportiter without mixed precision :/

When I continue the broken model with mixed precision enabled, I can get it to give NaN values within ~20 reportiters of 1.

mtmd commented 3 years ago

@joazoa Thank you for sharing all these details. I am interested in reproducing this bug, and that's the first step for fixing it anyway.

Can I PM you with a download link?

Unfortunately, I cannot work with a model that's not in the public domain.

Have you tried reproducing it with a public dataset?

joazoa commented 3 years ago

I will try to make a checkpoint with only public data and get back to you as soon as I get it to happen (training is sometimes stable for days, so it could take a while). @mtmd In case you have access to more / faster hardware, I think continuing the 300m rasr on Librispeech with small reportiters will trigger it.

joonaskriisk commented 3 years ago

Hi, hijacking this thread with the same error, as I believe/think/hope the root cause is the same.

Using Dockerfile-CUDA as the Flashlight installation, together with the FinetuneCTC data. Finetuning exactly as done in the notebook works perfectly, while trying to train from scratch fails with "Loss has NaN values".

Example: Input:

root@022ec3f65e04:/data/finetune_ctc# /root/flashlight/build/bin/asr/fl_asr_train train \
> --datadir ami_limited_supervision \
> --train train_10min_0.lst,train_10min_1.lst,train_10min_2.lst,train_10min_3.lst,train_10min_4.lst,train_10min_5.lst,train_9hr.lst \
> --valid dev:dev.lst \
> --arch arch.txt \
> --tokens tokens.txt \
> --lexicon lexicon.txt \
> --rundir checkpoint \
> --lr 0.025 \
> --netoptim sgd \
> --momentum 0.8 \
> --reportiters 1000 \
> --lr_decay 100 \
> --lr_decay_step 50 \
> --iter 25000 \
> --batchsize 4 \
> --warmup 0

Output:

I1028 19:35:30.606153 63008 CachingMemoryManager.cpp:114 CachingMemoryManager recyclingSizeLimit_=18446744073709551615 (16777216.00 TiB) splitSizeLimit_=18446744073709551615 (16777216.00 TiB)
F1028 19:35:50.326356    22 Train.cpp:1116] Loss has NaN values. Samples - TS3009b_H01_MTD035UID_1103.75_1106.7,ES2005c_H03_FEE019_295.49_298.44,IS1003d_H03_MIO023_291.12_294.07,ES2016d_H02_MEE063_316.92_319.87
*** Check failure stack trace: ***
    @     0x7f69e02041c3  google::LogMessage::Fail()
    @     0x7f69e020925b  google::LogMessage::SendToLog()
    @     0x7f69e0203ebf  google::LogMessage::Flush()
    @     0x7f69e02046ef  google::LogMessageFatal::~LogMessageFatal()
    @     0x55e974785409  _ZZ4mainENKUlSt10shared_ptrIN2fl6ModuleEES_INS0_3app3asr17SequenceCriterionEES_INS0_7DatasetEES_INS0_19FirstOrderOptimizerEESA_ddblE4_clES2_S6_S8_SA_SA_ddbl
    @     0x55e9746d4d52  main
    @     0x7f69748e40b3  __libc_start_main
    @     0x55e97477cbee  _start
Aborted (core dumped)

Any clue? I've also tried using my own data with my own training parameters and arch file, but the error message is the same.

joazoa commented 3 years ago

I don't think this is the same issue; it's easy to verify by disabling mixed precision. If everything is fine without mixed precision, then the issue is the same.

As this seems to happen pretty much immediately for you, I wonder if you shouldn't reduce the learning rate or add a warmup period instead.
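
For example, relative to the fl_asr_train command above, one might change just these flags (the values are only illustrative starting points, not numbers from the recipe):

# lower the learning rate and add a warmup period
--lr 0.01 \
--warmup 8000 \
# and, to rule out the mixed-precision issue discussed earlier in the thread:
--fl_amp_use_mixed_precision false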