arthur-s opened this issue 3 years ago
Could you try another optimizer? Use adagrad instead of sgd (SGD can be much trickier for transformer models).
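(For reference, a minimal sketch of swapping the optimizer via the CLI, assuming the same fl_asr_train binary and flags that appear later in this thread; check fl_asr_train --help for the --netoptim values your build accepts.)

fl_asr_train train \
  --netoptim adagrad \
  <rest of your usual flags: --train, --valid, --arch, --tokens, --lexicon, --lr, --rundir, ...>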
No success. I tried adagrad, and I tried 4 different datasets, including Russian LibriSpeech, and I also tried other methods from sota/2019: TDS and ResNet. In all cases I get "Loss has NaN values".
I think I am doing something wrong in the training data preparation (lexicon, tokens). I'll try to reproduce your results with English LibriSpeech and share my results.
The Librivox arch should work for Librispeech too. The main difference is that you either have a deeper, narrower NN or a shallower, wider one.
I am experiencing NaN values only when using mixed precision. The same datasets work well with all other architectures, and I do not have NaN problems with mixed precision disabled.
I was able to use mixed precision before the move from w2l to flashlight. The NaN issues happen suddenly, when the model is almost completely trained.
I have tried different servers, different LR settings, and different maxgradnorm values, without success. Do you have any other suggestions I could try? (I think it's a bug, though, as it didn't happen with earlier versions.)
I am trying the slimIPL recipe now, which has an identical architecture other than the dropout; I do not have NaN values with mixed precision there so far.
Could dynamic or randdynamic be related?
I'm able to run training (using GPU) and don't get "Loss has NaN values" anymore, but my WER doesn't go down; it stays around 99-101%.
@joazoa
I also faced the issue at the end of training, but only for some models. It could be a bug in mixed precision, yep. I debugged the way we do scaling a lot, with no success. What I do is run several updates with fp32 and then go back to fp16.
Thanks for trying slimIPL =) Let me know if you have any troubles with it.
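(In case it helps others hitting this: a rough sketch of that workaround from the command line, assuming your build exposes the mixed-precision switch as --fl_amp_use_mixed_precision and that fl_asr_train's continue mode takes the run directory; please verify both against fl_asr_train --help. The reportiters value mirrors what was tried later in this thread.)

# run a handful of updates in fp32 from the last checkpoint
fl_asr_train continue <rundir> --fl_amp_use_mixed_precision=false --reportiters 1

# then switch back to mixed precision (fp16)
fl_asr_train continue <rundir> --fl_amp_use_mixed_precision=true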
@arthur-s could you post the log?
Sorry @tlikhomanenko, the logs are already lost. I'll continue my experiments later (I think one of the reasons may be that the dataset I used is not accurate, and I need to find a better one).
@mtmd Do you have any suggestions on how we could troubleshoot this? It only happens with the transformer (for me at least), and it happens sporadically (I have the impression mostly with low LER/WER); the same datasets work fine with the conformer + mixed precision and used to work fine with wav2letter 0.2. As @tlikhomanenko says, it works to continue training without mixed precision for a while and then re-enable it. I also think that, in my case, it depends on the previously saved model. E.g. once it gives a NaN, no amount of retries in mixed precision will resolve the issue.
@joazoa I might be able to help. However, I need to reproduce it first. Can you please provide detailed instructions (+ the corresponding recipe) for reproducing the issue?
@mtmd I think I may be able to send you a saved model that will cause a NaN on the next iteration (I will need a couple of days to make one, though). Can I PM you with a download link?
So far I have not been able to reproduce it again, but I am running the branch that has the decoder batching. I will revert and try again.
I think I may have reproduced it with streaming convnets (not 100% sure as it is still in warmup):

I0922 14:46:53.018287 75496 Train.cpp:704] epoch: 3 | nupdates: 33000 | lr: 0.330000 | lrcriterion: 0.000000 | scale-factor: 6296.817665 | runtime: 00:10:13 | bch(ms): 613.91 | smp(ms): 0.00 | fwd(ms): 143.35 | crit-fwd(ms): 3.98 | bwd(ms): 426.95 | optim(ms): 10.01 | loss: 23.92922 | train-TER: 69.89 | train-WER: 88.73 | DS1-loss: 13.97142 | DS1-TER: 55.34 | DS1-WER: 78.76 | DS2-loss: 16.55996 | DS2-TER: 57.94 | DS2-WER: 82.89 | DS3-loss: 15.30655 | DS3-TER: 45.79 | DS3-WER: 86.61 | avg-isz: 1165 | avg-tsz: 161 | max-tsz: 200331204 | avr-batchsz: 59.46 | hrs: 1540.70 | thrpt(sec/sec): 9034.81 | timestamp: 2021-09-22 14:46:53 | avgTER: 53.02
Memory Manager Stats
Type: CachingMemoryManager
Device: 0, Capacity: 23.69 GiB, Allocated: 19.36 GiB, Cached: 18.33 GiB
Total native calls: 395(mallocs), 0(frees)
I0922 14:57:21.641367 75496 Train.cpp:704] epoch: 3 | nupdates: 34000 | lr: 0.340000 | lrcriterion: 0.000000 | scale-factor: 8296.817665 | runtime: 00:10:16 | bch(ms): 616.16 | smp(ms): 0.00 | fwd(ms): 143.45 | crit-fwd(ms): 4.21 | bwd(ms): 427.42 | optim(ms): 9.89 | loss: 24.60810 | train-TER: 69.04 | train-WER: 88.41 | DS1-loss: 23.01042 | DS1-TER: 98.13 | DS1-WER: 100.00 | DS2-loss: 22.02237 | DS2-TER: 97.91 | DS2-WER: 100.00 | DS3-loss: 23.12349 | DS3-TER: 96.79 | DS3-WER: 100.00 | avg-isz: 1230 | avg-tsz: 171 | max-tsz: 189995864 | avr-batchsz: 56.33 | hrs: 1539.93 | thrpt(sec/sec): 8997.16 | timestamp: 2021-09-22 14:57:21 | avgTER: 97.61
Memory Manager Stats
Type: CachingMemoryManager
Device: 0, Capacity: 23.69 GiB, Allocated: 19.36 GiB, Cached: 18.33 GiB
Total native calls: 395(mallocs), 0(frees)
When I continue with mixed precision enabled, I get NaN before the next reportiters. When I continue with mixed precision disabled, the TER/WER recovers gradually. Continuing from the model that was saved before it had a TER of 100, with mixed precision enabled, works as well.
@mtmd Last night it happened for me on the rasr large transformer. The dataset should not matter.
I did one run with reportiters=1 with mixed precision disabled and then reverted back to mixed precision without issues (12 hours since). I could give you the model that was saved just before that (I think the bad state is saved somehow). Update: one reportiter without mixed precision does not always fix it :/
When I continue the broken model with mixed precision enabled, I can get it to produce NaN values within ~20 reportiters of 1.
@joazoa Thank you for sharing all these details. I am interested in reproducing this bug, and that's the first step toward fixing it anyway.
Can I PM you with a download link?
Unfortunately, I cannot work with a model that's not in the public domain.
Have you tried reproducing it with a public dataset?
I will try to make a checkpoint with only public data and get back to you as soon as I get it to happen. (Training is sometimes stable for days, so it could take a while.) @mtmd In case you have access to more / faster hardware, I think continuing the 300m rasr model with small reportiters on Librispeech will trigger it.
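(A rough sketch of that kind of repro run, assuming fl_asr_train's fork mode takes a checkpoint path the way the FinetuneCTC tutorial uses it; the checkpoint filename is a placeholder.)

fl_asr_train fork <rasr_300m_transformer_checkpoint.bin> \
  --reportiters 1 \
  <rest of the rasr recipe flags, with mixed precision enabled>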
Hi, I'm hijacking this thread with the same error, as I believe/hope the root cause is the same.
I'm using Dockerfile-CUDA for the Flashlight installation, together with the FinetuneCTC data. Finetuning exactly as done in the notebook works perfectly, while trying to train from scratch fails with "Loss has NaN values".
Example:
Input:
root@022ec3f65e04:/data/finetune_ctc# /root/flashlight/build/bin/asr/fl_asr_train train \
> --datadir ami_limited_supervision \
> --train train_10min_0.lst,train_10min_1.lst,train_10min_2.lst,train_10min_3.lst,train_10min_4.lst,train_10min_5.lst,train_9hr.lst \
> --valid dev:dev.lst \
> --arch arch.txt \
> --tokens tokens.txt \
> --lexicon lexicon.txt \
> --rundir checkpoint \
> --lr 0.025 \
> --netoptim sgd \
> --momentum 0.8 \
> --reportiters 1000 \
> --lr_decay 100 \
> --lr_decay_step 50 \
> --iter 25000 \
> --batchsize 4 \
> --warmup 0
Output:
I1028 19:35:30.606153 63008 CachingMemoryManager.cpp:114 CachingMemoryManager recyclingSizeLimit_=18446744073709551615 (16777216.00 TiB) splitSizeLimit_=18446744073709551615 (16777216.00 TiB)
F1028 19:35:50.326356 22 Train.cpp:1116] Loss has NaN values. Samples - TS3009b_H01_MTD035UID_1103.75_1106.7,ES2005c_H03_FEE019_295.49_298.44,IS1003d_H03_MIO023_291.12_294.07,ES2016d_H02_MEE063_316.92_319.87
*** Check failure stack trace: ***
@ 0x7f69e02041c3 google::LogMessage::Fail()
@ 0x7f69e020925b google::LogMessage::SendToLog()
@ 0x7f69e0203ebf google::LogMessage::Flush()
@ 0x7f69e02046ef google::LogMessageFatal::~LogMessageFatal()
@ 0x55e974785409 _ZZ4mainENKUlSt10shared_ptrIN2fl6ModuleEES_INS0_3app3asr17SequenceCriterionEES_INS0_7DatasetEES_INS0_19FirstOrderOptimizerEESA_ddblE4_clES2_S6_S8_SA_SA_ddbl
@ 0x55e9746d4d52 main
@ 0x7f69748e40b3 __libc_start_main
@ 0x55e97477cbee _start
Aborted (core dumped)
Any clue? I've also tried using my own data with my own training parameters and arch file, but the error message is the same.
I don't think this is the same issue; it's easy to verify by disabling mixed precision. If everything is fine without mixed precision, then it is the same issue.
As this seems to be happening pretty much immediately for you, I wonder whether you should reduce the learning rate or add a warmup period instead.
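(For example, something along these lines, reusing the flags from the command above; the values are only illustrative, not tuned.)

/root/flashlight/build/bin/asr/fl_asr_train train \
  <same flags as above> \
  --lr 0.01 \
  --warmup 8000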
Hello, I am trying to create a model using
public_series_1
from the Russian open_stt dataset. I use this recipe (transformer-ctc) and train on CPU. My data uses wordpieces. The arch file is https://github.com/flashlight/wav2letter/blob/master/recipes/sota/2019/am_arch/am_transformer_ctc.arch
Train file:
Train log:
Please help me with this. I tried setting lr to 0.000001; it didn't help. cc: @tlikhomanenko