flashlight / wav2letter

Facebook AI Research's Automatic Speech Recognition Toolkit
https://github.com/facebookresearch/wav2letter/wiki

specAugment with Transformer #762

Closed rajeevbaalwan closed 4 years ago

rajeevbaalwan commented 4 years ago

I am trying to achieve some good results on Libri 100 Hour data using transformer + CTC architecture provided in https://github.com/facebookresearch/wav2letter/blob/master/recipes/models/sota/2019/librispeech/train_am_transformer_ctc.cfg

I have found this information in doc for sota Transformer model.

Transformer CTC training: The model is trained with a total batch size of 128 for approximately 320 epochs with Adadelta. There is a warmup stage: SpecAugment is activated only after warmup, and the learning rate is warmed up (linearly increased) over the first 32000 updates to 0.4. It is then divided by 2 at epoch 180, and then every 40 epochs. The last 10 epochs are done with lr=0.001.

I have a few doubts based on the above paragraph:

  1. Why are the spec_aug and warmup flags missing from the provided config?
  2. Is there a way to decide how many epochs are enough for warmup based on dataset size? Is there any doc about SpecAugment with wav2letter?
  3. For the SOTA recipe, was the warmup stage done for 32000 updates, with SpecAugment then used for the rest of training until the end?
  4. For the complete LibriSpeech data with a total batch size of 128, the number of updates in 1 epoch is ~2197, so the warmup stage spans ~15 epochs (32000 updates) and the LR reaches 0.4 at that point? And the LR between epochs 15-180 stays at 0.4?

I've found 2 params which might be relevant for warmup and SpecAugment:

(warmup, 1, "the LR warmup parameter, in updates")
(saug_start_update, -1, "Use SpecAugment starting at the update number inputted. -1 means no SpecAugment")

  1. What value do I need to specify for the warmup param? Is it the number of updates to warm up over, i.e. 32000?

  2. If I want to start SpecAugment at update 32000, then setting 32000 as saug_start_update's value works, right?

tlikhomanenko commented 4 years ago
  • Why are the spec_aug and warmup flags missing from the provided config?

We released support for warmup and SpecAugment for the transformer a bit later than the recipe itself. Next week we are planning to release all models and other materials from the latest version of the paper.

  • Is there a way to decide how many epochs are enough for warmup based on dataset size? Is there any doc about SpecAugment with wav2letter?

There is no recipe for the number of warmup epochs. You can set a small warmup and check whether the model blows up or not; this also depends on the learning rate. If it blows up, increase the warmup. Regarding SpecAugment, what documentation are you asking about? Right now we support it from the command line without adding a layer to the model: https://github.com/facebookresearch/wav2letter/blob/master/src/common/Defines.cpp#L90 defines at which update SpecAugment starts being used (the SpecAugment settings are at https://github.com/facebookresearch/wav2letter/blob/master/src/common/Defines.cpp#L136).

  • For the SOTA recipe, was the warmup stage done for 32000 updates, with SpecAugment then used for the rest of training until the end?

Yep, after warmup is done we turn on SpecAugment and use it until the end of training.

4. For the complete LibriSpeech data with a total batch size of 128, the number of updates in 1 epoch is ~2197, so the warmup stage spans ~15 epochs (32000 updates) and the LR reaches 0.4 at that point? And the LR between epochs 15-180 stays at 0.4?

Yep, all correct. During warmup the LR is increased linearly from 0 to 0.4, then it is held constant at 0.4, and then we divide it by 2 on a schedule that depends on the model.
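
For anyone reconstructing the schedule, here is a minimal Python sketch of what is described above and in the recipe doc: linear warmup, a constant phase, a halving every lr_decay_step epochs starting at lr_decay, and a small constant LR for the final epochs. It is only an illustration of the described recipe, not the actual Train.cpp logic, and the exact boundary epochs are assumptions:

def lr_at(update, epoch, base_lr=0.4, warmup=32000,
          lr_decay=180, lr_decay_step=40,
          final_lr=0.001, final_epoch=310):
    """Illustrative LR schedule: warmup, plateau, halvings, final fine-tune."""
    if update < warmup:
        # Linear warmup over the first `warmup` updates.
        return base_lr * update / warmup
    if epoch >= final_epoch:
        # Last ~10 epochs use a very small constant LR.
        return final_lr
    if epoch < lr_decay:
        # Constant LR between the end of warmup and the first decay epoch.
        return base_lr
    # Halve once at `lr_decay`, then again every `lr_decay_step` epochs.
    n_halvings = 1 + (epoch - lr_decay) // lr_decay_step
    return base_lr / (2 ** n_halvings)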

I've found 2 params which might be relevant for warmup and SpecAugment:

To reproduce the training, you found the correct parameters (see the update to the recipe this week too): yep, just set --warmup=32000 --saug_start_update=32000, which means warm up for 32k updates and then start SpecAugment.

rajeevbaalwan commented 4 years ago

@tlikhomanenko Thanks for the clarifications. I tried with those parameters and it's working. But what do you mean by the model blowing up: the loss exploding during training, or right after the warmup stage? I will definitely check out the updates to the recipe this week.

tlikhomanenko commented 4 years ago

The model can explode during warmup (this applies to transformers).

rajeevbaalwan commented 4 years ago

Thanks @tlikhomanenko. I'll close this issue for now and will reopen it if any issue occurs.

rajeevbaalwan commented 4 years ago

@tlikhomanenko Can SpecAugment work with wav2vec representations instead of spectrograms?

Also, regarding this doc:

Transformer CTC training: The model is trained with a total batch size of 128 for approximately 320 epochs with Adadelta. There is a warmup stage: SpecAugment is activated only after warmup, and the learning rate is warmed up (linearly increased) over the first 32000 updates to 0.4. It is then divided by 2 at epoch 180, and then every 40 epochs. The last 10 epochs are done with lr=0.001.

Now, as per the doc, after epoch 180 the LR is divided by 2 every 40 epochs. If I work it out that way:
LR is 0.4 until epoch 180
LR is 0.2 from epochs 181-220
LR is 0.1 from epochs 221-260
LR is 0.05 from epochs 261-300
The last 10 epochs are at LR 0.001, i.e. epochs 310 to 320.

But what about epochs 300-310? Something seems to be missing from the documentation. Can you verify?

tlikhomanenko commented 4 years ago

Can SpecAugment work with wav2vec representations instead of spectrograms?

Yep, sure, it doesn't matter what features you have here.
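
To make this concrete, here is a rough NumPy sketch of SpecAugment-style masking applied to an arbitrary 2-D feature matrix (time x feature dimension); whether the features are filterbanks or wav2vec representations makes no difference to the operation. The mask counts and widths below are illustrative values, not wav2letter's flag defaults:

import numpy as np

def mask_features(feats, n_freq_masks=2, max_freq_width=27,
                  n_time_masks=2, max_time_width=100, seed=None):
    """Zero out random feature bands and time spans, SpecAugment-style.

    `feats` has shape (time, feature_dim); the feature type does not matter.
    """
    rng = np.random.default_rng(seed)
    out = feats.copy()
    n_frames, n_dims = out.shape
    for _ in range(n_freq_masks):
        width = int(rng.integers(0, max_freq_width + 1))
        start = int(rng.integers(0, max(1, n_dims - width + 1)))
        out[:, start:start + width] = 0.0   # mask a band of feature dimensions
    for _ in range(n_time_masks):
        width = int(rng.integers(0, max_time_width + 1))
        start = int(rng.integers(0, max(1, n_frames - width + 1)))
        out[start:start + width, :] = 0.0   # mask a span of frames
    return out

# e.g. 500 frames of hypothetical 768-dimensional wav2vec-like features
augmented = mask_features(np.random.randn(500, 768), seed=0)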

But what about epochs 300-310? Something seems to be missing from the documentation. Can you verify?

Epochs 300-310 are done with 0.025. We just stopped the model when we didn't see improvement and tried to fine-tune a bit with a very small LR at the end.
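
Plugging a few epochs into the illustrative lr_at() sketch posted earlier in this thread reproduces these values (the exact boundary epochs are approximate):

for epoch in (179, 180, 220, 260, 300, 310):
    print(epoch, lr_at(update=10**6, epoch=epoch))
# 179: 0.4, 180: 0.2, 220: 0.1, 260: 0.05, 300: 0.025, 310: 0.001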

xiaosdawn commented 4 years ago

Hi, sorry for bothering you. I trained the transformer_ctc model following sota/2019. The 001_log is as follows:

epoch:        1 | nupdates:        35146 | lr: 0.400000 | lrcriterion: 0.400000 | runtime: 03:20:12 | bch(ms): 341.80 | smp(ms): 1.35 | fwd(ms): 115.18 | crit-fwd(ms): 8.68 | bwd(ms): 177.97 | optim(ms): 46.81 | loss:   40.02281 | train-TER: 88.82 | train-WER: 93.16 | dev-clean-loss:   20.26408 | dev-clean-TER: 60.21 | dev-clean-WER: 73.67 | dev-other-loss:   21.10677 | dev-other-TER: 65.98 | dev-other-WER: 79.89 | avg-isz: 1229 | avg-tsz: 040 | max-tsz: 101 | hrs:  960.40 | thrpt(sec/sec): 287.81
epoch:        2 | nupdates:        70292 | lr: 0.400000 | lrcriterion: 0.400000 | runtime: 03:19:30 | bch(ms): 340.60 | smp(ms): 1.27 | fwd(ms): 114.95 | crit-fwd(ms): 8.69 | bwd(ms): 177.65 | optim(ms): 46.29 | loss:   31.18885 | train-TER: 63.83 | train-WER: 78.63 | dev-clean-loss:   12.04277 | dev-clean-TER: 30.57 | dev-clean-WER: 46.37 | dev-other-loss:   14.23059 | dev-other-TER: 40.01 | dev-other-WER: 57.63 | avg-isz: 1229 | avg-tsz: 040 | max-tsz: 100 | hrs:  960.40 | thrpt(sec/sec): 288.82
epoch:        3 | nupdates:       105438 | lr: 0.400000 | lrcriterion: 0.400000 | runtime: 03:19:44 | bch(ms): 340.98 | smp(ms): 1.31 | fwd(ms): 114.94 | crit-fwd(ms): 8.65 | bwd(ms): 177.68 | optim(ms): 46.59 | loss:   23.19176 | train-TER: 44.56 | train-WER: 60.22 | dev-clean-loss:    7.79196 | dev-clean-TER: 16.91 | dev-clean-WER: 28.13 | dev-other-loss:   10.19768 | dev-other-TER: 26.75 | dev-other-WER: 41.08 | avg-isz: 1229 | avg-tsz: 040 | max-tsz: 101 | hrs:  960.40 | thrpt(sec/sec): 288.50
epoch:        4 | nupdates:       140584 | lr: 0.400000 | lrcriterion: 0.400000 | runtime: 03:19:29 | bch(ms): 340.57 | smp(ms): 1.30 | fwd(ms): 114.77 | crit-fwd(ms): 8.64 | bwd(ms): 177.41 | optim(ms): 46.62 | loss:   19.02595 | train-TER: 35.44 | train-WER: 50.09 | dev-clean-loss:    6.34031 | dev-clean-TER: 12.36 | dev-clean-WER: 22.22 | dev-other-loss:    8.79135 | dev-other-TER: 21.42 | dev-other-WER: 34.57 | avg-isz: 1229 | avg-tsz: 040 | max-tsz: 099 | hrs:  960.40 | thrpt(sec/sec): 288.85
epoch:        5 | nupdates:       175730 | lr: 0.400000 | lrcriterion: 0.400000 | runtime: 03:19:14 | bch(ms): 340.15 | smp(ms): 1.29 | fwd(ms): 114.67 | crit-fwd(ms): 8.63 | bwd(ms): 177.21 | optim(ms): 46.52 | loss:   16.96566 | train-TER: 31.50 | train-WER: 45.09 | dev-clean-loss:    5.44645 | dev-clean-TER: 10.42 | dev-clean-WER: 19.35 | dev-other-loss:    7.79601 | dev-other-TER: 18.52 | dev-other-WER: 30.30 | avg-isz: 1229 | avg-tsz: 040 | max-tsz: 101 | hrs:  960.40 | thrpt(sec/sec): 289.21
epoch:        6 | nupdates:       210876 | lr: 0.400000 | lrcriterion: 0.400000 | runtime: 03:19:08 | bch(ms): 339.97 | smp(ms): 1.29 | fwd(ms): 114.66 | crit-fwd(ms): 8.62 | bwd(ms): 177.19 | optim(ms): 46.38 | loss:   15.67413 | train-TER: 27.86 | train-WER: 40.86 | dev-clean-loss:    4.75529 | dev-clean-TER:  8.82 | dev-clean-WER: 17.38 | dev-other-loss:    7.03329 | dev-other-TER: 16.09 | dev-other-WER: 28.12 | avg-isz: 1229 | avg-tsz: 040 | max-tsz: 103 | hrs:  960.40 | thrpt(sec/sec): 289.36
epoch:        7 | nupdates:       246022 | lr: 0.400000 | lrcriterion: 0.400000 | runtime: 03:19:15 | bch(ms): 340.16 | smp(ms): 1.28 | fwd(ms): 114.65 | crit-fwd(ms): 8.62 | bwd(ms): 177.28 | optim(ms): 46.49 | loss:   14.89369 | train-TER: 26.17 | train-WER: 38.77 | dev-clean-loss:    4.76659 | dev-clean-TER:  8.09 | dev-clean-WER: 16.06 | dev-other-loss:    6.97389 | dev-other-TER: 15.56 | dev-other-WER: 26.65 | avg-isz: 1229 | avg-tsz: 040 | max-tsz: 100 | hrs:  960.40 | thrpt(sec/sec): 289.20
epoch:        8 | nupdates:       281168 | lr: 0.400000 | lrcriterion: 0.400000 | runtime: 03:17:52 | bch(ms): 337.80 | smp(ms): 1.12 | fwd(ms): 114.25 | crit-fwd(ms): 8.59 | bwd(ms): 176.61 | optim(ms): 45.42 | loss:   14.25666 | train-TER: 24.98 | train-WER: 37.12 | dev-clean-loss:    4.41068 | dev-clean-TER:  7.34 | dev-clean-WER: 14.95 | dev-other-loss:    6.37601 | dev-other-TER: 14.15 | dev-other-WER: 25.01 | avg-isz: 1229 | avg-tsz: 040 | max-tsz: 101 | hrs:  960.40 | thrpt(sec/sec): 291.22
epoch:        9 | nupdates:       316314 | lr: 0.400000 | lrcriterion: 0.400000 | runtime: 03:20:07 | bch(ms): 341.65 | smp(ms): 1.30 | fwd(ms): 114.91 | crit-fwd(ms): 8.64 | bwd(ms): 177.61 | optim(ms): 47.35 | loss:   13.73204 | train-TER: 24.32 | train-WER: 36.40 | dev-clean-loss:    4.19915 | dev-clean-TER:  7.00 | dev-clean-WER: 14.32 | dev-other-loss:    6.25162 | dev-other-TER: 13.64 | dev-other-WER: 24.17 | avg-isz: 1229 | avg-tsz: 040 | max-tsz: 100 | hrs:  960.40 | thrpt(sec/sec): 287.94
epoch:       10 | nupdates:       351460 | lr: 0.400000 | lrcriterion: 0.400000 | runtime: 03:21:06 | bch(ms): 343.31 | smp(ms): 1.36 | fwd(ms): 115.20 | crit-fwd(ms): 8.66 | bwd(ms): 178.11 | optim(ms): 48.13 | loss:   13.41283 | train-TER: 22.96 | train-WER: 34.70 | dev-clean-loss:    4.29421 | dev-clean-TER:  6.65 | dev-clean-WER: 13.88 | dev-other-loss:    6.21255 | dev-other-TER: 13.01 | dev-other-WER: 23.41 | avg-isz: 1229 | avg-tsz: 040 | max-tsz: 099 | hrs:  960.40 | thrpt(sec/sec): 286.54
epoch:       11 | nupdates:       386606 | lr: 0.400000 | lrcriterion: 0.400000 | runtime: 03:18:43 | bch(ms): 339.27 | smp(ms): 1.17 | fwd(ms): 114.52 | crit-fwd(ms): 8.61 | bwd(ms): 177.03 | optim(ms): 46.12 | loss:   13.27797 | train-TER: 22.35 | train-WER: 33.87 | dev-clean-loss:    4.29342 | dev-clean-TER:  6.63 | dev-clean-WER: 13.60 | dev-other-loss:    6.15515 | dev-other-TER: 12.98 | dev-other-WER: 23.13 | avg-isz: 1229 | avg-tsz: 040 | max-tsz: 105 | hrs:  960.40 | thrpt(sec/sec): 289.96
epoch:       12 | nupdates:       421752 | lr: 0.400000 | lrcriterion: 0.400000 | runtime: 03:17:34 | bch(ms): 337.29 | smp(ms): 1.07 | fwd(ms): 114.16 | crit-fwd(ms): 8.59 | bwd(ms): 176.52 | optim(ms): 45.13 | loss:   13.26192 | train-TER: 22.40 | train-WER: 34.09 | dev-clean-loss:    4.14542 | dev-clean-TER:  6.26 | dev-clean-WER: 13.42 | dev-other-loss:    6.14282 | dev-other-TER: 12.63 | dev-other-WER: 23.10 | avg-isz: 1229 | avg-tsz: 040 | max-tsz: 106 | hrs:  960.40 | thrpt(sec/sec): 291.66
epoch:       13 | nupdates:       456898 | lr: 0.400000 | lrcriterion: 0.400000 | runtime: 03:17:34 | bch(ms): 337.30 | smp(ms): 1.08 | fwd(ms): 114.19 | crit-fwd(ms): 8.60 | bwd(ms): 176.50 | optim(ms): 45.14 | loss:   13.19533 | train-TER: 22.31 | train-WER: 33.97 | dev-clean-loss:    4.30802 | dev-clean-TER:  6.22 | dev-clean-WER: 13.13 | dev-other-loss:    6.38784 | dev-other-TER: 12.45 | dev-other-WER: 22.41 | avg-isz: 1229 | avg-tsz: 040 | max-tsz: 101 | hrs:  960.40 | thrpt(sec/sec): 291.65
epoch:       14 | nupdates:       492044 | lr: 0.400000 | lrcriterion: 0.400000 | runtime: 03:18:32 | bch(ms): 338.95 | smp(ms): 1.16 | fwd(ms): 114.42 | crit-fwd(ms): 8.61 | bwd(ms): 176.96 | optim(ms): 45.98 | loss:   13.15719 | train-TER: 22.51 | train-WER: 34.24 | dev-clean-loss:    4.32588 | dev-clean-TER:  6.19 | dev-clean-WER: 12.97 | dev-other-loss:    6.17860 | dev-other-TER: 12.17 | dev-other-WER: 22.02 | avg-isz: 1229 | avg-tsz: 040 | max-tsz: 104 | hrs:  960.40 | thrpt(sec/sec): 290.23
epoch:       15 | nupdates:       527190 | lr: 0.400000 | lrcriterion: 0.400000 | runtime: 03:18:33 | bch(ms): 338.96 | smp(ms): 1.17 | fwd(ms): 114.44 | crit-fwd(ms): 8.61 | bwd(ms): 176.95 | optim(ms): 45.98 | loss:   13.05351 | train-TER: 21.52 | train-WER: 33.13 | dev-clean-loss:    4.28696 | dev-clean-TER:  5.98 | dev-clean-WER: 12.71 | dev-other-loss:    6.10256 | dev-other-TER: 11.96 | dev-other-WER: 21.82 | avg-isz: 1229 | avg-tsz: 040 | max-tsz: 105 | hrs:  960.40 | thrpt(sec/sec): 290.22
epoch:       16 | nupdates:       562336 | lr: 0.400000 | lrcriterion: 0.400000 | runtime: 03:18:28 | bch(ms): 338.84 | smp(ms): 1.15 | fwd(ms): 114.40 | crit-fwd(ms): 8.62 | bwd(ms): 176.94 | optim(ms): 45.92 | loss:   12.95347 | train-TER: 20.89 | train-WER: 32.18 | dev-clean-loss:    4.26626 | dev-clean-TER:  5.84 | dev-clean-WER: 12.40 | dev-other-loss:    6.09985 | dev-other-TER: 11.78 | dev-other-WER: 21.37 | avg-isz: 1229 | avg-tsz: 040 | max-tsz: 109 | hrs:  960.40 | thrpt(sec/sec): 290.32
epoch:       17 | nupdates:       597482 | lr: 0.400000 | lrcriterion: 0.400000 | runtime: 03:18:30 | bch(ms): 338.88 | smp(ms): 1.16 | fwd(ms): 114.41 | crit-fwd(ms): 8.61 | bwd(ms): 176.94 | optim(ms): 45.95 | loss:   12.89857 | train-TER: 21.38 | train-WER: 32.73 | dev-clean-loss:    4.19587 | dev-clean-TER:  5.99 | dev-clean-WER: 12.50 | dev-other-loss:    6.17563 | dev-other-TER: 12.03 | dev-other-WER: 21.70 | avg-isz: 1229 | avg-tsz: 040 | max-tsz: 101 | hrs:  960.40 | thrpt(sec/sec): 290.29
epoch:       18 | nupdates:       632628 | lr: 0.400000 | lrcriterion: 0.400000 | runtime: 03:18:05 | bch(ms): 338.17 | smp(ms): 1.13 | fwd(ms): 114.31 | crit-fwd(ms): 8.60 | bwd(ms): 176.75 | optim(ms): 45.57 | loss:   12.81981 | train-TER: 21.14 | train-WER: 32.39 | dev-clean-loss:    4.03204 | dev-clean-TER:  5.69 | dev-clean-WER: 12.23 | dev-other-loss:    5.97668 | dev-other-TER: 11.63 | dev-other-WER: 21.38 | avg-isz: 1229 | avg-tsz: 040 | max-tsz: 100 | hrs:  960.40 | thrpt(sec/sec): 290.89
epoch:       19 | nupdates:       667774 | lr: 0.400000 | lrcriterion: 0.400000 | runtime: 03:17:37 | bch(ms): 337.38 | smp(ms): 1.09 | fwd(ms): 114.23 | crit-fwd(ms): 8.60 | bwd(ms): 176.54 | optim(ms): 45.12 | loss:   12.75817 | train-TER: 21.30 | train-WER: 32.58 | dev-clean-loss:    3.98113 | dev-clean-TER:  5.47 | dev-clean-WER: 11.89 | dev-other-loss:    5.98956 | dev-other-TER: 11.51 | dev-other-WER: 21.28 | avg-isz: 1229 | avg-tsz: 040 | max-tsz: 100 | hrs:  960.40 | thrpt(sec/sec): 291.58

The trend is reasonable, but why is train-WER higher than dev-*-WER? Is there something I missed? Thanks.

rajeevbaalwan commented 4 years ago

Did you use SpecAugment while training?

xiaosdawn commented 4 years ago

Yes, --warmup=32000; this should be SpecAugment, I guess. Here is my train.cfg:

--runname=am_transformer_ctc_librispeech
--rundir=/root/wav2letter-release-20200729/tutorials/1-librispeech_clean/librispeech_rundir
--archdir=/root/wav2letter-release-20200729/recipes/models/sota/2019
--arch=am_arch/am_transformer_ctc.arch
--tokensdir=/root/wav2letter-release-20200729/recipes/models/sota/2019/model_dst/am
--tokens=librispeech-train-all-unigram-10000.tokens
--lexicon=/root/wav2letter-release-20200729/recipes/models/sota/2019/model_dst/am/librispeech-train+dev-unigram-10000-nbest10.lexicon
--train=/app/lists/train-clean-100.lst,/app/lists/train-clean-360.lst,/app/lists/train-other-500.lst
--valid=dev-clean:/app/lists/dev-clean.lst,dev-other:/app/lists/dev-other.lst
--criterion=ctc
--mfsc
--usewordpiece=true
--wordseparator=_
--labelsmooth=0.05
--dataorder=output_spiral
--inputbinsize=25
--softwstd=4
--memstepsize=5000000
--pcttraineval=1
--pctteacherforcing=99
--sampletarget=0.01
--netoptim=adadelta
--critoptim=adadelta
--lr=0.4
--lrcrit=0.4
--linseg=0
--momentum=0.0
--maxgradnorm=1.0
--onorm=target
--sqnorm
--nthread=6
--batchsize=8
--filterbanks=80
--minisz=200
--mintsz=2
--enable_distributed
--warmup=32000
--saug_start_update=32000
--lr_decay=180
--lr_decay_step=40

rajeevbaalwan commented 4 years ago

@xiaosdawn Due to SpecAugment, your train WER is higher than on all the validation sets, since in every epoch the model is trained on newly augmented data, which makes it difficult to overfit. This is normal if you are using SpecAugment. You can skip SpecAugment, but then there is a chance that your model might overfit; in that case your train WER would be the lowest, below all the validation sets.

xiaosdawn commented 4 years ago

OK, thank you @rajeevbaalwan. I'll keep it running and wait for more results.

xiaosdawn commented 4 years ago

As discussed above and per my cfg, the model (transformer_ctc) is trained on 281241 samples in total, which gives 35156 batches per epoch (batchsize=8). With warmup=32000, SpecAugment will be activated by epoch 2, and the learning rate is warmed up (linearly increased) to 0.4 (--lr=0.4). It is then kept at 0.4 until epoch 180 (--lr_decay=180). I'm not sure whether the SpecAugment stage runs from epoch 2 to epoch 180. Then the learning rate is divided every 40 epochs (--lr_decay_step=40), but what is the decay factor? I noticed there is a gamma parameter in Defines.cpp; I'm not sure whether gamma controls the learning-rate decay in this recipe (transformer_ctc). If I have misunderstood something, please let me know. Thanks.
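
A quick back-of-the-envelope check on those numbers (assuming no sample filtering; the log above actually shows ~35146 updates per epoch, presumably because a few utterances are dropped by --minisz/--mintsz):

samples = 281241                    # train-clean-100 + train-clean-360 + train-other-500
batchsize = 8
updates_per_epoch = (samples + batchsize - 1) // batchsize
print(updates_per_epoch)            # 35156
warmup = saug_start_update = 32000
print(warmup / updates_per_epoch)   # ~0.91, so warmup ends and SpecAugment starts
                                    # late in epoch 1; it is fully active from epoch 2 on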

tlikhomanenko commented 4 years ago

warmup only controls the warmup stage, during which the learning rate is linearly increased up to lr. To activate SpecAugment you either use a layer in the arch file (then SpecAugment is used from the very first update), or you specify the flag saug_start_update (https://github.com/facebookresearch/wav2letter/blob/master/src/common/Defines.cpp#L90) to set from which update SpecAugment is applied.

Then the learning rate is divided every 40 epochs (--lr_decay_step=40), but what is the decay factor? I noticed there is a gamma parameter in Defines.cpp; I'm not sure whether gamma controls the learning-rate decay in this recipe (transformer_ctc).

The decay factor is 2, so the LR will be divided by 2 every 40 epochs. For transformer_ctc we are not using the gamma factor, only lr_decay_step and lr_decay.

xiaosdawn commented 4 years ago

Thanks @tlikhomanenko. So the decay factor defaults to 2 for transformer_ctc. Where can I find it (the decay factor)?

tlikhomanenko commented 4 years ago

It is hardcoded in the code here: https://github.com/facebookresearch/wav2letter/blob/master/Train.cpp#L559, so you can add one more flag for this and replace the 0.5 with the flag.

xiaosdawn commented 4 years ago

Thank you very much. I got it. @tlikhomanenko

tlikhomanenko commented 4 years ago

Closing the issue for now; it seems we solved the main problem. Feel free to reopen or continue the thread if needed.

xiaosdawn commented 4 years ago

I have been training transformer_ctc for days; the train.cfg is described above. First, I stopped the training at the end of epoch 50 on 1 GPU. Then I continued training the model from epoch 51 on 2 GPUs. For now, at epoch 211, the LR is still 0.4, the same as the value specified in --lr. rajeevbaalwan described above:

LR is 0.4 until epoch 180
LR is 0.2 from epochs 181-220
LR is 0.1 from epochs 221-260
LR is 0.05 from epochs 261-300

I don't know whether it's due to the number of GPUs used for training or due to resuming the training. Maybe the LR stays at 0.4 until epoch 230 (50+180)?
BTW, training is otherwise normal; I'm just a little confused about the LR decay. Looking forward to your help. Thanks.

Edit: LR is 0.4 until epoch 219, LR is 0.2 from epochs 220-259, LR is 0.1 from epoch 260 onward. So, starting from epoch 220, the LR is divided by 2 every 40 epochs. That part looks normal, but the LR behaviour from epoch 1 to 219 is still a little confusing. My settings are as mentioned above:

As discussed above and per my cfg, the model (transformer_ctc) is trained on 281241 samples in total, which gives 35156 batches per epoch (batchsize=8). With warmup=32000, SpecAugment will be activated by epoch 2, and the learning rate is warmed up (linearly increased) to 0.4 (--lr=0.4). It is then kept at 0.4 until epoch 180 (--lr_decay=180).

If I missed something, please let me know. I hope you can help solve the puzzle. Thanks.

tlikhomanenko commented 4 years ago

@xiaosdawn We recently fixed (again) the LR decay; it was a really bogus bug where the decay happened not at lr_decay but at lr_decay + lr_decay_step. Now it is fixed, see commit e7c4d174ab581ce28df7cd3518ad936eaa752cea.
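
That off-by-one matches the observation above: with --lr_decay=180 and --lr_decay_step=40, the buggy behaviour pushed the first halving to around epoch 220 instead of 180, which is what the logs in this thread show. A rough sketch of the difference (not the actual Train.cpp condition):

lr_decay, lr_decay_step = 180, 40

def first_decay_epoch(fixed):
    # Before the fix, the first halving effectively happened one decay step late.
    return lr_decay if fixed else lr_decay + lr_decay_step

print(first_decay_epoch(fixed=False))  # 220, roughly what was observed before the fix
print(first_decay_epoch(fixed=True))   # 180, the intended behaviour after the fix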

xiaosdawn commented 4 years ago

That's wonderful. Thanks to all of you. I'll experiment more.