- Why are the spec_aug and warmup flags missing from the provided config?
We released support for warmup and SpecAugment for the transformer a bit later than the recipe itself. Next week we are planning to release all models and other material from the latest version of the paper.
- Is there a way to decide how many epochs are enough for warmup based on dataset size? Is there any documentation on SpecAugment with wav2letter?
There is no recipe for the number of warmup epochs. You can set a small warmup and check whether the model blows up or not; this also depends on the learning rate. If it blows up, you increase the warmup. Regarding SpecAugment, what documentation are you asking about? Right now we support it from the command line without adding a layer to the model: https://github.com/facebookresearch/wav2letter/blob/master/src/common/Defines.cpp#L90 defines at which update SpecAugment starts to be used (the SpecAugment settings are here: https://github.com/facebookresearch/wav2letter/blob/master/src/common/Defines.cpp#L136).
- For the SOTA models, is the warmup stage done for 32000 updates, and is SpecAugment then used for the rest of training until the end?
Yep, after warmup is done we turn on SpecAugment and use it until the end of training.
4. For the complete LibriSpeech data with a total batch size of 128, the total number of updates in one epoch is ~2197, so the warmup stage consists of ~15 epochs (32000 updates) and the LR will be 0.4 at that point? And the LR between epochs 15-180 will stay at 0.4?
Yep, all correct. During warmup the LR is increased linearly from 0 to 0.4, then it is constant at 0.4, and then we decay it by a factor of 2 depending on the model.
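To make the arithmetic above concrete, here is a small hedged sketch (not wav2letter code; the ~281k training utterances, the total batch size of 128, and the decay-by-2 schedule are taken from this thread) that reproduces the ~2197 updates/epoch, the ~15-epoch warmup, and the post-warmup LR values:

```cpp
#include <cstdio>

// Hypothetical helper mirroring the schedule described above (not wav2letter
// code): constant peak lr until decayEpoch, then halved every decayStep epochs.
double lrAtEpoch(long epoch, double peakLr, long decayEpoch, long decayStep) {
  if (epoch <= decayEpoch) {
    return peakLr;
  }
  long nHalvings = (epoch - decayEpoch - 1) / decayStep + 1;
  return peakLr / static_cast<double>(1L << nHalvings);
}

int main() {
  // Full LibriSpeech (~281241 utterances, as counted later in this thread)
  // with total batch size 128: ~2197 updates per epoch, so the 32000-update
  // warmup spans roughly 32000 / 2197 ~= 15 epochs.
  const long updatesPerEpoch = 281241 / 128;
  std::printf("updates/epoch ~= %ld, warmup ~= %ld epochs\n",
              updatesPerEpoch, 32000 / updatesPerEpoch);
  const long epochs[] = {100, 180, 181, 220, 221, 260, 261, 300};
  for (long e : epochs) {
    std::printf("epoch %3ld  lr = %.4f\n", e, lrAtEpoch(e, 0.4, 180, 40));
  }
  return 0;
}
```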
I've found 2 params which might be relevant for warmup and specAug:
To reproduce the training you found the correct parameters (see the update to the recipe this week too): yep, just set --warmup=32000 --saug_start_update=32000,
which means warm up for 32k updates and then start SpecAugment.
@tlikhomanenko Thanks for the clarifications. I tried those parameters and it's working. But what do you mean by the model blowing up: is it the loss exploding during training, or just after the warmup stage? Will definitely check out the updates to the recipe this week.
The model can explode during warmup (this is for transformers).
Thanks @tlikhomanenko. I'll close this issue for now and will reopen if any issue occurs.
@tlikhomanenko Can specAug work with wav2vec representations instead of spectrograms?
Also, about this doc:
Transformer CTC training: The model is trained with a total batch size of 128 for approximately 320 epochs with Adadelta. There is a warmup stage: SpecAugment is activated only after warmup, and the learning rate is warmed up (linearly increased) over the first 32000 updates to 0.4. It is then divided by 2 at epoch 180, and then every 40 epochs. The last 10 epochs are done with lr=0.001.
Now, as per the doc, after epoch 180 the LR is divided by 2 every 40 epochs. If I work it out that way:
LR is 0.4 till epoch 180
LR is 0.2 from epoch 181-220
LR is 0.1 from epoch 221-260
LR is 0.05 from epoch 261-300
The last 10 epochs are at LR 0.001, i.e. epochs 310 to 320.
But what about epochs 300-310? Something is missing from the documentation. Can you verify?
Can specAug work with wav2vec representations instead of spectrograms?
Yep, sure, it doesn't matter what features you have here.
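For intuition, here is a minimal hedged sketch of SpecAugment-style masking on a generic time-by-feature matrix (this is not wav2letter's implementation, and real SpecAugment variants also include time warping and masking with the mean value); nothing in it depends on the features being spectrograms, so wav2vec representations work just as well:

```cpp
#include <algorithm>
#include <random>
#include <vector>

// SpecAugment-style masking sketch: zero out random frequency bands and
// random spans of frames in a [time][featDim] matrix.
void maskFeatures(std::vector<std::vector<float>>& feats,
                  int nFreqMasks, int maxFreqWidth,
                  int nTimeMasks, int maxTimeWidth,
                  std::mt19937& rng) {
  if (feats.empty()) {
    return;
  }
  const int T = static_cast<int>(feats.size());
  const int F = static_cast<int>(feats[0].size());

  // Frequency masks: zero out random bands of feature dimensions.
  for (int m = 0; m < nFreqMasks; ++m) {
    int w = std::uniform_int_distribution<int>(0, maxFreqWidth)(rng);
    int f0 = std::uniform_int_distribution<int>(0, std::max(0, F - w))(rng);
    for (int t = 0; t < T; ++t) {
      for (int f = f0; f < f0 + w && f < F; ++f) {
        feats[t][f] = 0.0f;
      }
    }
  }
  // Time masks: zero out random spans of frames.
  for (int m = 0; m < nTimeMasks; ++m) {
    int w = std::uniform_int_distribution<int>(0, maxTimeWidth)(rng);
    int t0 = std::uniform_int_distribution<int>(0, std::max(0, T - w))(rng);
    for (int t = t0; t < t0 + w && t < T; ++t) {
      std::fill(feats[t].begin(), feats[t].end(), 0.0f);
    }
  }
}

int main() {
  std::mt19937 rng(42);
  // e.g. 1000 frames of 80-dim features (filterbanks, wav2vec vectors, ...).
  std::vector<std::vector<float>> feats(1000, std::vector<float>(80, 1.0f));
  maskFeatures(feats, /*nFreqMasks=*/2, /*maxFreqWidth=*/27,
               /*nTimeMasks=*/2, /*maxTimeWidth=*/100, rng);
  return 0;
}
```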
But what about epochs 300-310? Something is missing from the documentation. Can you verify?
Epochs 300-310 are done with 0.025. We just stopped the model when we didn't see improvement and tried to fine-tune a bit with a very small lr at the end.
Hi, sorry for bothering you. I trained a transformer_ctc model following sota/2019.
The 001_log is as follows:
epoch: 1 | nupdates: 35146 | lr: 0.400000 | lrcriterion: 0.400000 | runtime: 03:20:12 | bch(ms): 341.80 | smp(ms): 1.35 | fwd(ms): 115.18 | crit-fwd(ms): 8.68 | bwd(ms): 177.97 | optim(ms): 46.81 | loss: 40.02281 | train-TER: 88.82 | train-WER: 93.16 | dev-clean-loss: 20.26408 | dev-clean-TER: 60.21 | dev-clean-WER: 73.67 | dev-other-loss: 21.10677 | dev-other-TER: 65.98 | dev-other-WER: 79.89 | avg-isz: 1229 | avg-tsz: 040 | max-tsz: 101 | hrs: 960.40 | thrpt(sec/sec): 287.81
epoch: 2 | nupdates: 70292 | lr: 0.400000 | lrcriterion: 0.400000 | runtime: 03:19:30 | bch(ms): 340.60 | smp(ms): 1.27 | fwd(ms): 114.95 | crit-fwd(ms): 8.69 | bwd(ms): 177.65 | optim(ms): 46.29 | loss: 31.18885 | train-TER: 63.83 | train-WER: 78.63 | dev-clean-loss: 12.04277 | dev-clean-TER: 30.57 | dev-clean-WER: 46.37 | dev-other-loss: 14.23059 | dev-other-TER: 40.01 | dev-other-WER: 57.63 | avg-isz: 1229 | avg-tsz: 040 | max-tsz: 100 | hrs: 960.40 | thrpt(sec/sec): 288.82
epoch: 3 | nupdates: 105438 | lr: 0.400000 | lrcriterion: 0.400000 | runtime: 03:19:44 | bch(ms): 340.98 | smp(ms): 1.31 | fwd(ms): 114.94 | crit-fwd(ms): 8.65 | bwd(ms): 177.68 | optim(ms): 46.59 | loss: 23.19176 | train-TER: 44.56 | train-WER: 60.22 | dev-clean-loss: 7.79196 | dev-clean-TER: 16.91 | dev-clean-WER: 28.13 | dev-other-loss: 10.19768 | dev-other-TER: 26.75 | dev-other-WER: 41.08 | avg-isz: 1229 | avg-tsz: 040 | max-tsz: 101 | hrs: 960.40 | thrpt(sec/sec): 288.50
epoch: 4 | nupdates: 140584 | lr: 0.400000 | lrcriterion: 0.400000 | runtime: 03:19:29 | bch(ms): 340.57 | smp(ms): 1.30 | fwd(ms): 114.77 | crit-fwd(ms): 8.64 | bwd(ms): 177.41 | optim(ms): 46.62 | loss: 19.02595 | train-TER: 35.44 | train-WER: 50.09 | dev-clean-loss: 6.34031 | dev-clean-TER: 12.36 | dev-clean-WER: 22.22 | dev-other-loss: 8.79135 | dev-other-TER: 21.42 | dev-other-WER: 34.57 | avg-isz: 1229 | avg-tsz: 040 | max-tsz: 099 | hrs: 960.40 | thrpt(sec/sec): 288.85
epoch: 5 | nupdates: 175730 | lr: 0.400000 | lrcriterion: 0.400000 | runtime: 03:19:14 | bch(ms): 340.15 | smp(ms): 1.29 | fwd(ms): 114.67 | crit-fwd(ms): 8.63 | bwd(ms): 177.21 | optim(ms): 46.52 | loss: 16.96566 | train-TER: 31.50 | train-WER: 45.09 | dev-clean-loss: 5.44645 | dev-clean-TER: 10.42 | dev-clean-WER: 19.35 | dev-other-loss: 7.79601 | dev-other-TER: 18.52 | dev-other-WER: 30.30 | avg-isz: 1229 | avg-tsz: 040 | max-tsz: 101 | hrs: 960.40 | thrpt(sec/sec): 289.21
epoch: 6 | nupdates: 210876 | lr: 0.400000 | lrcriterion: 0.400000 | runtime: 03:19:08 | bch(ms): 339.97 | smp(ms): 1.29 | fwd(ms): 114.66 | crit-fwd(ms): 8.62 | bwd(ms): 177.19 | optim(ms): 46.38 | loss: 15.67413 | train-TER: 27.86 | train-WER: 40.86 | dev-clean-loss: 4.75529 | dev-clean-TER: 8.82 | dev-clean-WER: 17.38 | dev-other-loss: 7.03329 | dev-other-TER: 16.09 | dev-other-WER: 28.12 | avg-isz: 1229 | avg-tsz: 040 | max-tsz: 103 | hrs: 960.40 | thrpt(sec/sec): 289.36
epoch: 7 | nupdates: 246022 | lr: 0.400000 | lrcriterion: 0.400000 | runtime: 03:19:15 | bch(ms): 340.16 | smp(ms): 1.28 | fwd(ms): 114.65 | crit-fwd(ms): 8.62 | bwd(ms): 177.28 | optim(ms): 46.49 | loss: 14.89369 | train-TER: 26.17 | train-WER: 38.77 | dev-clean-loss: 4.76659 | dev-clean-TER: 8.09 | dev-clean-WER: 16.06 | dev-other-loss: 6.97389 | dev-other-TER: 15.56 | dev-other-WER: 26.65 | avg-isz: 1229 | avg-tsz: 040 | max-tsz: 100 | hrs: 960.40 | thrpt(sec/sec): 289.20
epoch: 8 | nupdates: 281168 | lr: 0.400000 | lrcriterion: 0.400000 | runtime: 03:17:52 | bch(ms): 337.80 | smp(ms): 1.12 | fwd(ms): 114.25 | crit-fwd(ms): 8.59 | bwd(ms): 176.61 | optim(ms): 45.42 | loss: 14.25666 | train-TER: 24.98 | train-WER: 37.12 | dev-clean-loss: 4.41068 | dev-clean-TER: 7.34 | dev-clean-WER: 14.95 | dev-other-loss: 6.37601 | dev-other-TER: 14.15 | dev-other-WER: 25.01 | avg-isz: 1229 | avg-tsz: 040 | max-tsz: 101 | hrs: 960.40 | thrpt(sec/sec): 291.22
epoch: 9 | nupdates: 316314 | lr: 0.400000 | lrcriterion: 0.400000 | runtime: 03:20:07 | bch(ms): 341.65 | smp(ms): 1.30 | fwd(ms): 114.91 | crit-fwd(ms): 8.64 | bwd(ms): 177.61 | optim(ms): 47.35 | loss: 13.73204 | train-TER: 24.32 | train-WER: 36.40 | dev-clean-loss: 4.19915 | dev-clean-TER: 7.00 | dev-clean-WER: 14.32 | dev-other-loss: 6.25162 | dev-other-TER: 13.64 | dev-other-WER: 24.17 | avg-isz: 1229 | avg-tsz: 040 | max-tsz: 100 | hrs: 960.40 | thrpt(sec/sec): 287.94
epoch: 10 | nupdates: 351460 | lr: 0.400000 | lrcriterion: 0.400000 | runtime: 03:21:06 | bch(ms): 343.31 | smp(ms): 1.36 | fwd(ms): 115.20 | crit-fwd(ms): 8.66 | bwd(ms): 178.11 | optim(ms): 48.13 | loss: 13.41283 | train-TER: 22.96 | train-WER: 34.70 | dev-clean-loss: 4.29421 | dev-clean-TER: 6.65 | dev-clean-WER: 13.88 | dev-other-loss: 6.21255 | dev-other-TER: 13.01 | dev-other-WER: 23.41 | avg-isz: 1229 | avg-tsz: 040 | max-tsz: 099 | hrs: 960.40 | thrpt(sec/sec): 286.54
epoch: 11 | nupdates: 386606 | lr: 0.400000 | lrcriterion: 0.400000 | runtime: 03:18:43 | bch(ms): 339.27 | smp(ms): 1.17 | fwd(ms): 114.52 | crit-fwd(ms): 8.61 | bwd(ms): 177.03 | optim(ms): 46.12 | loss: 13.27797 | train-TER: 22.35 | train-WER: 33.87 | dev-clean-loss: 4.29342 | dev-clean-TER: 6.63 | dev-clean-WER: 13.60 | dev-other-loss: 6.15515 | dev-other-TER: 12.98 | dev-other-WER: 23.13 | avg-isz: 1229 | avg-tsz: 040 | max-tsz: 105 | hrs: 960.40 | thrpt(sec/sec): 289.96
epoch: 12 | nupdates: 421752 | lr: 0.400000 | lrcriterion: 0.400000 | runtime: 03:17:34 | bch(ms): 337.29 | smp(ms): 1.07 | fwd(ms): 114.16 | crit-fwd(ms): 8.59 | bwd(ms): 176.52 | optim(ms): 45.13 | loss: 13.26192 | train-TER: 22.40 | train-WER: 34.09 | dev-clean-loss: 4.14542 | dev-clean-TER: 6.26 | dev-clean-WER: 13.42 | dev-other-loss: 6.14282 | dev-other-TER: 12.63 | dev-other-WER: 23.10 | avg-isz: 1229 | avg-tsz: 040 | max-tsz: 106 | hrs: 960.40 | thrpt(sec/sec): 291.66
epoch: 13 | nupdates: 456898 | lr: 0.400000 | lrcriterion: 0.400000 | runtime: 03:17:34 | bch(ms): 337.30 | smp(ms): 1.08 | fwd(ms): 114.19 | crit-fwd(ms): 8.60 | bwd(ms): 176.50 | optim(ms): 45.14 | loss: 13.19533 | train-TER: 22.31 | train-WER: 33.97 | dev-clean-loss: 4.30802 | dev-clean-TER: 6.22 | dev-clean-WER: 13.13 | dev-other-loss: 6.38784 | dev-other-TER: 12.45 | dev-other-WER: 22.41 | avg-isz: 1229 | avg-tsz: 040 | max-tsz: 101 | hrs: 960.40 | thrpt(sec/sec): 291.65
epoch: 14 | nupdates: 492044 | lr: 0.400000 | lrcriterion: 0.400000 | runtime: 03:18:32 | bch(ms): 338.95 | smp(ms): 1.16 | fwd(ms): 114.42 | crit-fwd(ms): 8.61 | bwd(ms): 176.96 | optim(ms): 45.98 | loss: 13.15719 | train-TER: 22.51 | train-WER: 34.24 | dev-clean-loss: 4.32588 | dev-clean-TER: 6.19 | dev-clean-WER: 12.97 | dev-other-loss: 6.17860 | dev-other-TER: 12.17 | dev-other-WER: 22.02 | avg-isz: 1229 | avg-tsz: 040 | max-tsz: 104 | hrs: 960.40 | thrpt(sec/sec): 290.23
epoch: 15 | nupdates: 527190 | lr: 0.400000 | lrcriterion: 0.400000 | runtime: 03:18:33 | bch(ms): 338.96 | smp(ms): 1.17 | fwd(ms): 114.44 | crit-fwd(ms): 8.61 | bwd(ms): 176.95 | optim(ms): 45.98 | loss: 13.05351 | train-TER: 21.52 | train-WER: 33.13 | dev-clean-loss: 4.28696 | dev-clean-TER: 5.98 | dev-clean-WER: 12.71 | dev-other-loss: 6.10256 | dev-other-TER: 11.96 | dev-other-WER: 21.82 | avg-isz: 1229 | avg-tsz: 040 | max-tsz: 105 | hrs: 960.40 | thrpt(sec/sec): 290.22
epoch: 16 | nupdates: 562336 | lr: 0.400000 | lrcriterion: 0.400000 | runtime: 03:18:28 | bch(ms): 338.84 | smp(ms): 1.15 | fwd(ms): 114.40 | crit-fwd(ms): 8.62 | bwd(ms): 176.94 | optim(ms): 45.92 | loss: 12.95347 | train-TER: 20.89 | train-WER: 32.18 | dev-clean-loss: 4.26626 | dev-clean-TER: 5.84 | dev-clean-WER: 12.40 | dev-other-loss: 6.09985 | dev-other-TER: 11.78 | dev-other-WER: 21.37 | avg-isz: 1229 | avg-tsz: 040 | max-tsz: 109 | hrs: 960.40 | thrpt(sec/sec): 290.32
epoch: 17 | nupdates: 597482 | lr: 0.400000 | lrcriterion: 0.400000 | runtime: 03:18:30 | bch(ms): 338.88 | smp(ms): 1.16 | fwd(ms): 114.41 | crit-fwd(ms): 8.61 | bwd(ms): 176.94 | optim(ms): 45.95 | loss: 12.89857 | train-TER: 21.38 | train-WER: 32.73 | dev-clean-loss: 4.19587 | dev-clean-TER: 5.99 | dev-clean-WER: 12.50 | dev-other-loss: 6.17563 | dev-other-TER: 12.03 | dev-other-WER: 21.70 | avg-isz: 1229 | avg-tsz: 040 | max-tsz: 101 | hrs: 960.40 | thrpt(sec/sec): 290.29
epoch: 18 | nupdates: 632628 | lr: 0.400000 | lrcriterion: 0.400000 | runtime: 03:18:05 | bch(ms): 338.17 | smp(ms): 1.13 | fwd(ms): 114.31 | crit-fwd(ms): 8.60 | bwd(ms): 176.75 | optim(ms): 45.57 | loss: 12.81981 | train-TER: 21.14 | train-WER: 32.39 | dev-clean-loss: 4.03204 | dev-clean-TER: 5.69 | dev-clean-WER: 12.23 | dev-other-loss: 5.97668 | dev-other-TER: 11.63 | dev-other-WER: 21.38 | avg-isz: 1229 | avg-tsz: 040 | max-tsz: 100 | hrs: 960.40 | thrpt(sec/sec): 290.89
epoch: 19 | nupdates: 667774 | lr: 0.400000 | lrcriterion: 0.400000 | runtime: 03:17:37 | bch(ms): 337.38 | smp(ms): 1.09 | fwd(ms): 114.23 | crit-fwd(ms): 8.60 | bwd(ms): 176.54 | optim(ms): 45.12 | loss: 12.75817 | train-TER: 21.30 | train-WER: 32.58 | dev-clean-loss: 3.98113 | dev-clean-TER: 5.47 | dev-clean-WER: 11.89 | dev-other-loss: 5.98956 | dev-other-TER: 11.51 | dev-other-WER: 21.28 | avg-isz: 1229 | avg-tsz: 040 | max-tsz: 100 | hrs: 960.40 | thrpt(sec/sec): 291.58
The trend is reasonable, but why is train-WER higher than dev-*-WER? Is there something I missed? THX.
Did you use specAugment while training?
Yes. --warmup=32000, this should be specAugment I guess. Here is my train.cfg:
--runname=am_transformer_ctc_librispeech
--rundir=/root/wav2letter-release-20200729/tutorials/1-librispeech_clean/librispeech_rundir
--archdir=/root/wav2letter-release-20200729/recipes/models/sota/2019
--arch=am_arch/am_transformer_ctc.arch
--tokensdir=/root/wav2letter-release-20200729/recipes/models/sota/2019/model_dst/am
--tokens=librispeech-train-all-unigram-10000.tokens
--lexicon=/root/wav2letter-release-20200729/recipes/models/sota/2019/model_dst/am/librispeech-train+dev-unigram-10000-nbest10.lexicon
--train=/app/lists/train-clean-100.lst,/app/lists/train-clean-360.lst,/app/lists/train-other-500.lst
--valid=dev-clean:/app/lists/dev-clean.lst,dev-other:/app/lists/dev-other.lst
--criterion=ctc
--mfsc
--usewordpiece=true
--wordseparator=_
--labelsmooth=0.05
--dataorder=output_spiral
--inputbinsize=25
--softwstd=4
--memstepsize=5000000
--pcttraineval=1
--pctteacherforcing=99
--sampletarget=0.01
--netoptim=adadelta
--critoptim=adadelta
--lr=0.4
--lrcrit=0.4
--linseg=0
--momentum=0.0
--maxgradnorm=1.0
--onorm=target
--sqnorm
--nthread=6
--batchsize=8
--filterbanks=80
--minisz=200
--mintsz=2
--enable_distributed
--warmup=32000
--saug_start_update=32000
--lr_decay=180
--lr_decay_step=40
@xiaosdawn Due to specAugment your train WER is higher than on all validation sets, since in every epoch the model is trained on newly augmented data, making it harder to overfit. This is normal if you are using specAugment. You can drop specAugment, but then there is a chance your model might overfit, in which case your train WER will be the lowest, below all validation sets.
OK, thank you @rajeevbaalwan. I'll keep it running and wait for more results.
As discussed above and per my cfg, the model (transformer_ctc) is trained with 281241 samples in total, which gives 35156 batches per epoch (batchsize=8), and warmup=32000. So in epoch 2 specAugment will be activated, and the learning rate is warmed up (linearly increased) to 0.4 (--lr=0.4). It then stays at 0.4 till epoch 180 (--lr_decay=180). I'm not sure whether the specAugment stage runs from epoch 2 to epoch 180.
Then the learning rate is divided every 40 epochs (--lr_decay_step=40). But what is the decay factor? I notice there is a gamma parameter in Defines.cpp; I'm not sure if gamma controls learning rate decay in this recipe (transformer_ctc).
If I have misunderstood something, please let me know. THX.
warmup only controls the warmup stage, during which the learning rate is linearly increased to lr. To activate specaug you either use a layer in the arch file (and specaug will then be used from the very first update), or you can specify the flag saug_start_update (https://github.com/facebookresearch/wav2letter/blob/master/src/common/Defines.cpp#L90) to set from which update specaug is applied.
Then the learning rate is divided every 40 epochs (--lr_decay_step=40). But what is the decay factor? I notice there is a gamma parameter in Defines.cpp. I'm not sure if gamma controls learning rate decay in this recipe (transformer_ctc).
The decay factor is 2, so the lr will be divided by 2 every 40 epochs. For transformer_ctc we are not using the gamma factor, only lr_decay_step and lr_decay.
Thanks @tlikhomanenko. So the decay factor defaults to 2 for transformer_ctc. Where can I find it (the decay factor)?
It is hardcoded in the code here: https://github.com/facebookresearch/wav2letter/blob/master/Train.cpp#L559, so you can add one more flag for this and replace the 0.5 with the flag.
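For example, a hedged sketch of such a flag (the name lr_decay_factor is hypothetical and not in the current Defines.cpp; the default of 0.5 keeps today's hardcoded behavior):

```cpp
#include <gflags/gflags.h>

// Hypothetical flag, not part of the current codebase. Defining it next to the
// other lr flags in Defines.cpp would make the decay factor configurable; the
// default 0.5 matches the value hardcoded in Train.cpp today.
DEFINE_double(
    lr_decay_factor,
    0.5,
    "factor the learning rate is multiplied by every lr_decay_step epochs "
    "after lr_decay epochs");
```

Train.cpp would then read FLAGS_lr_decay_factor in place of the literal 0.5.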
Thank you very much. I got it. @tlikhomanenko
Closing the issue for now, it seems we solved the main issue. Feel free to reopen or continue the thread if needed.
I have been training transformer_ctc for days; the train.cfg is described above. First, I stopped the training at the end of epoch 50 on 1 GPU. Then I continued training the model from epoch 51 on 2 GPUs. For now, at epoch 211, the lr is still 0.4, the same as the value specified in --lr.
rajeevbaalwan described above:
LR is 0.4 till epoch 180
LR is 0.2 from epoch 181-220
LR is 0.1 from epoch 221-260
LR is 0.05 from epoch 261-300
I don't know if it's due to the number of GPUs used for training, or due to the continued training. Maybe LR is 0.4 till epoch 230 (50+180).
BTW, training is normal; I'm just a little confused about the lr decay.
Looking forward to your help. THX.
Edit:
LR is 0.4 till epoch 219
LR is 0.2 from epoch 220-259
LR is 0.1 from epoch 260-
So, starting from epoch 220, the LR is divided by 2 every 40 epochs. This looks normal.
But the lr change from epoch 1 to 219 is still a little confusing. My settings are as mentioned above:
As discussed above and per my cfg, the model (transformer_ctc) is trained with 281241 samples in total, which gives 35156 batches per epoch (batchsize=8), and warmup=32000. So in epoch 2 specAugment will be activated, and the learning rate is warmed up (linearly increased) to 0.4 (--lr=0.4). It then stays at 0.4 till epoch 180 (--lr_decay=180).
If I missed something, please let me know. Hoping you can help solve the puzzle. THX.
@xiaosdawn We recently fixed (again) the lr decay; it was a really bogus bug where the decay happened not at lr_decay but at lr_decay + lr_decay_step (with --lr_decay=180 and --lr_decay_step=40 that is epoch 220, which matches what you observed). It is now fixed, see commit e7c4d174ab581ce28df7cd3518ad936eaa752cea.
That's wonderful. Thanks all of you. I'll try more.
I am trying to achieve good results on the LibriSpeech 100-hour data using the transformer + CTC architecture provided in https://github.com/facebookresearch/wav2letter/blob/master/recipes/models/sota/2019/librispeech/train_am_transformer_ctc.cfg
I found this information in the doc for the SOTA Transformer model.
I have a few doubts based on the above paragraph:
I've found 2 params which might be relevant for warmup and specAug:
1. What value do I need to specify for the warmup param? Is it the number of updates to warm up, i.e. 32000?