qzfnihao opened this issue 4 years ago
Could you pull the latest w2l? Or you can set --warmup=1
in your config (we set the default value of warmup to 8000, but in the latest commit it was changed to 1 to be consistent with TDS training, where warmup is not used and by default there should be no warmup). Then check that during training lr and lrcrit are constant and set to the values you provided.
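As a minimal sketch, the relevant lines in the training .cfg would look like the following (values other than --warmup=1 are placeholders; check the flag names against your own config):

```
# disable learning-rate warmup; with the older default of --warmup=8000 the
# reported lr/lrcrit stay near 0.000000 for the first 8000 updates
--warmup=1
--lr=0.3        # placeholder: whatever lr your recipe specifies
--lrcrit=0.3    # placeholder: criterion lr, if your config sets one
```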
@tlikhomanenko It is mentioned in the SOTA documentation for the transformer:
The model is trained with total batch size 128 for approximately 320 epochs with Adadelta. There is a warmup stage: SpecAugment is activated only after warmup, and the learning rate is warmed up (linearly increased) over the first 32000 updates to 0.4. It is then divided by 2 at epoch 180, and then every 40 epochs. Last 10 epochs are done with lr=0.001.
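For illustration only (assuming the run is exactly 320 epochs, which the doc gives only approximately): the warmup is per-update, lr(u) = 0.4·u/32000 for u ≤ 32000, and afterwards the per-epoch learning rate amounts to

$$
\mathrm{lr}(e) =
\begin{cases}
0.4, & e < 180,\\
0.4 \cdot 2^{-\left(1 + \lfloor (e - 180)/40 \rfloor\right)}, & 180 \le e \le 310,\\
0.001, & 311 \le e \le 320.
\end{cases}
$$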
But it is not mentioned how many GPUs the training should be run on with the provided config file.
Because of this it is hard to complete the training; some issue always comes up during training.
It would be better if the docs also stated how many GPUs the given config parameters are meant for.
For the transformer architecture the batch size in the given config is 8, but the doc says the total batch size is 128. So can I conclude that the training should be done on 128/8 = 16 GPUs?
@qzfnihao Did you generate the lexicon file librispeech-train+dev-unigram-10000-nbest10.lexicon using the code provided in the repo, or did you just download it from the repo?
I have used the same file; instead of creating it for LibriSpeech I downloaded it from the repo. But the problem is that in the lexicon file the entries are in the _word format, while in the code the lexicon file is expected in the word w o r d | format.
@rajeevbaalwan
For the transformer architecture the batch size in the given config is 8, but the doc says the total batch size is 128. So can I conclude that the training should be done on 128/8 = 16 GPUs?
Yep, this is correct: 16 GPUs. Do you think it would be good to have something like "with total batch size 128 (trained on 16 GPUs)"? We specified the total batch size because you can use whatever number of GPUs you have (depending on memory); the training step depends only on the total batch size.
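For example, a rough sketch of keeping the total batch size at 128 on 4 GPUs (the flag file path here is a placeholder, not the actual recipe file name):

```
# total batch size = 4 GPUs x 32 per GPU = 128, provided each GPU has enough memory
mpirun -n 4 ./Train train \
  --flagsfile=path/to/train_am_transformer_ctc.cfg \
  --batchsize=32 \
  --enable_distributed=true
```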
But the problem is that in the lexicon file the entries are in the _word format, while in the code the lexicon file is expected in the word w o r d | format.
What do you mean here? The lexicon should map words to word-piece sequences, because the token set of the AMs consists of word-pieces, not letters.
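For illustration (made-up entries, not copied from librispeech-train+dev-unigram-10000-nbest10.lexicon), a word-piece lexicon maps a word to one or more word-piece spellings, e.g.

```
apple  _apple
apple  _app le
```

whereas the word w o r d | form (e.g. `apple a p p l e |`) is a letter-based lexicon, which only matches letter token sets.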
What do you mean here? The lexicon should map words to word-piece sequences, because the token set of the AMs consists of word-pieces, not letters.
I have mentioned the issue here.
@tlikhomanenko Most people don't have 16 GPUs to try out these models with the exact configurations provided in sota/2019. I am also trying to reproduce the transformer+CTC results on LibriSpeech using only 4 GPUs, and it is hard to find the correct set of parameters to train these SOTA models on limited hardware. It would be better if some information were provided on how to tweak the configuration parameters depending on GPU count, e.g. total batch size, hours of data, etc. It would be very helpful.
Yep, we know that people mostly don't have so many resources.
We specified the total batch size so that you could understand, for example, how you need to scale the learning rate in this case, or why you cannot reproduce the results. We didn't test with a smaller number of GPUs, which is why there are no recipes for how to tweak the parameters. For that reason, we publish our trained models so you can use them, for example, for finetuning instead of spending resources on retraining models on Librispeech. Obviously, the parameter settings for another dataset will be different.
About how many hours we trained our models: we specified the number of epochs, so you can estimate this for the resources you have by looking at the time for 1 epoch or for 1 update. With our resources (see the total batch size), the transformer models took 3-5 days, and the others took 7/21 days for the librispeech/librivox experiments.
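As a rough, untested sketch of the kind of adjustment involved (the linear lr-scaling heuristic below is an assumption, not something we validated): if you keep --batchsize=8 per GPU on 4 GPUs, the total batch size drops from 128 to 32, and the learning rate would be scaled accordingly:

```
# hypothetical 4-GPU adaptation of the transformer recipe (not validated):
# total batch = 4 x 8 = 32, linearly scaled lr = 0.4 * 32/128 = 0.1
--batchsize=8
--lr=0.1
```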
Some tips on adapting to smaller resources:
I tried to reproduce the results on librispeech using train_am_tds_ctc.cfg:
--runname=am_tds_ctc_librispeech
--rundir=/root/wav2letter.debug/recipes/models/sota/2019/librispeech/
--archdir=/root/wav2letter.debug/recipes/models/sota/2019/
--arch=am_arch/am_tds_ctc.arch
--tokensdir=/root/wav2letter.debug/recipes/models/sota/2019/model_data/am
--tokens=librispeech-train-all-unigram-10000.tokens
--lexicon=/root/wav2letter.debug/recipes/models/sota/2019/modeldata/am/librispeech-train+dev-unigram-10000-nbest10.lexicon
--train=/root/librispeech/lists/train-clean-100.lst,/root/librispeech/lists/train-clean-360.lst,/root/librispeech/lists/train-other-500.lst
--valid=dev-clean:/root/librispeech/lists/dev-clean.lst,dev-other:/root/librispeech/lists/dev-other.lst
--batchsize=16
--lr=0.3
--momentum=0.5
--maxgradnorm=1
--onorm=target
--sqnorm=true
--mfsc=true
--nthread=10
--criterion=ctc
--wordseparator=
--usewordpiece=true
--filterbanks=80
--framesizems=30
--framestridems=10
--seed=2
--reportiters=500
--enable_distributed=true
--gamma=0.5
--stepsize=200
but I get the 001_log as follows:
epoch: 1 | nupdates: 500 | lr: 0.000000 | lrcriterion: 0.000000 | runtime: 00:08:30 | bch(ms): 1020.84 | smp(ms): 3.20 | fwd(ms): 280.11 | crit-fwd(ms): 13.44 | bwd(ms): 651.17 | optim(ms): 48.47 | loss: 47.19106 | train-TER: 100.69 | train-WER: 100.80 | dev-clean-loss: 35.89173 | dev-clean-TER: 100.00 | dev-clean-WER: 100.00 | dev-other-loss: 33.71466 | dev-other-TER: 100.00 | dev-other-WER: 100.00 | avg-isz: 1195 | avg-tsz: 047 | max-tsz: 076 | hrs: 106.24 | thrpt(sec/sec): 749.32
epoch: 1 | nupdates: 1000 | lr: 0.000000 | lrcriterion: 0.000000 | runtime: 00:08:41 | bch(ms): 1043.83 | smp(ms): 1.56 | fwd(ms): 265.69 | crit-fwd(ms): 13.56 | bwd(ms): 696.78 | optim(ms): 45.87 | loss: 44.21445 | train-TER: 100.00 | train-WER: 100.00 | dev-clean-loss: 35.89173 | dev-clean-TER: 100.00 | dev-clean-WER: 100.00 | dev-other-loss: 33.71466 | dev-other-TER: 100.00 | dev-other-WER: 100.00 | avg-isz: 1206 | avg-tsz: 047 | max-tsz: 075 | hrs: 107.28 | thrpt(sec/sec): 740.00
epoch: 1 | nupdates: 1500 | lr: 0.000000 | lrcriterion: 0.000000 | runtime: 00:08:33 | bch(ms): 1027.03 | smp(ms): 1.55 | fwd(ms): 270.01 | crit-fwd(ms): 13.66 | bwd(ms): 675.91 | optim(ms): 45.92 | loss: 44.64984 | train-TER: 100.00 | train-WER: 100.00 | dev-clean-loss: 35.89173 | dev-clean-TER: 100.00 | dev-clean-WER: 100.00 | dev-other-loss: 33.71466 | dev-other-TER: 100.00 | dev-other-WER: 100.00 | avg-isz: 1224 | avg-tsz: 048 | max-tsz: 076 | hrs: 108.81 | thrpt(sec/sec): 762.85
epoch: 1 | nupdates: 2000 | lr: 0.000000 | lrcriterion: 0.000000 | runtime: 00:08:31 | bch(ms): 1023.50 | smp(ms): 1.61 | fwd(ms): 269.77 | crit-fwd(ms): 13.80 | bwd(ms): 672.94 | optim(ms): 45.84 | loss: 44.63917 | train-TER: 100.00 | train-WER: 100.00 | dev-clean-loss: 35.89173 | dev-clean-TER: 100.00 | dev-clean-WER: 100.00 | dev-other-loss: 33.71466 | dev-other-TER: 100.00 | dev-other-WER: 100.00 | avg-isz: 1223 | avg-tsz: 048 | max-tsz: 081 | hrs: 108.77 | thrpt(sec/sec): 765.14
epoch: 1 | nupdates: 2500 | lr: 0.000000 | lrcriterion: 0.000000 | runtime: 00:08:32 | bch(ms): 1024.02 | smp(ms): 1.58 | fwd(ms): 268.82 | crit-fwd(ms): 13.59 | bwd(ms): 674.68 | optim(ms): 45.68 | loss: 44.53821 | train-TER: 100.00 | train-WER: 100.00 | dev-clean-loss: 35.89173 | dev-clean-TER: 100.00 | dev-clean-WER: 100.00 | dev-other-loss: 33.71466 | dev-other-TER: 100.00 | dev-other-WER: 100.00 | avg-isz: 1220 | avg-tsz: 048 | max-tsz: 079 | hrs: 108.51 | thrpt(sec/sec): 762.97
epoch: 1 | nupdates: 3000 | lr: 0.000000 | lrcriterion: 0.000000 | runtime: 00:08:40 | bch(ms): 1041.39 | smp(ms): 1.60 | fwd(ms): 275.54 | crit-fwd(ms): 13.93 | bwd(ms): 685.27 | optim(ms): 45.68 | loss: 45.11741 | train-TER: 100.00 | train-WER: 100.00 | dev-clean-loss: 35.89173 | dev-clean-TER: 100.00 | dev-clean-WER: 100.00 | dev-other-loss: 33.71466 | dev-other-TER: 100.00 | dev-other-WER: 100.00 | avg-isz: 1248 | avg-tsz: 049 | max-tsz: 083 | hrs: 110.99 | thrpt(sec/sec): 767.35
epoch: 1 | nupdates: 3500 | lr: 0.000000 | lrcriterion: 0.000000 | runtime: 00:08:32 | bch(ms): 1024.40 | smp(ms): 1.59 | fwd(ms): 268.83 | crit-fwd(ms): 13.68 | bwd(ms): 674.81 | optim(ms): 45.73 | loss: 44.48326 | train-TER: 100.00 | train-WER: 100.00 | dev-clean-loss: 35.89173 | dev-clean-TER: 100.00 | dev-clean-WER: 100.00 | dev-other-loss: 33.71466 | dev-other-TER: 100.00 | dev-other-WER: 100.00 | avg-isz: 1218 | avg-tsz: 048 | max-tsz: 079 | hrs: 108.33 | thrpt(sec/sec): 761.38
epoch: 1 | nupdates: 4000 | lr: 0.000000 | lrcriterion: 0.000000 | runtime: 00:08:40 | bch(ms): 1041.84 | smp(ms): 1.60 | fwd(ms): 275.15 | crit-fwd(ms): 13.91 | bwd(ms): 685.99 | optim(ms): 45.68 | loss: 45.04881 | train-TER: 100.00 | train-WER: 100.00 | dev-clean-loss: 35.89173 | dev-clean-TER: 100.00 | dev-clean-WER: 100.00 | dev-other-loss: 33.71466 | dev-other-TER: 100.00 | dev-other-WER: 100.00 | avg-isz: 1248 | avg-tsz: 049 | max-tsz: 075 | hrs: 110.94 | thrpt(sec/sec): 766.67
epoch: 2 | nupdates: 4500 | lr: 0.000000 | lrcriterion: 0.000000 | runtime: 00:08:49 | bch(ms): 1058.68 | smp(ms): 10.64 | fwd(ms): 277.00 | crit-fwd(ms): 14.05 | bwd(ms): 691.72 | optim(ms): 45.72 | loss: 45.16758 | train-TER: 100.00 | train-WER: 100.00 | dev-clean-loss: 36.31457 | dev-clean-TER: 100.00 | dev-clean-WER: 100.00 | dev-other-loss: 34.09374 | dev-other-TER: 100.00 | dev-other-WER: 100.00 | avg-isz: 1259 | avg-tsz: 049 | max-tsz: 091 | hrs: 111.98 | thrpt(sec/sec): 761.54
epoch: 2 | nupdates: 5000 | lr: 0.000000 | lrcriterion: 0.000000 | runtime: 00:08:36 | bch(ms): 1032.60 | smp(ms): 1.63 | fwd(ms): 270.11 | crit-fwd(ms): 13.84 | bwd(ms): 680.64 | optim(ms): 45.82 | loss: 44.48489 | train-TER: 100.00 | train-WER: 100.00 | dev-clean-loss: 36.31457 | dev-clean-TER: 100.00 | dev-clean-WER: 100.00 | dev-other-loss: 34.09374 | dev-other-TER: 100.00 | dev-other-WER: 100.00 | avg-isz: 1224 | avg-tsz: 048 | max-tsz: 083 | hrs: 108.83 | thrpt(sec/sec): 758.84
The loss and WER did not decrease.
I read Train.cpp and found that the parameters gamma and stepsize control the learning rate decay, so I removed these 2 parameters and retrained. This time the loss first went down and then went up after 3 iterations, and the WER stayed at 100. Then I read the README again and set stepsize to 800, because I use 4 GPUs, each with batch size 16. Still I got the first result.
What is wrong in my parameter settings?