Architecture related issue

samin9796 commented 4 years ago

I am trying to train a dataset of around 250 hours with different architectures such as resnet, TDS and transformer mentioned in the recipe section. But my computer automatically restarts when I use these architectures with the same number of layers and other parameters. If I try to reduce the number of layers or some parameters on my own, it shows Cudnn bad parameter or arrayfire exception(invalid size). I have Nvidia GeForce gtx 1080ti card. It has 8 GPUs, each GPU 11 GB. If I use Nvidia-smi, I see there are 7 GPUs running ( I set # of gpu to 7 in the train.cfg), each taking only around 3500-4000 MB. Any solutions?

tlikhomanenko commented 4 years ago

@samin9796

Your error "Cudnn bad parameter or arrayfire exception(invalid size)." is related to the sizes of parameter/inputs you use. Just removing layer will not work, you need to be sure sizes of inputs to the layers and layers parameters sizes are consistent. If you need help here, please post your arch which gives this error. One thing you can try also is to use batchsize=1.

About computer restarting - have no idea, better to ask on nvidia cite if this could happens in case of OOM. Possibly you have internal setting of computer itself on restarting (like temperature or something else).

samin9796 commented 4 years ago

@tlikhomanenko

This is the architecture I used last time but got no error. However, after only two epochs, training was finished although it was supposed to be trained for 200 epochs. I found out that learning rate and lrcrit were decreasing pretty fast and eventually became zero.

V -1 NFEAT 1 0 C2 1 10 21 1 2 1 -1 -1 R DO 0.2 LN 0 1 2 TDS 10 21 80 0.2 TDS 10 21 80 0.2 C2 10 14 21 1 2 1 -1 -1 R DO 0.2 LN 0 1 2 TDS 14 21 80 0.2 TDS 14 21 80 0.2 TDS 14 21 80 0.2 C2 14 18 21 1 2 1 -1 -1 R DO 0.2 LN 0 1 2 TDS 18 21 80 0.2 TDS 18 21 80 0.2 TDS 18 21 80 0.2 TDS 18 21 80 0.2 TDS 18 21 80 0.2 TDS 18 21 80 0.2 V 0 1440 1 0 RO 1 0 3 2 L 1440 1024

This is the train.cfg. Criterion is seq2seq.

--rundir=/data/ahnaf/wav2letter/dataset_prep/ --datadir=/data/ahnaf/wav2letter/dataset_prep/ --tokensdir=/data/ahnaf/wav2letter/dataset_prep/ --train=train.lst --valid=validation.lst --lexicon=/data/ahnaf/wav2letter/dataset_prep/lexicon.txt --input=wav --tokens=tokens.txt --archdir=/data/ahnaf/wav2letter/dataset_prep/ --arch=network.arch --batchsize=4 --lr=0.5 --lrcrit=0.05 --momentum=0.5 --maxgradnorm=15 --mfsc=true --nthread=7 --criterion=seq2seq --maxdecoderoutputlen=120 --labelsmooth=0.05 --dataorder=outputspiral --inputbinsize=25 --attnWindow=softPretrain --softwstd=4 --trainWithWindow=true --pretrainWindow=3 --attention=keyvalue --encoderdim=512 --memstepsize=8338608 --eostoken=true --pcttraineval=1 --pctteacherforcing=99 --listdata=true --usewordpiece=true --wordseparator= --target=ltr --filterbanks=80 --sampletarget=0.01 --enable_distributed=true --iter=200 --framesizems=30 --framestridems=10 --decoderdropout=0.1 --decoderattnround=2 --decoderrnnlayer=3 --seed=2

tlikhomanenko commented 4 years ago

This is because --iter is not the number of epochs, it is number of updates. You can see that in the log it stopped with nupdates=201. Also set warmup=0 (it is 8k by default, however it is useful only for transformer arch).

Let me know if this solves the issue.

samin9796 commented 4 years ago

@tlikhomanenko Thank you. I could train a model with seq2seq TDS architecture. Now I would like to train a seq2seq transformer based model. If I train two models on 250 hrs and 400 hrs dataset, I am wondering if the following arch would be a good one or I need to reduce or increase the number of layers and other parameters. I am following the same training parameters mentioned in the recipe section.

V -1 1 NFEAT 0 WN 3 C NFEAT 1024 3 1 -1 GLU 2 DO 0.2 M 1 1 2 1 WN 3 C 512 1024 3 1 -1 GLU 2 DO 0.2 M 1 1 2 1 WN 3 C 512 1536 3 1 -1 GLU 2 DO 0.2 M 1 1 2 1 RO 2 0 3 1 TR 768 3072 4 460 0.2 0.2 TR 768 3072 4 460 0.2 0.2 TR 768 3072 4 460 0.2 0.2 TR 768 3072 4 460 0.2 0.2 TR 768 3072 4 460 0.2 0.2 TR 768 3072 4 460 0.2 0.2 TR 768 3072 4 460 0.2 0.2 TR 768 3072 4 460 0.2 0.2 TR 768 3072 4 460 0.2 0.2 TR 768 3072 4 460 0.2 0.2 TR 768 3072 4 460 0.2 0.2 TR 768 3072 4 460 0.2 0.2 TR 768 3072 4 460 0.2 0.2 TR 768 3072 4 460 0.2 0.2 TR 768 3072 4 460 0.2 0.2 TR 768 3072 4 460 0.2 0.2 TR 768 3072 4 460 0.2 0.2 TR 768 3072 4 460 0.2 0.2 TR 768 3072 4 460 0.2 0.2 TR 768 3072 4 460 0.2 0.2 TR 768 3072 4 460 0.2 0.2 TR 768 3072 4 460 0.2 0.2 TR 768 3072 4 460 0.2 0.2 TR 768 3072 4 460 0.2 0.2 TR 768 3072 4 460 0.2 0.2 TR 768 3072 4 460 0.2 0.2 L 768 512

tlikhomanenko commented 4 years ago

@samin9796

You could at first start with this arch and see how it goes, and if overfitting happens for you on train. If it overfits then reduce number of layers, try larger dropout. At least number of layers and dropout are the first thing you need to try for your data. Also learning rate and warmup are important, so you need to tune them too.

Some note: use instead of M 1 1 2 1 this one M 2 1 2 1.

samin9796 commented 4 years ago

@tlikhomanenko

Could you please tell me what these parameters are after TR? I can't find in the " writing arch file" doc?

TR 768 3072 4 460 0.2 0.2

Also, training loss and validation loss is decreasing, but no change in TER and WER. Initially, I set warmup to 20000 and after 5000 iterations, I stopped training and then continued training again. Here is the log:

epoch: 3 | nupdates: 14213 | lr: 0.142130 | lrcriterion: 0.035533 | runtime: 01:00:47 | bch(ms): 395.86 | smp(ms): 0.21 | fwd(ms): 52.40 | crit-fwd(ms): 5.70 | bwd(ms): 295.21 | optim(ms): 42.23 | loss: 259.70415 | train-TER: 85.91 | train-WER: 109.50 | validation.lst-loss: 170.30988 | validation.lst-TER: 84.72 | validation.lst-WER: 105.74 | avg-isz: 1288 | avg-tsz: 132 | max-tsz: 270 | hrs: 230.86 | thrpt(sec/sec): 227.89 epoch: 4 | nupdates: 23426 | lr: 0.200000 | lrcriterion: 0.050000 | runtime: 01:00:36 | bch(ms): 394.71 | smp(ms): 0.23 | fwd(ms): 51.42 | crit-fwd(ms): 5.32 | bwd(ms): 295.57 | optim(ms): 41.98 | loss: 231.05927 | train-TER: 85.05 | train-WER: 109.67 | validation.lst-loss: 141.39666 | validation.lst-TER: 83.19 | validation.lst-WER: 114.39 | avg-isz: 1288 | avg-tsz: 132 | max-tsz: 270 | hrs: 230.86 | thrpt(sec/sec): 228.55 epoch: 5 | nupdates: 32639 | lr: 0.200000 | lrcriterion: 0.050000 | runtime: 01:00:41 | bch(ms): 395.29 | smp(ms): 0.21 | fwd(ms): 51.46 | crit-fwd(ms): 5.34 | bwd(ms): 295.97 | optim(ms): 42.05 | loss: 210.76740 | train-TER: 85.86 | train-WER: 110.36 | validation.lst-loss: 114.67998 | validation.lst-TER: 85.03 | validation.lst-WER: 103.81 | avg-isz: 1288 | avg-tsz: 132 | max-tsz: 270 | hrs: 230.86 | thrpt(sec/sec): 228.21 epoch: 6 | nupdates: 41852 | lr: 0.200000 | lrcriterion: 0.050000 | runtime: 01:00:59 | bch(ms): 397.19 | smp(ms): 0.19 | fwd(ms): 51.40 | crit-fwd(ms): 5.31 | bwd(ms): 298.04 | optim(ms): 41.97 | loss: 196.30901 | train-TER: 85.32 | train-WER: 108.88 | validation.lst-loss: 100.69554 | validation.lst-TER: 85.51 | validation.lst-WER: 102.83 | avg-isz: 1288 | avg-tsz: 132 | max-tsz: 270 | hrs: 230.86 | thrpt(sec/sec): 227.12 epoch: 7 | nupdates: 51065 | lr: 0.200000 | lrcriterion: 0.050000 | runtime: 01:00:53 | bch(ms): 396.58 | smp(ms): 0.20 | fwd(ms): 51.54 | crit-fwd(ms): 5.33 | bwd(ms): 297.21 | optim(ms): 42.00 | loss: 186.97773 | train-TER: 85.04 | train-WER: 108.28 | validation.lst-loss: 91.19413 | validation.lst-TER: 84.70 | validation.lst-WER: 101.01 | avg-isz: 1288 | avg-tsz: 132 | max-tsz: 270 | hrs: 230.86 | thrpt(sec/sec): 227.47 epoch: 8 | nupdates: 60278 | lr: 0.200000 | lrcriterion: 0.050000 | runtime: 01:00:59 | bch(ms): 397.24 | smp(ms): 0.20 | fwd(ms): 51.40 | crit-fwd(ms): 5.33 | bwd(ms): 298.00 | optim(ms): 42.00 | loss: 179.90211 | train-TER: 85.54 | train-WER: 108.58 | validation.lst-loss: 81.97382 | validation.lst-TER: 83.41 | validation.lst-WER: 101.62 | avg-isz: 1288 | avg-tsz: 132 | max-tsz: 270 | hrs: 230.86 | thrpt(sec/sec): 227.09 epoch: 9 | nupdates: 69491 | lr: 0.200000 | lrcriterion: 0.050000 | runtime: 01:00:38 | bch(ms): 394.89 | smp(ms): 0.20 | fwd(ms): 51.42 | crit-fwd(ms): 5.32 | bwd(ms): 295.69 | optim(ms): 41.94 | loss: 174.64809 | train-TER: 85.48 | train-WER: 109.00 | validation.lst-loss: 78.12224 | validation.lst-TER: 86.13 | validation.lst-WER: 106.98 | avg-isz: 1288 | avg-tsz: 132 | max-tsz: 270 | hrs: 230.86 | thrpt(sec/sec): 228.44 epoch: 10 | nupdates: 78704 | lr: 0.200000 | lrcriterion: 0.050000 | runtime: 01:00:43 | bch(ms): 395.43 | smp(ms): 0.21 | fwd(ms): 51.24 | crit-fwd(ms): 5.33 | bwd(ms): 296.32 | optim(ms): 42.02 | loss: 170.35678 | train-TER: 85.39 | train-WER: 109.50 | validation.lst-loss: 73.18929 | validation.lst-TER: 85.16 | validation.lst-WER: 103.22 | avg-isz: 1288 | avg-tsz: 132 | max-tsz: 270 | hrs: 230.86 | thrpt(sec/sec): 228.13 epoch: 11 | nupdates: 87917 | lr: 0.200000 | lrcriterion: 0.050000 | runtime: 01:00:50 | bch(ms): 396.29 | smp(ms): 0.21 | fwd(ms): 51.67 | crit-fwd(ms): 5.39 | bwd(ms): 296.38 | optim(ms): 42.27 | loss: 166.88087 | train-TER: 85.57 | train-WER: 109.02 | validation.lst-loss: 69.65967 | validation.lst-TER: 86.52 | validation.lst-WER: 105.94 | avg-isz: 1288 | avg-tsz: 132 | max-tsz: 270 | hrs: 230.86 | thrpt(sec/sec): 227.64 epoch: 12 | nupdates: 97130 | lr: 0.200000 | lrcriterion: 0.050000 | runtime: 01:00:51 | bch(ms): 396.32 | smp(ms): 0.21 | fwd(ms): 51.45 | crit-fwd(ms): 5.33 | bwd(ms): 296.89 | optim(ms): 42.08 | loss: 163.83704 | train-TER: 85.45 | train-WER: 108.16 | validation.lst-loss: 67.06947 | validation.lst-TER: 85.80 | validation.lst-WER: 103.20 | avg-isz: 1288 | avg-tsz: 132 | max-tsz: 270 | hrs: 230.86 | thrpt(sec/sec): 227.62 epoch: 13 | nupdates: 106343 | lr: 0.200000 | lrcriterion: 0.050000 | runtime: 01:00:41 | bch(ms): 395.28 | smp(ms): 0.22 | fwd(ms): 51.36 | crit-fwd(ms): 5.32 | bwd(ms): 296.07 | optim(ms): 41.98 | loss: 161.33239 | train-TER: 85.56 | train-WER: 108.96 | validation.lst-loss: 65.31834 | validation.lst-TER: 83.51 | validation.lst-WER: 117.52 | avg-isz: 1288 | avg-tsz: 132 | max-tsz: 270 | hrs: 230.86 | thrpt(sec/sec): 228.22 epoch: 14 | nupdates: 115556 | lr: 0.200000 | lrcriterion: 0.050000 | runtime: 01:00:42 | bch(ms): 395.39 | smp(ms): 0.19 | fwd(ms): 51.12 | crit-fwd(ms): 5.31 | bwd(ms): 296.45 | optim(ms): 41.96 | loss: 159.19259 | train-TER: 85.26 | train-WER: 107.28 | validation.lst-loss: 62.88444 | validation.lst-TER: 84.42 | validation.lst-WER: 111.49 | avg-isz: 1288 | avg-tsz: 132 | max-tsz: 270 | hrs: 230.86 | thrpt(sec/sec): 228.16 epoch: 15 | nupdates: 124769 | lr: 0.200000 | lrcriterion: 0.050000 | runtime: 01:00:37 | bch(ms): 394.84 | smp(ms): 0.19 | fwd(ms): 51.22 | crit-fwd(ms): 5.31 | bwd(ms): 295.81 | optim(ms): 41.96 | loss: 157.04240 | train-TER: 85.79 | train-WER: 108.01 | validation.lst-loss: 61.11823 | validation.lst-TER: 85.27 | validation.lst-WER: 103.10 | avg-isz: 1288 | avg-tsz: 132 | max-tsz: 270 | hrs: 230.86 | thrpt(sec/sec): 228.47 epoch: 16 | nupdates: 133982 | lr: 0.200000 | lrcriterion: 0.050000 | runtime: 01:00:50 | bch(ms): 396.23 | smp(ms): 0.20 | fwd(ms): 51.42 | crit-fwd(ms): 5.32 | bwd(ms): 296.89 | optim(ms): 42.02 | loss: 155.53830 | train-TER: 85.68 | train-WER: 108.22 | validation.lst-loss: 59.90708 | validation.lst-TER: 84.86 | validation.lst-WER: 105.56 | avg-isz: 1288 | avg-tsz: 132 | max-tsz: 270 | hrs: 230.86 | thrpt(sec/sec): 227.67 epoch: 17 | nupdates: 143195 | lr: 0.200000 | lrcriterion: 0.050000 | runtime: 01:00:54 | bch(ms): 396.68 | smp(ms): 0.21 | fwd(ms): 51.61 | crit-fwd(ms): 5.39 | bwd(ms): 296.86 | optim(ms): 42.24 | loss: 154.04015 | train-TER: 85.43 | train-WER: 107.41 | validation.lst-loss: 60.48226 | validation.lst-TER: 84.44 | validation.lst-WER: 108.85 | avg-isz: 1288 | avg-tsz: 132 | max-tsz: 270 | hrs: 230.86 | thrpt(sec/sec): 227.42 epoch: 18 | nupdates: 152408 | lr: 0.200000 | lrcriterion: 0.050000 | runtime: 01:01:09 | bch(ms): 398.28 | smp(ms): 0.23 | fwd(ms): 51.56 | crit-fwd(ms): 5.38 | bwd(ms): 298.56 | optim(ms): 42.22 | loss: 152.61494 | train-TER: 85.44 | train-WER: 108.87 | validation.lst-loss: 59.25451 | validation.lst-TER: 85.29 | validation.lst-WER: 114.08 | avg-isz: 1288 | avg-tsz: 132 | max-tsz: 270 | hrs: 230.86 | thrpt(sec/sec): 226.50 epoch: 19 | nupdates: 161621 | lr: 0.200000 | lrcriterion: 0.050000 | runtime: 01:00:38 | bch(ms): 394.90 | smp(ms): 0.20 | fwd(ms): 51.22 | crit-fwd(ms): 5.31 | bwd(ms): 295.84 | optim(ms): 41.97 | loss: 151.35617 | train-TER: 85.40 | train-WER: 109.43 | validation.lst-loss: 57.08768 | validation.lst-TER: 85.29 | validation.lst-WER: 114.08 | avg-isz: 1288 | avg-tsz: 132 | max-tsz: 270 | hrs: 230.86 | thrpt(sec/sec): 228.44

tlikhomanenko commented 4 years ago

The info of params is here https://github.com/facebookresearch/wav2letter/blob/master/src/module/W2lModule.cpp#L115 TR embeddingDim mlpDim nHeads maxPositions dropout layerDropout usePreNormLayer (added to the arch wiki, let me know if more info on param you need).

About your loss and WER - loss is still very high. Which criterion and optimizer do you use? try the same init lr for both criterion and network.

samin9796 commented 4 years ago

@tlikhomanenko I am using the train_am_transformer_s2s.cfg one from the recipes. Criterion is transformer, netoptim and critoptim are both adagrad.

This is the train.cfg:

--runname=transformer_less --rundir=/data/ahnaf/wav2letter/dataset_prep/all_models/ --datadir=/data/ahnaf/wav2letter/dataset_prep/ --tokensdir=/data/ahnaf/wav2letter/dataset_prep/ --train=train.lst --valid=validation.lst --lexicon=/data/ahnaf/wav2letter/dataset_prep/lexicon.txt --input=wav --tokens=tokens.txt --archdir=/data/ahnaf/wav2letter/dataset_prep/ --arch=network_backup.arch --criterion=transformer --mfsc --am_decoder_tr_dropout=0.1 --am_decoder_tr_layerdrop=0.1 --am_decoder_tr_layers=6 --maxdecoderoutputlen=120 --labelsmooth=0.05 --dataorder=output_spiral --inputbinsize=25 --attnWindow=softPretrain --softwstd=4 --trainWithWindow=true --pretrainWindow=3 --attention=keyvalue --encoderdim=256 --memstepsize=5000000 --eostoken=true --pcttraineval=1 --pctteacherforcing=99 --sampletarget=0.01 --netoptim=adagrad --critoptim=adagrad --lr=0.2 --lrcrit=0.05 --linseg=0 --momentum=0.0 --maxgradnorm=0.1 --onorm=target --sqnorm --nthread=7 --batchsize=1 --filterbanks=80 --minloglevel=0 --enable_distributed --warmup=16000

tlikhomanenko commented 4 years ago

First - try latest master, we recently fixed some issue with lr setting for s2s models, then try to have the same lr and lrcrit. And you need to tune lr and warmup.

cc @syhw maybe you know what else can be done here.

samin9796 commented 4 years ago

@tlikhomanenko My audio files are between 11 sec to 15 sec. Here in the arch maxposition is set to 460. Does maxposition value need to be increased for that?

tlikhomanenko commented 4 years ago

We used this 460 for 36s audio, so it should be fine for you too. If you have longer input then it will crash so you will see this for sure.

tlikhomanenko commented 4 years ago

I just noticed that you have --maxgradnorm=15, possibly too large, try to set to 1 or even further 0.1.

Bernardo-Favoreto commented 4 years ago

You could at first start with this arch and see how it goes, and if overfitting happens for you on train. If it overfits then reduce number of layers, try larger dropout. At least number of layers and dropout are the first thing you need to try for your data. Also learning rate and warmup are important, so you need to tune them too.

@tlikhomanenko Is there a recommended way to tune it? For instance, I have around 100h of audio on my language and would like to use transformers. Should I reduce number of layers (because of less data)? Increase or decrease warmup/learning rate? And also, does batch size plays a major role in tuning these too? Thanks!

tlikhomanenko commented 4 years ago

You can decrease layers / increase dropout / increase specaug - you need to see if there is a huge overfitting with less data, this depends not only on the data size but also complexity of your audio. Warmup / lr you can try the same at first and then for sure tune lr. Warmup - depending if you see blowing up of the model. Batchsize could influence here on the lr values.

Bernardo-Favoreto commented 4 years ago

Thanks, Tatiana!

When you say increase specaug you mean the number of steps or its options (fmask, fmaskn, etc...)?

Also, "blowing up of the model" is the loss skyrocketing? It is a bit unclear to me how metrics behave during training. For instance, my models seem to get stuck in a loss value (lowest I got was ~16) while WER for both dev and train are also stuck in 100+ after 50+ epochs. My guess is I am not tuning parameters/architectures accordingly, but I am unsure since I don't know what to expect during training.

Finally, last epochs before I stopped training: date time epoch nupdates lr lrcriterion runtime bch(ms) smp(ms) fwd(ms) crit-fwd(ms) bwd(ms) optim(ms) loss train-TER train-WER lists/dev.lst-loss lists/dev.lst-TER lists/dev.lst-WER avg-isz avg-tsz max-tsz hrs thrpt(sec/sec) 2020-08-18 16:12:59 80 18866 0.300000 0.300000 00:03:29 1118.27 3.66 364.45 2.25 660.93 86.99 16.90906 72.68 106.15 16.24165 69.54 106.76 496 051 144 24.78 426.61 2020-08-18 16:16:33 81 19053 0.300000 0.300000 00:03:29 1117.78 4.32 362.85 2.26 661.62 86.97 16.88146 71.54 110.64 16.28924 68.78 105.81 496 051 144 24.78 426.80 2020-08-18 16:20:08 82 19240 0.300000 0.300000 00:03:28 1117.56 2.79 363.47 2.25 661.92 86.81 16.90848 72.19 107.01 16.72832 67.46 104.31 496 051 144 24.78 426.88 2020-08-18 16:23:43 83 19427 0.300000 0.300000 00:03:30 1123.53 4.56 365.41 2.22 665.88 86.82 16.89449 75.45 106.15 16.36798 68.44 104.85 496 051 144 24.78 424.61

tlikhomanenko commented 4 years ago

When you say increase specaug you mean the number of steps or its options (fmask, fmaskn, etc...)?

more augmentation, so change saug parameters fmask, etc.

At first you can disable specaug and tune parameters that on the train you get ~0 WER and ~0 loss. During training WER for some time on train stays 100+ while loss is going down, then loss continue to go down and wer goes down too. Blowing up could happens after warmup - loss is going up or not decreasing.

Bernardo-Favoreto commented 4 years ago

Would you say given that I have less data and a smaller model (currently using 16 transformer layers) with the same batch size as you guys (128) learning rate should be smaller? Also, regarding warmup and specaug I'm starting it at 24k steps.

tlikhomanenko commented 4 years ago

Learning rate depends on the data and model size too, you cannot say that for the same batch learning rate will be the same. Just monitor the training WER and dev WER, how it behaves. Also you can increase dropout in your model to avoid overfitting (if I forgot to mention it).

Bernardo-Favoreto commented 4 years ago

Right, thanks. I'll present to you some of the things I've encountered.

I've tried the Transformer architecture trained with CTC (am_transformer_ctc.arch) with both 7 and 16 layers of transformer (vs. the original 24). I have around 100h of Portuguese speech data and trained both models until they (at least is what it seemed) stopped learning. Here are their logs (I've uploaded them to avoid spamming here): Transformer 7 layers, batch size=88GPUs, warmup/specaug=16k steps, lr=0.4: https://justpaste.it/259op Transformer 16 layers, batch size =168GPUs, warmup/specaug=24k steps, lr=0.4: https://justpaste.it/53ko5 Some things that caught my attention: Even though the latter model is larger, its metrics got stuck on slightly worse values than the former (~50 vs. ~60 train-WER) but both of them seem like they reached their limits before I stopped training (the last epochs we can see the model is clearly stuck). Another comment is regarding the fact that dev-WER was always lower than train-WER, is this expected? I was wondering if you have any thoughts on the presented behavior, as I'm not sure what would be my best direction right now and I also don't know the expected behavior from training these models. Should I try the exact same architecture with the exact same hyper-parameters? But would this be appropriate as I have 1/10 of the librispeech data? It also seems a bit counterintuitive since my smaller model performed slightly better. Anyway, I'd be happy to hear any thoughts/comments regarding the presented behavior.

Thank you ever so much, Tatiana.

tlikhomanenko commented 4 years ago

dev-WER is lower because you have specaug, so train loss and WER is computed on the augmented data. Also to be sure that model is trained you need to look at dev WER, not on train.

About model capacity to prevent overfitting:

use smaller one, less number of parameters or
set larger dropout there (and layer drop)

I would suggest to have at first baseline which you train without spec aug and make sure on the training you can easily overfit.

Then if on train you can reach less than 50% WER, activate saug only then. Probably your warmup is also very short, or you need to reduce learning rate.

For larger model try to increase dropout.

Looks they converge, here you can start to reduce learning rate, as we did for our SOTA results, use lr_decay and lr_decay_step.

Bernardo-Favoreto commented 4 years ago

Thanks, Tatiana!

I was happy to see even though metrics weren't as good as I expected, the model still performs reasonably well. I find a bit weird that I can't get better metrics since I have ~100h of data and I've seen great results with WSJ dataset. What comes into mind is data quality or maybe difficulties of the language itself (Brazilian Portuguese)?

I resumed the 7 layer transformer model but with 0.4 LR for a while and then started to halve it. Indeed, the model improved a bit on dev-WER/loss, I will resume once again now with 0.1 LR and keep halving every 40 epochs.

Regarding LR decay: for both models, they were set exactly as the recipe came, but it didn't seem to work properly, maybe I need to adapt this schedule for less data?

Finally, regarding train-WER/loss I should not worry if they are stuck, as long as dev metrics are decreasing?

Thank you once again!

tlikhomanenko commented 4 years ago

@Bernardo-Favoreto always welcome!

I was happy to see even though metrics weren't as good as I expected, the model still performs reasonably well. I find a bit weird that I can't get better metrics since I have ~100h of data and I've seen great results with WSJ dataset. What comes into mind is data quality or maybe difficulties of the language itself (Brazilian Portuguese)?

Yep, this depends on a lot of factors, like language, audio quality (noisy conditions, reading or conversational speech), speakers, accents, lexicon.

I resumed the 7 layer transformer model but with 0.4 LR for a while and then started to halve it. Indeed, the model improved a bit on dev-WER/loss, I will resume once again now with 0.1 LR and keep halving every 40 epochs.

Nice

Regarding LR decay: for both models, they were set exactly as the recipe came, but it didn't seem to work properly, maybe I need to adapt this schedule for less data?

Our SOTA recipes is good start but it is optimized for specific dataset, so at least some training params, like start lr and warmup/augmentation start can be different. Also depends on how hard / less your data you need to have more regularization (like increasing dropout, reducing number of layers).

Finally, regarding train-WER/loss I should not worry if they are stuck, as long as dev metrics are decreasing?

Yep.

flashlight / wav2letter

Architecture related issue #667