flashlight / wav2letter

Facebook AI Research's Automatic Speech Recognition Toolkit
https://github.com/facebookresearch/wav2letter/wiki

What WER can you get when training streaming_convnets on only the 1k hours of LibriSpeech data? #719

Open qzfnihao opened 4 years ago

qzfnihao commented 4 years ago

I tried to reproduce streaming_convnets on LibriSpeech with a 4-GPU machine. Training on all the data including Libri-Light is hard for me, so I only used the 1k hours from LibriSpeech. However, the WER I get on dev-clean and dev-other is worse than the result in https://arxiv.org/pdf/1911.08460.pdf: Table 5 reports a dev-other WER of 11.16 without decoding, while I get a dev-other WER of 17.10 after 78 epochs.

My training arch uses the default "am_500ms_future_context.arch":
V -1 NFEAT 1 0 SAUG 80 27 2 100 1.0 2 PD 0 5 3 C2 1 15 10 1 2 1 0 0 R DO 0.1 LN 1 2 TDS 15 9 80 0.1 0 1 0 TDS 15 9 80 0.1 0 1 0 PD 0 7 1 C2 15 19 10 1 2 1 0 0 R DO 0.1 LN 1 2 TDS 19 9 80 0.1 0 1 0 TDS 19 9 80 0.1 0 1 0 TDS 19 9 80 0.1 0 1 0 PD 0 9 1 C2 19 23 12 1 2 1 0 0 R DO 0.1 LN 1 2 TDS 23 11 80 0.1 0 1 0 TDS 23 11 80 0.1 0 1 0 TDS 23 11 80 0.1 0 1 0 TDS 23 11 80 0.1 0 0 0 PD 0 10 0 C2 23 27 11 1 1 1 0 0 R DO 0.1 LN 1 2 TDS 27 11 80 0.1 0 0 0 TDS 27 11 80 0.1 0 0 0 TDS 27 11 80 0.1 0 0 0 TDS 27 11 80 0.1 0 0 0 TDS 27 11 80 0.1 0 0 0 RO 2 1 0 3 V 2160 -1 1 0 L 2160 NLABEL V NLABEL 0 -1 1

My training flags file is:
--runname=inference_2019 --rundir=/root/wav2letter.debug/recipes/models/streaming_convnets/librispeech --tokensdir=/root/wav2letter.debug/recipes/models/streaming_convnets/librispeech/models/am --archdir=/root/wav2letter.debug/recipes/models/streaming_convnets/librispeech --train=/root/librispeech/lists/train-clean-100.lst,/root/librispeech/lists/train-clean-360.lst,/root/librispeech/lists/train-other-500.lst --valid=dev-clean:/root/librispeech/lists/dev-clean.lst,dev-other:/root/librispeech/lists/dev-other.lst --lexicon=/root/wav2letter.debug/recipes/models/streaming_convnets/librispeech/models/am/librispeech-train+dev-unigram-10000-nbest10.lexicon --arch=am_500ms_futurecontext.arch --tokens=librispeech-train-all-unigram-10000.tokens --criterion=ctc --batchsize=8 --lr=0.4 --momentum=0.0 --maxgradnorm=0.5 --reportiters=1000 --nthread=6 --mfsc=true --usewordpiece=true --wordseparator= --filterbanks=80 --minisz=200 --mintsz=2 --maxisz=33000 --enable_distributed=true --pcttraineval=1 --minloglevel=0 --logtostderr --onorm=target --sqnorm --localnrmlleftctx=300
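Since the run command below passes these flags via --flagsfile, they would normally be stored in a .cfg file with one flag per line (standard gflags flagfile format). A minimal sketch of the top of that file, using the same values as above:

```
# train_am_500ms_future_context.local.cfg -- same flags as above, one per line
--runname=inference_2019
--rundir=/root/wav2letter.debug/recipes/models/streaming_convnets/librispeech
--arch=am_500ms_futurecontext.arch
--tokens=librispeech-train-all-unigram-10000.tokens
--criterion=ctc
--batchsize=8
--lr=0.4
# ... remaining flags exactly as listed above
```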

My run command is: mpirun -n 4 --allow-run-as-root /root/wav2letter/build/Train train --flagsfile=train_am_500ms_future_context.local.cfg
I use the CUDA docker image (image id: 536665b6d0e7).
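For a long run like this, the Train binary also has a continue mode that resumes from the checkpoint in an existing run directory. A rough sketch assuming the rundir/runname layout from the flags above (check Train --help for the exact usage in your wav2letter version):

```
# resume training from the last checkpoint saved under <rundir>/<runname>
mpirun -n 4 --allow-run-as-root /root/wav2letter/build/Train continue \
    /root/wav2letter.debug/recipes/models/streaming_convnets/librispeech/inference_2019 \
    --flagsfile=train_am_500ms_future_context.local.cfg
```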

I get training logs like this:
epoch: 79 | nupdates: 686000 | lr: 0.400000 | lrcriterion: 0.000000 | runtime: 00:05:39 | bch(ms): 339.41 | smp(ms): 1.78 | fwd(ms): 100.31 | crit-fwd(ms): 10.02 | bwd(ms): 213.69 | optim(ms): 23.15 | loss: 4.90068 | train-TER: 11.97 | train-WER: 18.47 | dev-clean-loss: 1.55145 | dev-clean-TER: 3.17 | dev-clean-WER: 7.24 | dev-other-loss: 3.53614 | dev-other-TER: 8.94 | dev-other-WER: 17.02 | avg-isz: 1238 | avg-tsz: 046 | max-tsz: 074 | hrs: 110.13 | thrpt(sec/sec): 1168.05
epoch: 79 | nupdates: 687000 | lr: 0.400000 | lrcriterion: 0.000000 | runtime: 00:05:34 | bch(ms): 334.66 | smp(ms): 0.71 | fwd(ms): 98.45 | crit-fwd(ms): 9.89 | bwd(ms): 211.88 | optim(ms): 23.13 | loss: 4.88876 | train-TER: 12.07 | train-WER: 18.69 | dev-clean-loss: 1.56346 | dev-clean-TER: 3.16 | dev-clean-WER: 7.20 | dev-other-loss: 3.56405 | dev-other-TER: 8.90 | dev-other-WER: 16.85 | avg-isz: 1214 | avg-tsz: 045 | max-tsz: 077 | hrs: 107.93 | thrpt(sec/sec): 1160.98
epoch: 79 | nupdates: 688000 | lr: 0.400000 | lrcriterion: 0.000000 | runtime: 00:05:33 | bch(ms): 333.08 | smp(ms): 0.70 | fwd(ms): 97.79 | crit-fwd(ms): 9.79 | bwd(ms): 210.92 | optim(ms): 23.18 | loss: 5.00854 | train-TER: 9.29 | train-WER: 14.68 | dev-clean-loss: 1.57584 | dev-clean-TER: 3.19 | dev-clean-WER: 7.29 | dev-other-loss: 3.61315 | dev-other-TER: 8.99 | dev-other-WER: 17.03 | avg-isz: 1206 | avg-tsz: 045 | max-tsz: 075 | hrs: 107.27 | thrpt(sec/sec): 1159.36
epoch: 79 | nupdates: 689000 | lr: 0.400000 | lrcriterion: 0.000000 | runtime: 00:05:31 | bch(ms): 331.00 | smp(ms): 0.69 | fwd(ms): 96.90 | crit-fwd(ms): 9.75 | bwd(ms): 209.82 | optim(ms): 23.13 | loss: 4.97294 | train-TER: 14.93 | train-WER: 22.65 | dev-clean-loss: 1.54140 | dev-clean-TER: 3.17 | dev-clean-WER: 7.28 | dev-other-loss: 3.53977 | dev-other-TER: 9.02 | dev-other-WER: 17.10 | avg-isz: 1194 | avg-tsz: 045 | max-tsz: 079 | hrs: 106.18 | thrpt(sec/sec): 1154.80

Is there anything wrong with my training? How can I reproduce a model with good performance using TDS-CTC?

lunixbochs commented 4 years ago

The TDS CTC model is different from streaming convnets:

https://github.com/facebookresearch/wav2letter/blob/master/recipes/models/sota/2019/am_arch/am_tds_ctc.arch
https://github.com/facebookresearch/wav2letter/blob/master/recipes/models/sota/2019/librispeech/train_am_tds_ctc.cfg

TDS CTC is 1.6GB, while sconv is only 400MB. Your results look fine for the data size you put in.

If you want to take advantage of the librivox data, try transfer learning: https://github.com/facebookresearch/wav2letter/issues/577
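The usual way to warm-start from a pretrained acoustic model in wav2letter is the Train binary's fork mode. A rough sketch, assuming a downloaded pretrained am.bin (the path is a placeholder) and the flags file from above; see issue #577 and Train --help for the exact semantics:

```
# fork a new run from a pretrained acoustic model checkpoint (paths are placeholders)
mpirun -n 4 --allow-run-as-root /root/wav2letter/build/Train fork \
    /path/to/pretrained_am.bin \
    --flagsfile=train_am_500ms_future_context.local.cfg \
    --lr=0.1  # hypothetical smaller learning rate for fine-tuning
```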

qzfnihao commented 4 years ago

> The TDS CTC model is different from streaming convnets:
>
> https://github.com/facebookresearch/wav2letter/blob/master/recipes/models/sota/2019/am_arch/am_tds_ctc.arch
> https://github.com/facebookresearch/wav2letter/blob/master/recipes/models/sota/2019/librispeech/train_am_tds_ctc.cfg
>
> TDS CTC is 1.6GB, while sconv is only 400MB. Your results look fine for the data size you put in.
>
> If you want to take advantage of the librivox data, try transfer learning: #577

I notice that streaming_convnets seems to borrow from EfficientNet somehow. If I make the TDS layers used in streaming convnets deeper and wider, might the result be better?

qzfnihao commented 4 years ago

I tried more TDS layers, but only got a small improvement. The arch adds one more convolution/TDS group with 31 channels, so the final linear input grows from 27×80 = 2160 to 31×80 = 2480:
V -1 NFEAT 1 0 SAUG 80 27 2 100 1.0 2 PD 0 5 3 C2 1 15 10 1 2 1 0 0 R DO 0.1 LN 1 2 TDS 15 9 80 0.1 0 1 0 TDS 15 9 80 0.1 0 1 0 PD 0 7 1 C2 15 19 10 1 2 1 0 0 R DO 0.1 LN 1 2 TDS 19 9 80 0.1 0 1 0 TDS 19 9 80 0.1 0 1 0 TDS 19 9 80 0.1 0 1 0 PD 0 9 1 C2 19 23 12 1 2 1 0 0 R DO 0.1 LN 1 2 TDS 23 11 80 0.1 0 1 0 TDS 23 11 80 0.1 0 1 0 TDS 23 11 80 0.1 0 1 0 TDS 23 11 80 0.1 0 0 0 PD 0 10 0 C2 23 27 11 1 1 1 0 0 R DO 0.1 LN 1 2 TDS 27 11 80 0.1 0 0 0 TDS 27 11 80 0.1 0 0 0 TDS 27 11 80 0.1 0 0 0 TDS 27 11 80 0.1 0 0 0 TDS 27 11 80 0.1 0 0 0 PD 0 10 0 C2 27 31 11 1 1 1 0 0 R DO 0.1 LN 1 2 TDS 31 11 80 0.1 0 0 0 TDS 31 11 80 0.1 0 0 0 TDS 31 11 80 0.1 0 0 0 TDS 31 11 80 0.1 0 0 0 TDS 31 11 80 0.1 0 0 0 TDS 31 11 80 0.1 0 0 0 RO 2 1 0 3 V 2480 -1 1 0 L 2480 NLABEL V NLABEL 0 -1 1

I got WER 13.3285, LER 7.17963 on test-other with this arch, versus WER 13.871, LER 7.62503 with the default streaming_convnets arch.

vineelpratap commented 4 years ago

Hi, the train-TER in your post seems a bit high. You might want to try tuning --momentum and --dropout, and also halving the learning rate every n (say 100) epochs using the --lrdecay and --lr_decay_step options.
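A rough sketch of how those suggestions might look as additions to the training cfg, echoing the flag names given above; the values are placeholders and the exact flag names and semantics should be verified against Train --help:

```
# hypothetical fine-tuning additions to train_am_500ms_future_context.local.cfg
--momentum=0.5        # placeholder value; the suggestion is only to tune it
--dropout=0.2         # placeholder value
--lrdecay=100         # LR decay flags as named above (verify exact spelling)
--lr_decay_step=100   # e.g. halve the LR roughly every 100 epochs
```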

Also, note that the models in https://arxiv.org/pdf/1911.08460.pdf are non-streaming and large, so we can't really compare these directly. However, I think you should be able to get into the 10-11 range for Viterbi WER with good fine-tuning.
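Viterbi WER here is the greedy best-path number with no external language model; the decoded results in the paper come from a separate beam-search decode with an LM. A rough sketch of such a decode run, with the Decoder binary and flag names as I understand them (paths and values are placeholders; the decode cfg shipped with the recipe is the source of truth):

```
# hypothetical LM beam-search decode to go beyond the Viterbi numbers
/root/wav2letter/build/Decoder \
    --am=/path/to/trained_am.bin \
    --tokens=librispeech-train-all-unigram-10000.tokens \
    --lexicon=/path/to/librispeech-train+dev-unigram-10000-nbest10.lexicon \
    --lm=/path/to/ngram_lm.bin \
    --test=/root/librispeech/lists/test-other.lst \
    --lmweight=1.0 --wordscore=1.0 --beamsize=500 --decodertype=wrd
```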