flashlight / wav2letter

Facebook AI Research's Automatic Speech Recognition Toolkit
https://github.com/facebookresearch/wav2letter/wiki

Reproduction of AM training (seq2seq_tds, librispeech) #392

Closed hiroaki-ogawa closed 5 years ago

hiroaki-ogawa commented 5 years ago

Hi there, thank you for sharing this great work!

The AM I trained with the seq2seq_tds recipe gives poor WER, while the pre-trained AM performs well with the same decoder settings:

AM                          LM                         WER
pre-trained AM              pre-trained LM (4-gram)    4.32
reproduced AM (epoch 94)    pre-trained LM (4-gram)    23.7

I used network.arch and train.cfg from wav2letter/recipes/models/seq2seq_tds/librispeech/. The following flags in train.cfg were modified:

#--archdir=[...]
--archdir=.
#--train=[DATA_DST]/lists/train-clean-100.lst,[DATA_DST]/lists/train-clean-360.lst,[DATA_DST]/lists/train-other-500.lst
--train=./librispeech/lists/train-clean-100.lst,./librispeech/lists/train-clean-360.lst,./librispeech/lists/train-other-500.lst
--valid=dev-clean:./librispeech/lists/dev-clean.lst,dev-other:./librispeech/lists/dev-other.lst
#--lexicon=[MODEL_DST]/am/librispeech-train+dev-unigram-10000-nbest10.lexicon
--lexicon=seq2seq_tds_librispeech/am/lexicon-train+dev-unigram-10000-nbest10.lexicon
#--tokensdir=[MODEL_DST]/am
--tokensdir=seq2seq_tds_librispeech/am
--tokens=tokens-train-all-unigram-10000.tokens
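
The edits above mostly just point the recipe's [DATA_DST] and [MODEL_DST] placeholders at local paths. A minimal sketch of that substitution (the local paths and the output name train_local.cfg are hypothetical):

```python
# Sketch: fill the [DATA_DST]/[MODEL_DST] placeholders in the recipe's
# train.cfg with local paths (both paths below are hypothetical examples).
from pathlib import Path

DATA_DST = "./librispeech"               # where prepare.py wrote the lists
MODEL_DST = "./seq2seq_tds_librispeech"  # where prepare.py wrote lexicon/tokens

cfg = Path("train.cfg").read_text()
cfg = cfg.replace("[DATA_DST]", DATA_DST).replace("[MODEL_DST]", MODEL_DST)
Path("train_local.cfg").write_text(cfg)
```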

The lists, lexicon, and tokens were generated by recipes/models/seq2seq_tds/librispeech/prepare.py.

The network.arch linked from recipes/models/seq2seq_tds/librispeech/README.md is slightly different from recipes/models/seq2seq_tds/librispeech/network.arch. I used the latter.

This is the last part of the training log; the WER is still very high.

epoch:       93 | lr: 0.012500 | lrcriterion: 0.012500 | runtime: 00:43:29 | bch(ms): 593.84 | smp(ms): 1.16 | fwd(ms): 251.97 | crit-fwd(ms): 33.61 | bwd(ms): 322.98 | optim(ms): 10.28 | loss:  101.37423 | train-LER: 16.15 | train-WER: 19.08 | dev-clean-loss:   12.72969 | dev-clean-LER: 13.56 | dev-clean-WER: 16.58 | dev-other-loss:   22.85048 | dev-other-LER: 15.24 | dev-other-WER: 23.04 | avg-isz: 1229 | avg-tsz: 135 | max-tsz: 323 | hrs:  960.78 | thrpt(sec/sec): 1325.25
epoch:       94 | lr: 0.012500 | lrcriterion: 0.012500 | runtime: 00:43:33 | bch(ms): 594.63 | smp(ms): 1.16 | fwd(ms): 253.00 | crit-fwd(ms): 32.74 | bwd(ms): 322.81 | optim(ms): 10.28 | loss:  101.30106 | train-LER: 16.15 | train-WER: 19.11 | dev-clean-loss:   12.71779 | dev-clean-LER: 13.52 | dev-clean-WER: 16.55 | dev-other-loss:   23.03816 | dev-other-LER: 15.21 | dev-other-WER: 22.93 | avg-isz: 1229 | avg-tsz: 135 | max-tsz: 323 | hrs:  960.78 | thrpt(sec/sec): 1323.48

I wonder where I went wrong. Thank you.

tlikhomanenko commented 5 years ago

Hi @hiroaki-ogawa,

Are you running training with 8 GPUs?

hiroaki-ogawa commented 5 years ago

Hi @tlikhomanenko,

No, 4 GPUs (GTX-1080Ti x 4).

nimz commented 5 years ago

I have similar problems with training as well (see #395). I believe recipes/models/seq2seq_tds/librispeech/network.arch is correct rather than the one linked from the README, so I am uncertain what causes the discrepancy.

hiroaki-ogawa commented 5 years ago

Hi @nimz , @tlikhomanenko ,

The AM training itself seems fine, because a Japanese character AM trained with the same training parameters (except for the data, lexicon, and tokens) performs well with a character 4-gram LM. I suspect a mismatch between my AM settings and the pre-trained LM.

The following are examples of my list, lexicon, and tokens files. Do they look the same as the ones in your environment?

head -3 train-clean-100.lst

train-clean-100-103-1240-0000 /cache/data/LibriSpeech/train-clean-100/103/1240/103-1240-0000.flac 14085.0 chapter one missus rachel lynde is surprised missus rachel lynde lived just where the avonlea main road dipped down into a little hollow fringed with alders and ladies eardrops and traversed by a brook
train-clean-100-103-1240-0001 /cache/data/LibriSpeech/train-clean-100/103/1240/103-1240-0001.flac 15945.0 that had its source away back in the woods of the old cuthbert place it was reputed to be an intricate headlong brook in its earlier course through those woods with dark secrets of pool and cascade but by the time it reached lynde's hollow it was a quiet well conducted little stream
train-clean-100-103-1240-0002 /cache/data/LibriSpeech/train-clean-100/103/1240/103-1240-0002.flac 13945.0 for not even a brook could run past missus rachel lynde's door without due regard for decency and decorum it probably was conscious that missus rachel was sitting at her window keeping a sharp eye on everything that passed from brooks and children up

head -3 lexicon-train+dev-unigram-10000-nbest10.lexicon

a   _ a
a   _a
a'azam  _ a ' a z a m

head -5 seq2seq_tds_librispeech/am/tokens-train-all-unigram-10000.tokens

_the
_and
_of
_to
_a
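
As a quick consistency check on these files, here is a minimal sketch (not part of the recipe; the paths follow the file names shown above and may need adjusting) that verifies every spelling piece in the lexicon is a known token and every transcript word in a list file has a lexicon entry:

```python
# Sanity-check sketch for the list/lexicon/tokens files shown above.
lexicon_path = "seq2seq_tds_librispeech/am/lexicon-train+dev-unigram-10000-nbest10.lexicon"
tokens_path = "seq2seq_tds_librispeech/am/tokens-train-all-unigram-10000.tokens"
list_path = "./librispeech/lists/train-clean-100.lst"

# Tokens file: one token per line.
tokens = {line.strip() for line in open(tokens_path) if line.strip()}

# Lexicon: first field is the word, the remaining fields are the spelling pieces.
lexicon = {}
for line in open(lexicon_path):
    fields = line.split()
    if not fields:
        continue
    word, pieces = fields[0], fields[1:]
    lexicon.setdefault(word, []).append(pieces)
    unknown = [p for p in pieces if p not in tokens]
    if unknown:
        print(f"lexicon entry '{word}' uses unknown pieces: {unknown}")

# List file: <id> <audio path> <duration> <transcript words...>
missing = set()
for line in open(list_path):
    for word in line.split()[3:]:
        if word not in lexicon:
            missing.add(word)
print(f"{len(missing)} transcript words are missing from the lexicon")
```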
nimz commented 5 years ago

Yes, I ran those three commands to check the list, lexicons, and tokens and get the same outputs.

By the way, if you still have the training log would it be possible to post the output from the first 10 epochs? I just want to cross-check with my log from #388 to see if they are consistent and rule out some possibilities.

hiroaki-ogawa commented 5 years ago

@nimz ,

This is the head of my training log:

epoch:        1 | lr: 0.050000 | lrcriterion: 0.050000 | runtime: 00:43:39 | bch(ms): 596.13 | smp(ms): 1.13 | fwd(ms): 253.06 | crit-fwd(ms): 33.48 | bwd(ms): 322.06 | optim(ms): 10.47 | loss:  561.57658 | train-LER: 253.29 | train-WER: 388.09 | dev-clean-loss:  235.81769 | dev-clean-LER: 407.98 | dev-clean-WER: 589.89 | dev-other-loss:  206.67301 | dev-other-LER: 474.91 | dev-other-WER: 669.18 | avg-isz: 1229 | avg-tsz: 135 | max-tsz: 323 | hrs:  960.78 | thrpt(sec/sec): 1320.15
epoch:        2 | lr: 0.050000 | lrcriterion: 0.050000 | runtime: 00:43:30 | bch(ms): 593.91 | smp(ms): 1.13 | fwd(ms): 247.81 | crit-fwd(ms): 32.79 | bwd(ms): 325.58 | optim(ms): 10.28 | loss:  436.51994 | train-LER: 253.29 | train-WER: 388.09 | dev-clean-loss:  188.60904 | dev-clean-LER: 407.98 | dev-clean-WER: 589.89 | dev-other-loss:  168.23695 | dev-other-LER: 474.91 | dev-other-WER: 669.18 | avg-isz: 1229 | avg-tsz: 135 | max-tsz: 323 | hrs:  960.78 | thrpt(sec/sec): 1325.09
epoch:        3 | lr: 0.050000 | lrcriterion: 0.050000 | runtime: 00:43:29 | bch(ms): 593.83 | smp(ms): 1.15 | fwd(ms): 249.30 | crit-fwd(ms): 32.83 | bwd(ms): 323.97 | optim(ms): 10.29 | loss:  295.13772 | train-LER: 253.29 | train-WER: 388.09 | dev-clean-loss:  121.92147 | dev-clean-LER: 407.98 | dev-clean-WER: 589.89 | dev-other-loss:  128.89053 | dev-other-LER: 474.91 | dev-other-WER: 669.18 | avg-isz: 1229 | avg-tsz: 135 | max-tsz: 323 | hrs:  960.78 | thrpt(sec/sec): 1325.28
epoch:        4 | lr: 0.050000 | lrcriterion: 0.050000 | runtime: 00:43:09 | bch(ms): 589.22 | smp(ms): 1.13 | fwd(ms): 247.83 | crit-fwd(ms): 32.58 | bwd(ms): 322.82 | optim(ms): 10.28 | loss:  206.21562 | train-LER: 50.54 | train-WER: 73.08 | dev-clean-loss:   43.07734 | dev-clean-LER: 25.53 | dev-clean-WER: 42.65 | dev-other-loss:   59.84784 | dev-other-LER: 33.95 | dev-other-WER: 57.69 | avg-isz: 1229 | avg-tsz: 135 | max-tsz: 323 | hrs:  960.78 | thrpt(sec/sec): 1335.64
epoch:        5 | lr: 0.050000 | lrcriterion: 0.050000 | runtime: 00:43:24 | bch(ms): 592.72 | smp(ms): 1.14 | fwd(ms): 250.39 | crit-fwd(ms): 32.95 | bwd(ms): 323.52 | optim(ms): 10.28 | loss:  172.13505 | train-LER: 37.31 | train-WER: 56.88 | dev-clean-loss:   35.02143 | dev-clean-LER: 24.84 | dev-clean-WER: 39.53 | dev-other-loss:   46.89210 | dev-other-LER: 33.46 | dev-other-WER: 55.53 | avg-isz: 1229 | avg-tsz: 135 | max-tsz: 323 | hrs:  960.78 | thrpt(sec/sec): 1327.76
epoch:        6 | lr: 0.050000 | lrcriterion: 0.050000 | runtime: 00:43:17 | bch(ms): 590.98 | smp(ms): 1.17 | fwd(ms): 247.89 | crit-fwd(ms): 33.24 | bwd(ms): 324.23 | optim(ms): 10.27 | loss:  156.29622 | train-LER: 33.86 | train-WER: 51.46 | dev-clean-loss:   28.48777 | dev-clean-LER: 19.86 | dev-clean-WER: 30.87 | dev-other-loss:   41.67006 | dev-other-LER: 27.45 | dev-other-WER: 45.11 | avg-isz: 1229 | avg-tsz: 135 | max-tsz: 323 | hrs:  960.78 | thrpt(sec/sec): 1331.66
epoch:        7 | lr: 0.050000 | lrcriterion: 0.050000 | runtime: 00:43:29 | bch(ms): 593.85 | smp(ms): 1.17 | fwd(ms): 250.68 | crit-fwd(ms): 33.82 | bwd(ms): 324.29 | optim(ms): 10.28 | loss:  147.08287 | train-LER: 29.97 | train-WER: 45.31 | dev-clean-loss:   24.70496 | dev-clean-LER: 19.46 | dev-clean-WER: 29.63 | dev-other-loss:   36.79603 | dev-other-LER: 27.04 | dev-other-WER: 44.65 | avg-isz: 1229 | avg-tsz: 135 | max-tsz: 322 | hrs:  960.78 | thrpt(sec/sec): 1325.22
epoch:        8 | lr: 0.050000 | lrcriterion: 0.050000 | runtime: 00:43:17 | bch(ms): 590.97 | smp(ms): 1.13 | fwd(ms): 248.02 | crit-fwd(ms): 33.53 | bwd(ms): 324.10 | optim(ms): 10.29 | loss:  140.45986 | train-LER: 27.56 | train-WER: 40.99 | dev-clean-loss:   22.59436 | dev-clean-LER: 17.87 | dev-clean-WER: 26.41 | dev-other-loss:   33.99397 | dev-other-LER: 24.53 | dev-other-WER: 40.27 | avg-isz: 1229 | avg-tsz: 135 | max-tsz: 322 | hrs:  960.78 | thrpt(sec/sec): 1331.69
epoch:        9 | lr: 0.050000 | lrcriterion: 0.050000 | runtime: 00:43:27 | bch(ms): 593.20 | smp(ms): 1.13 | fwd(ms): 252.04 | crit-fwd(ms): 33.42 | bwd(ms): 322.34 | optim(ms): 10.27 | loss:  135.38188 | train-LER: 26.40 | train-WER: 38.58 | dev-clean-loss:   21.89793 | dev-clean-LER: 16.61 | dev-clean-WER: 24.50 | dev-other-loss:   34.64992 | dev-other-LER: 21.85 | dev-other-WER: 36.13 | avg-isz: 1229 | avg-tsz: 135 | max-tsz: 323 | hrs:  960.78 | thrpt(sec/sec): 1326.67
epoch:       10 | lr: 0.050000 | lrcriterion: 0.050000 | runtime: 00:43:39 | bch(ms): 596.05 | smp(ms): 1.14 | fwd(ms): 255.04 | crit-fwd(ms): 32.45 | bwd(ms): 322.15 | optim(ms): 10.27 | loss:  131.94298 | train-LER: 25.29 | train-WER: 36.80 | dev-clean-loss:   20.12810 | dev-clean-LER: 16.26 | dev-clean-WER: 23.28 | dev-other-loss:   32.30247 | dev-other-LER: 21.30 | dev-other-WER: 35.04 | avg-isz: 1229 | avg-tsz: 135 | max-tsz: 323 | hrs:  960.78 | thrpt(sec/sec): 1320.34
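
For cross-checking runs like this, a minimal sketch that pulls the per-epoch dev WER out of the training log and plots it; the log path is a hypothetical placeholder, and the parsing assumes the epoch-line format shown above:

```python
# Sketch: extract per-epoch dev WER from a wav2letter training log and plot it.
import re
import matplotlib.pyplot as plt

LOG_PATH = "train_run/001_log"  # hypothetical path to the training log

epochs, dev_clean, dev_other = [], [], []
with open(LOG_PATH) as f:
    for line in f:
        m = re.search(r"epoch:\s*(\d+)", line)
        if not m:
            continue
        epochs.append(int(m.group(1)))
        dev_clean.append(float(re.search(r"dev-clean-WER:\s*([\d.]+)", line).group(1)))
        dev_other.append(float(re.search(r"dev-other-WER:\s*([\d.]+)", line).group(1)))

plt.plot(epochs, dev_clean, label="dev-clean WER")
plt.plot(epochs, dev_other, label="dev-other WER")
plt.xlabel("epoch")
plt.ylabel("WER (%)")
plt.legend()
plt.savefig("dev_wer.png")
```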
nimz commented 5 years ago

Thanks. That's interesting; the loss, and in fact all the metrics, are significantly higher than in my first 10 epochs. It could just be random variation between training runs; it's hard to say. I will run it again to check.

tlikhomanenko commented 5 years ago

Hi @nimz, @hiroaki-ogawa, the lexicon, tokens, and list files look correct. Regarding the arch file: the one in the repository, recipes/models/seq2seq_tds/librispeech/network.arch, is correct. I will fix the file behind the link.

I suspect a mismatch between my AM settings and the pre-trained LM.

@hiroaki-ogawa I would expect that if you re-train the model and get similar LER/WER on the dev sets, you can reach results similar to the paper using the provided pre-trained LM and decoder params. Right now the AM still looks very poor and is far from the number reported in the paper (for the model without an external LM); even the train WER/LER is too high. Could you plot the training/dev loss, WER, and LER from your logs and attach the graphs?

@nimz Are you running on 7 GPUs? I would expect some difference in the logs, at least during the first epochs, if you train with a different number of GPUs.

I will recheck against our logs.

nimz commented 5 years ago

Yes, I am running on 7 GPUs.

Also, @hiroaki-ogawa's results were at epoch 94, while the AM in the paper was trained for 200 epochs I believe? Could that explain the difference?

hiroaki-ogawa commented 5 years ago

Hi @tlikhomanenko , @nimz

Here are plots of the loss, WER, and LER up to epoch 137.

[Plots: loss per epoch, WER per epoch, LER per epoch]

The following is an enlarged view of the LER. [Plot: LER, epochs 70-137]

Also, @hiroaki-ogawa's results were at epoch 94, while the AM in the paper was trained for 200 epochs I believe? Could that explain the difference?

Since the loss and WER were plateauing at too high a level, I stopped the training after epoch 94 and wrote this issue. After that, I resumed the training in continue mode up to epoch 137 and stopped again.

tlikhomanenko commented 5 years ago

Hi @hiroaki-ogawa, @nimz,

I pushed a fix for the lexicon file generation in cfbd32cad980af9a90bb7494da08a79588b39a9b. The order of spellings matters for training because of the way we sample the segmentation of the words. Could you rerun prepare.py (or use the lexicon from the link in the README.md) and then run the training again?
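
One way to inspect the effect of the fix is to compare the per-word spellings, and their order, between the old and regenerated lexicons. A minimal sketch, where old.lexicon and fixed.lexicon are hypothetical placeholders for the two files:

```python
# Sketch: report words whose spelling set or spelling order differs
# between two lexicon files.
from collections import OrderedDict

def load_lexicon(path):
    spellings = OrderedDict()  # word -> spellings, in file order
    with open(path) as f:
        for line in f:
            fields = line.split()
            if fields:
                spellings.setdefault(fields[0], []).append(tuple(fields[1:]))
    return spellings

old = load_lexicon("old.lexicon")    # lexicon generated before the fix
new = load_lexicon("fixed.lexicon")  # lexicon regenerated after the fix

changed = [w for w in old if w in new and old[w] != new[w]]
print(f"{len(changed)} words have different spellings or spelling order")
for w in changed[:5]:
    print(w, "| old:", old[w], "| new:", new[w])
```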

hiroaki-ogawa commented 5 years ago

Hi @tlikhomanenko ,

It looks fine! The training with the fixed lexicon is still only at epoch 8, but it already shows much better results.

Thank you

with old lexicon:

epoch:        8 | lr: 0.050000 | lrcriterion: 0.050000 | runtime: 00:43:17 | bch(ms): 590.97 | smp(ms): 1.13 | fwd(ms): 248.02 | crit-fwd(ms): 33.53 | bwd(ms): 324.10 | optim(ms): 10.29 | loss:  140.45986 | train-LER: 27.56 | train-WER: 40.99 | dev-clean-loss:   22.59436 | dev-clean-LER: 17.87 | dev-clean-WER: 26.41 | dev-other-loss:   33.99397 | dev-other-LER: 24.53 | dev-other-WER: 40.27 | avg-isz: 1229 | avg-tsz: 135 | max-tsz: 322 | hrs:  960.78 | thrpt(sec/sec): 1331.69

with fixed lexicon:

epoch:        8 | lr: 0.050000 | lrcriterion: 0.050000 | runtime: 00:38:12 | bch(ms): 521.63 | smp(ms): 1.14 | fwd(ms): 215.34 | crit-fwd(ms): 10.08 | bwd(ms): 291.81 | optim(ms): 10.33 | loss:   58.94729 | train-LER:  9.20 | train-WER: 18.51 | dev-clean-loss:   13.49021 | dev-clean-LER:  5.06 | dev-clean-WER: 11.58 | dev-other-loss:   22.02445 | dev-other-LER: 13.61 | dev-other-WER: 26.13 | avg-isz: 1229 | avg-tsz: 042 | max-tsz: 095 | hrs:  960.78 | thrpt(sec/sec): 1508.70
nimz commented 5 years ago

@tlikhomanenko, thanks so much for your help! Using the fixed lexicon I am able to train in around 10 min per epoch on 7 GPUs, and after evaluating using the pretrained convolutional LM I get 3.30% WER on test-clean and 10.11% on test-other, which seems within random variation. This is great!

tlikhomanenko commented 5 years ago

Closing the issue. Feel free to reopen if needed.

hiroaki-ogawa commented 5 years ago

Hi @tlikhomanenko ,

Just for the record, I got 4.32 WER on test-other with the n-gram LM using the fixed lexicon. Thank you so much for your help!

tlikhomanenko commented 5 years ago

Hi @hiroaki-ogawa, I think you meant test-clean, right?

hiroaki-ogawa commented 5 years ago

Oops, yes, I meant test-clean.