Hi @hiroaki-ogawa,
Are you running training with 8 GPUs?
Hi @tlikhomanenko,
No, 4 GPUs (GTX-1080Ti x 4).
I have similar problems with training as well (see #395 ). I believe the recipes/models/seq2seq_tds/librispeech/network.arch is correct rather than the linked one from the README, so I am uncertain what the cause of the discrepancy is.
Hi @nimz , @tlikhomanenko ,
The AM training itself looks fine, because a Japanese character AM trained with the same training parameters (except for the data, lexicon, and tokens) performs well with a character 4-gram LM. I suspect a mismatch between my AM settings and the pre-trained LM.
The following are the heads of my list, lexicon, and tokens files (a rough consistency-check sketch follows the excerpts). Do they look the same as the ones in your environment?
head -3 train-clean-100.lst
train-clean-100-103-1240-0000 /cache/data/LibriSpeech/train-clean-100/103/1240/103-1240-0000.flac 14085.0 chapter one missus rachel lynde is surprised missus rachel lynde lived just where the avonlea main road dipped down into a little hollow fringed with alders and ladies eardrops and traversed by a brook
train-clean-100-103-1240-0001 /cache/data/LibriSpeech/train-clean-100/103/1240/103-1240-0001.flac 15945.0 that had its source away back in the woods of the old cuthbert place it was reputed to be an intricate headlong brook in its earlier course through those woods with dark secrets of pool and cascade but by the time it reached lynde's hollow it was a quiet well conducted little stream
train-clean-100-103-1240-0002 /cache/data/LibriSpeech/train-clean-100/103/1240/103-1240-0002.flac 13945.0 for not even a brook could run past missus rachel lynde's door without due regard for decency and decorum it probably was conscious that missus rachel was sitting at her window keeping a sharp eye on everything that passed from brooks and children up
head -3 lexicon-train+dev-unigram-10000-nbest10.lexicon
a _ a
a _a
a'azam _ a ' a z a m
head -5 seq2seq_tds_librispeech/am/tokens-train-all-unigram-10000.tokens
_the
_and
_of
_to
_a
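For anyone cross-checking their own setup, here is a rough sanity-check sketch over these three files. The file names are just the ones from the excerpts above, the format assumptions (whitespace-separated fields, transcript starting at the fourth column of the .lst file) are read off those excerpts, and this is only an illustration, not part of the recipe.

```python
# Rough sanity check (illustration only): verify that every token used in a
# lexicon spelling is listed in the tokens file, and that every transcript
# word in the .lst file has at least one lexicon entry.
lexicon_path = "lexicon-train+dev-unigram-10000-nbest10.lexicon"
tokens_path = "seq2seq_tds_librispeech/am/tokens-train-all-unigram-10000.tokens"
list_path = "train-clean-100.lst"

tokens = {line.strip() for line in open(tokens_path) if line.strip()}

lexicon = {}  # word -> list of spellings (each spelling is a list of tokens)
with open(lexicon_path) as f:
    for line in f:
        parts = line.split()
        if not parts:
            continue
        word, spelling = parts[0], parts[1:]
        lexicon.setdefault(word, []).append(spelling)
        unknown = [t for t in spelling if t not in tokens]
        if unknown:
            print(f"spelling for '{word}' uses tokens not in the tokens file: {unknown}")

with open(list_path) as f:
    for line in f:
        # .lst format seen above: <id> <audio path> <duration> <transcript ...>
        for w in line.split()[3:]:
            if w not in lexicon:
                print(f"transcript word not in lexicon: {w}")
```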
Yes, I ran those three commands to check the list, lexicon, and tokens files, and I get the same outputs.
By the way, if you still have the training log, would it be possible to post the output from the first 10 epochs? I just want to cross-check it with my log from #388 to see if they are consistent and rule out some possibilities.
@nimz ,
This is the head of my training log:
epoch: 1 | lr: 0.050000 | lrcriterion: 0.050000 | runtime: 00:43:39 | bch(ms): 596.13 | smp(ms): 1.13 | fwd(ms): 253.06 | crit-fwd(ms): 33.48 | bwd(ms): 322.06 | optim(ms): 10.47 | loss: 561.57658 | train-LER: 253.29 | train-WER: 388.09 | dev-clean-loss: 235.81769 | dev-clean-LER: 407.98 | dev-clean-WER: 589.89 | dev-other-loss: 206.67301 | dev-other-LER: 474.91 | dev-other-WER: 669.18 | avg-isz: 1229 | avg-tsz: 135 | max-tsz: 323 | hrs: 960.78 | thrpt(sec/sec): 1320.15
epoch: 2 | lr: 0.050000 | lrcriterion: 0.050000 | runtime: 00:43:30 | bch(ms): 593.91 | smp(ms): 1.13 | fwd(ms): 247.81 | crit-fwd(ms): 32.79 | bwd(ms): 325.58 | optim(ms): 10.28 | loss: 436.51994 | train-LER: 253.29 | train-WER: 388.09 | dev-clean-loss: 188.60904 | dev-clean-LER: 407.98 | dev-clean-WER: 589.89 | dev-other-loss: 168.23695 | dev-other-LER: 474.91 | dev-other-WER: 669.18 | avg-isz: 1229 | avg-tsz: 135 | max-tsz: 323 | hrs: 960.78 | thrpt(sec/sec): 1325.09
epoch: 3 | lr: 0.050000 | lrcriterion: 0.050000 | runtime: 00:43:29 | bch(ms): 593.83 | smp(ms): 1.15 | fwd(ms): 249.30 | crit-fwd(ms): 32.83 | bwd(ms): 323.97 | optim(ms): 10.29 | loss: 295.13772 | train-LER: 253.29 | train-WER: 388.09 | dev-clean-loss: 121.92147 | dev-clean-LER: 407.98 | dev-clean-WER: 589.89 | dev-other-loss: 128.89053 | dev-other-LER: 474.91 | dev-other-WER: 669.18 | avg-isz: 1229 | avg-tsz: 135 | max-tsz: 323 | hrs: 960.78 | thrpt(sec/sec): 1325.28
epoch: 4 | lr: 0.050000 | lrcriterion: 0.050000 | runtime: 00:43:09 | bch(ms): 589.22 | smp(ms): 1.13 | fwd(ms): 247.83 | crit-fwd(ms): 32.58 | bwd(ms): 322.82 | optim(ms): 10.28 | loss: 206.21562 | train-LER: 50.54 | train-WER: 73.08 | dev-clean-loss: 43.07734 | dev-clean-LER: 25.53 | dev-clean-WER: 42.65 | dev-other-loss: 59.84784 | dev-other-LER: 33.95 | dev-other-WER: 57.69 | avg-isz: 1229 | avg-tsz: 135 | max-tsz: 323 | hrs: 960.78 | thrpt(sec/sec): 1335.64
epoch: 5 | lr: 0.050000 | lrcriterion: 0.050000 | runtime: 00:43:24 | bch(ms): 592.72 | smp(ms): 1.14 | fwd(ms): 250.39 | crit-fwd(ms): 32.95 | bwd(ms): 323.52 | optim(ms): 10.28 | loss: 172.13505 | train-LER: 37.31 | train-WER: 56.88 | dev-clean-loss: 35.02143 | dev-clean-LER: 24.84 | dev-clean-WER: 39.53 | dev-other-loss: 46.89210 | dev-other-LER: 33.46 | dev-other-WER: 55.53 | avg-isz: 1229 | avg-tsz: 135 | max-tsz: 323 | hrs: 960.78 | thrpt(sec/sec): 1327.76
epoch: 6 | lr: 0.050000 | lrcriterion: 0.050000 | runtime: 00:43:17 | bch(ms): 590.98 | smp(ms): 1.17 | fwd(ms): 247.89 | crit-fwd(ms): 33.24 | bwd(ms): 324.23 | optim(ms): 10.27 | loss: 156.29622 | train-LER: 33.86 | train-WER: 51.46 | dev-clean-loss: 28.48777 | dev-clean-LER: 19.86 | dev-clean-WER: 30.87 | dev-other-loss: 41.67006 | dev-other-LER: 27.45 | dev-other-WER: 45.11 | avg-isz: 1229 | avg-tsz: 135 | max-tsz: 323 | hrs: 960.78 | thrpt(sec/sec): 1331.66
epoch: 7 | lr: 0.050000 | lrcriterion: 0.050000 | runtime: 00:43:29 | bch(ms): 593.85 | smp(ms): 1.17 | fwd(ms): 250.68 | crit-fwd(ms): 33.82 | bwd(ms): 324.29 | optim(ms): 10.28 | loss: 147.08287 | train-LER: 29.97 | train-WER: 45.31 | dev-clean-loss: 24.70496 | dev-clean-LER: 19.46 | dev-clean-WER: 29.63 | dev-other-loss: 36.79603 | dev-other-LER: 27.04 | dev-other-WER: 44.65 | avg-isz: 1229 | avg-tsz: 135 | max-tsz: 322 | hrs: 960.78 | thrpt(sec/sec): 1325.22
epoch: 8 | lr: 0.050000 | lrcriterion: 0.050000 | runtime: 00:43:17 | bch(ms): 590.97 | smp(ms): 1.13 | fwd(ms): 248.02 | crit-fwd(ms): 33.53 | bwd(ms): 324.10 | optim(ms): 10.29 | loss: 140.45986 | train-LER: 27.56 | train-WER: 40.99 | dev-clean-loss: 22.59436 | dev-clean-LER: 17.87 | dev-clean-WER: 26.41 | dev-other-loss: 33.99397 | dev-other-LER: 24.53 | dev-other-WER: 40.27 | avg-isz: 1229 | avg-tsz: 135 | max-tsz: 322 | hrs: 960.78 | thrpt(sec/sec): 1331.69
epoch: 9 | lr: 0.050000 | lrcriterion: 0.050000 | runtime: 00:43:27 | bch(ms): 593.20 | smp(ms): 1.13 | fwd(ms): 252.04 | crit-fwd(ms): 33.42 | bwd(ms): 322.34 | optim(ms): 10.27 | loss: 135.38188 | train-LER: 26.40 | train-WER: 38.58 | dev-clean-loss: 21.89793 | dev-clean-LER: 16.61 | dev-clean-WER: 24.50 | dev-other-loss: 34.64992 | dev-other-LER: 21.85 | dev-other-WER: 36.13 | avg-isz: 1229 | avg-tsz: 135 | max-tsz: 323 | hrs: 960.78 | thrpt(sec/sec): 1326.67
epoch: 10 | lr: 0.050000 | lrcriterion: 0.050000 | runtime: 00:43:39 | bch(ms): 596.05 | smp(ms): 1.14 | fwd(ms): 255.04 | crit-fwd(ms): 32.45 | bwd(ms): 322.15 | optim(ms): 10.27 | loss: 131.94298 | train-LER: 25.29 | train-WER: 36.80 | dev-clean-loss: 20.12810 | dev-clean-LER: 16.26 | dev-clean-WER: 23.28 | dev-other-loss: 32.30247 | dev-other-LER: 21.30 | dev-other-WER: 35.04 | avg-isz: 1229 | avg-tsz: 135 | max-tsz: 323 | hrs: 960.78 | thrpt(sec/sec): 1320.34
Thanks. That's interesting; the loss, and in fact all of the metrics, are significantly higher than those from my first 10 epochs. It could just be random variation between training runs; it's hard to say. I will run it again to check.
Hi @nimz, @hiroaki-ogawa
The lexicon, tokens, and list files look correct. Regarding the correct arch file: the one from the repository, recipes/models/seq2seq_tds/librispeech/network.arch, is correct. I will fix the file from the link.
I suspect a mismatch between my AM settings and the pre-trained LM.
@hiroaki-ogawa I would expect that if you re-train the model and get similar LER/WER on the dev sets, then you can reach results similar to the paper using the provided pre-trained LM and decoder params. At the moment the AM still looks very bad and far from the numbers reported in the paper (for the model without an external LM); even the train WER/LER is too high. Could you plot the training/dev loss, WER, and LER from your logs and attach the graphs?
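Something along these lines should be enough to pull the per-epoch numbers out of a log like the one pasted above and plot them. This is an untested sketch: the log file name is a placeholder, and the regex relies only on the "key: value" pattern visible in those lines.

```python
# Hedged sketch: parse "epoch: ... | loss: ... | dev-clean-WER: ..." log lines
# (as pasted above) and plot train/dev loss, WER, and LER per epoch.
# The log path is a placeholder.
import re
import matplotlib.pyplot as plt

fields = ["loss", "train-WER", "train-LER",
          "dev-clean-loss", "dev-clean-WER", "dev-clean-LER",
          "dev-other-loss", "dev-other-WER", "dev-other-LER"]
history = {k: [] for k in ["epoch"] + fields}

with open("train.log") as f:
    for line in f:
        if not line.startswith("epoch:"):
            continue
        kv = dict(re.findall(r"([\w-]+):\s*([\d.]+)", line))  # "key: value" pairs
        history["epoch"].append(int(kv["epoch"]))
        for k in fields:
            history[k].append(float(kv[k]))

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, metric in zip(axes, ["loss", "WER", "LER"]):
    prefixes = ["" if metric == "loss" else "train-", "dev-clean-", "dev-other-"]
    for prefix in prefixes:
        key = prefix + metric
        ax.plot(history["epoch"], history[key], label=key)
    ax.set_xlabel("epoch")
    ax.set_title(metric)
    ax.legend()
plt.tight_layout()
plt.savefig("training_curves.png")
```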
@nimz Are you running on 7 GPUs? I would expect differences in the logs, at least during the first epochs, if you train with a different number of GPUs.
I will recheck against our logs.
Yes, I am running on 7 GPUs.
Also, @hiroaki-ogawa's results were at epoch 94, while the AM in the paper was trained for 200 epochs I believe? Could that explain the difference?
Hi @tlikhomanenko , @nimz
Here are plots of the loss, WER, and LER up to epoch 137.
The following is an enlarged view of the LER.
Also, @hiroaki-ogawa's results were at epoch 94, while the AM in the paper was trained for 200 epochs I believe? Could that explain the difference?
Since the loss and WER were landing at too high a level, I stopped the training after epoch 94 and opened this issue. After that, I continued the training (with the continue mode) until epoch 137 and stopped again.
hi @hiroaki-ogawa, @nimz,
I pushed a fix for the lexicon file generation in cfbd32cad980af9a90bb7494da08a79588b39a9b. The order of spellings matters for training because of the way we sample the segmentation of the words. Could you rerun prepare.py (or use the lexicon from the link in the README.md) and then run the training again?
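To make the effect of the ordering concrete, here is a toy illustration. The "first spelling wins" behavior and the p_random value are assumptions for the sake of the example, not wav2letter internals: if the spelling used for a word is, most of the time, the first entry listed in the lexicon, then listing the character-level spelling before the word-piece spelling makes the sampled targets much longer, which matches the drop in avg-tsz from 135 to 042 in the log comparison below.

```python
# Toy illustration only: the sampling bias here is an assumption,
# not the actual wav2letter sampling code.
import random

# Orderings for the word "a", as in the lexicon excerpts earlier in the thread:
old_lexicon = {"a": [["_", "a"], ["_a"]]}    # character-level spelling listed first
fixed_lexicon = {"a": [["_a"], ["_", "a"]]}  # word-piece spelling first (assumed order after the fix)

def sample_spelling(spellings, p_random=0.01):
    """Take the first spelling most of the time, occasionally a random one."""
    if random.random() < p_random:
        return random.choice(spellings)
    return spellings[0]

print(sample_spelling(old_lexicon["a"]))    # usually ['_', 'a'] -> longer target
print(sample_spelling(fixed_lexicon["a"]))  # usually ['_a']     -> shorter target
```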
Hi @tlikhomanenko ,
It looks fine! The training with the fixed lexicon is only at epoch 8, but it already shows much better results.
Thank you.
with old lexicon:
epoch: 8 | lr: 0.050000 | lrcriterion: 0.050000 | runtime: 00:43:17 | bch(ms): 590.97 | smp(ms): 1.13 | fwd(ms): 248.02 | crit-fwd(ms): 33.53 | bwd(ms): 324.10 | optim(ms): 10.29 | loss: 140.45986 | train-LER: 27.56 | train-WER: 40.99 | dev-clean-loss: 22.59436 | dev-clean-LER: 17.87 | dev-clean-WER: 26.41 | dev-other-loss: 33.99397 | dev-other-LER: 24.53 | dev-other-WER: 40.27 | avg-isz: 1229 | avg-tsz: 135 | max-tsz: 322 | hrs: 960.78 | thrpt(sec/sec): 1331.69
with fixed lexicon:
epoch: 8 | lr: 0.050000 | lrcriterion: 0.050000 | runtime: 00:38:12 | bch(ms): 521.63 | smp(ms): 1.14 | fwd(ms): 215.34 | crit-fwd(ms): 10.08 | bwd(ms): 291.81 | optim(ms): 10.33 | loss: 58.94729 | train-LER: 9.20 | train-WER: 18.51 | dev-clean-loss: 13.49021 | dev-clean-LER: 5.06 | dev-clean-WER: 11.58 | dev-other-loss: 22.02445 | dev-other-LER: 13.61 | dev-other-WER: 26.13 | avg-isz: 1229 | avg-tsz: 042 | max-tsz: 095 | hrs: 960.78 | thrpt(sec/sec): 1508.70
@tlikhomanenko, thanks so much for your help! Using the fixed lexicon I am able to train in around 10 minutes per epoch on 7 GPUs, and after evaluating with the pretrained convolutional LM I get 3.30% WER on test-clean and 10.11% on test-other, which seems to be within random variation. This is great!
Closing the issue. Feel free to reopen if it is needed.
Hi @tlikhomanenko ,
Just so you know, I got a WER of 4.32 on test-other with the n-gram LM using the fixed lexicon. Thank you so much for your help!
Hi @hiroaki-ogawa, I think you meant test-clean, right?
Oops, yes I meant test-clean
Hi there, thank you for sharing this great work!
The AM I trained with the seq2seq_tds recipe gives me a poor WER, while the pre-trained AM performs nicely with the same decoder settings.
I used the network.arch and train.cfg at wav2letter/recipes/models/seq2seq_tds/librispeech/. The following flags in train.cfg were modified.
The list, lexicon, and tokens files were generated by recipes/models/seq2seq_tds/librispeech/prepare.py.
The network.arch linked from recipes/models/seq2seq_tds/librispeech/README.md is slightly different from recipes/models/seq2seq_tds/librispeech/network.arch; I used the latter one.
This is the last part of the training log; the WER is still very high.
I wonder where I went wrong. Thank you.