flashlight / wav2letter

Facebook AI Research's Automatic Speech Recognition Toolkit
https://github.com/facebookresearch/wav2letter/wiki

how to support userwords function? #693

Closed luweishuang closed 3 years ago

luweishuang commented 4 years ago

Many ASR engines support a user-words function. For example, the original ASR result for sample.wav is "中美数控", but if you supply a userwords.txt that contains "中美速控", the final ASR result becomes "中美速控" instead of "中美数控". It acts like an ASR result corrector. I think I need to do something with the LM's word frequencies to get this behavior, but I don't know where to add it. I'm using KenLM. samples.zip

tlikhomanenko commented 4 years ago

@luweishuang

Do you want to do post-processing, or something on the fly during decoding? I'm not sure I understand the functionality you want to add.

abhinavkulkarni commented 4 years ago

@tlikhomanenko:

Not the original poster, but here's my question.

I am currently using the model described in the inference section of the Wiki.

I see that the model has a subword units file (token.txt) and a pronunciation file (lexicon.txt).

If I wanted to introduce a new word, say coronavirus, which currently isn't recognized by the system, how would I go about it?

  1. Would I create an entry in lexicon.txt detailing how the word is pronounced in terms of its subword units?
  2. Do I need to retrain the language model to introduce this word?

Thanks, Abhinav

luweishuang commented 4 years ago

@tlikhomanenko I want to do it on the fly during decoding, because post-processing would need to load the language model again, and I think it should be merged into the decoding module.

davidbelle commented 4 years ago

On top of @abhinavkulkarni's and @luweishuang's questions, could you possibly answer mine as well? It is along the same lines.

Is there a way to increase the probability that the wav2letter decoder will match common combinations of words? For example, a lot of the audio I am transcribing contains multi-word street or suburb names, such as "King Street" or "Homebush bay drive". I have a list of these, I'm just not sure how to tell wav2letter about them.

Thanks!

tlikhomanenko commented 4 years ago

@luweishuang

I'm still a bit confused about what correction you want to do and why the LM is needed here. Could you give an example in English to provide more context?

@abhinavkulkarni

When you run lexicon-based decoding, only words from the lexicon file can be inferred. So if you want to infer a new word, you need to add it to the lexicon file. Because the AM is trained on word pieces, it can potentially infer this new word. As for the LM: it is word-based, so it treats a new word (not in its vocab) as unk and adds the unk score. There are two cases: 1) this could be fine and we rely on the AM's ability to infer the word, or 2) the AM cannot infer it and the LM cannot help. First try just adding the word to the lexicon and see how it behaves. If that works badly, you can retrain the LM with a larger vocab on sentences containing the new words (with an ngram LM you can train another LM on the additional data and interpolate, or retrain on the combined corpus). Another solution is lexicon-free decoding (not sure it is supported in the inference pipeline right now), where you have a token-based LM that can potentially infer new, unknown words.
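For reference, a minimal standalone sketch (not part of the recipes) of that lexicon step, assuming the recipe's sentencepiece word-piece model is available locally; the .model filename is a placeholder, and the "▁" -> "_" remapping is an assumption mirroring the "_fire"-style token spelling:

// Hypothetical sketch: spell a new word with the already-trained word-piece model
// so the resulting line can be appended to the lexicon file.
#include <sentencepiece_processor.h>

#include <iostream>
#include <string>
#include <vector>

int main() {
  sentencepiece::SentencePieceProcessor sp;
  if (!sp.Load("librispeech-unigram-10000.model").ok()) {  // placeholder model path
    std::cerr << "cannot load word-piece model" << std::endl;
    return 1;
  }
  std::vector<std::string> pieces;
  if (!sp.Encode("coronavirus", &pieces).ok()) {
    return 1;
  }

  std::cout << "coronavirus";
  const std::string spmSep = "\xe2\x96\x81";  // sentencepiece's "▁" word-boundary marker
  for (auto piece : pieces) {
    if (piece.compare(0, spmSep.size(), spmSep) == 0) {
      piece = "_" + piece.substr(spmSep.size());  // match the "_fire"-style tokens
    }
    std::cout << " " << piece;
  }
  // Prints a candidate lexicon line, e.g. "coronavirus _corona v i r u s";
  // the actual pieces depend on the trained word-piece model.
  std::cout << std::endl;
  return 0;
}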

@davidbelle

One solution I see (which is not too complicated), given that you have priors on phrases (ngrams of length >= 1) rather than on single words, is to train an ngram LM on your set of special phrases. Then you add this additional LM into the decoder with its own weight, and during decoding you optimize AM_score + alpha * lm + beta * lm_special + gamma * word_score. How about this? It is not supported, so you would need to change the code a bit to have it. Another solution is to train the LM including these data too; then the LM should learn the appearance and context of your phrases.
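This two-LM combination is not implemented in the decoder; just to make the objective concrete, here is a hedged standalone sketch that scores a single hypothesis with two KenLM models and combines the AM score, both LM scores, and a word insertion score. All paths, weights, and the AM score are placeholders:

// Illustration of AM_score + alpha * lm + beta * lm_special + gamma * word_score.
// Not wav2letter decoder code; paths, weights, and amScore are made up.
#include "lm/model.hh"

#include <iostream>
#include <sstream>
#include <string>

using namespace lm::ngram;

// Sum of per-word log10 probabilities under a KenLM model (OOV words hit <unk>).
float lmScore(const Model& model, const std::string& sentence) {
  State state(model.BeginSentenceState()), out;
  const auto& vocab = model.GetVocabulary();
  std::istringstream words(sentence);
  std::string w;
  float score = 0.f;
  while (words >> w) {
    score += model.Score(state, vocab.Index(w), out);
    state = out;
  }
  return score;
}

int main() {
  Model lm("lm_main.arpa");                // main word LM (placeholder path)
  Model lmSpecial("street_phrases.arpa");  // small LM trained only on the special phrases
  const float alpha = 0.9f, beta = 0.3f, gamma = 0.6f;  // weights to tune on a dev set
  const std::string hyp = "turn left onto homebush bay drive";
  const float amScore = -12.3f;  // would come from the acoustic model during beam search
  const int numWords = 6;
  const float total = amScore + alpha * lmScore(lm, hyp) +
      beta * lmScore(lmSpecial, hyp) + gamma * numWords;
  std::cout << "combined score: " << total << std::endl;
  return 0;
}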

davidbelle commented 4 years ago

Thanks so much. Very helpful.

I’m gonna try my hand at the second option and see how we go 👍

davidbelle commented 4 years ago

Hi @tlikhomanenko

I'm still having a lot of difficulty with this. I have tried lots of different things, but I'll post here what I've tried and what my train of thought is, and hopefully you can help me understand. This is based on your comment about adding your own custom word to the lexicon file.

The word I am experimenting with is "firecomm". I figured that since both "fire" and "comm" or "com" are in the tokens file, this should work OK. I added the following line to my lexicon file (keeping it in alphabetical order): "firecomm _firecomm". I noticed other words have a few variations such as "_fire comm" or "_firecom m", so I experimented with a few of these too. Anyway, running the decoder crashes with this:

terminate called after throwing an instance of 'std::invalid_argument' what(): Unknown entry in dictionary: '_firecomm'

I tried adding "_firecomm" to the bottom of my tokens file, and now every third or fourth word is "firecomm". I initially had it in the middle of the tokens file (at a random position) and every third or fourth word was "bewitched", which is near the bottom of the file. I would love to hear your thoughts on what I'm doing wrong.

For context, I am using pretrained sota model CTC resnet.

decoder config file:

--am=/1tb/models/am_resnet_ctc_librivox_dev_clean.bin
--tokensdir=/1tb/models
--tokens=librispeech-train-all-unigram-10000.tokens
--lexicon=/1tb/sota/models/decoder/decoder-unigram-10000-nbest10.lexicon
--lm=/1tb/sota/models/decoder/lm_librispeech_kenlm_wp_10k_6gram_pruning_000012.bin
--datadir=/1tb/tnv
--test=testing.list
--uselexicon=true
--decodertype=wrd
--lmtype=kenlm
--silscore=0
--beamsize=500
--beamsizetoken=100
--beamthreshold=100
--nthread_decoder=3
--smearing=max
--show
--showletters
--lmweight=0.86994439339913
--wordscore=0.58878028376141
--maxtsz=1000000000
--maxisz=1000000000
--minisz=0
--mintsz=0

I have 3 GPUs with 6 GB of video memory each, if that helps.

I will keep trying the other options for now (such as training a brand new language model) until I get stuck again.

tlikhomanenko commented 4 years ago

First, you cannot add new tokens to the token set of a trained model. The model outputs probabilities with a size equal to the number of tokens it was trained with, so when it predicts, say, position K, whatever token sits at position K in the tokens file is what gets emitted. If you change what is at position K in the tokens file, the model will output that token instead of the correct one it was trained on. The tokens file just gives the mapping from probability index to token.

Second, you cannot add the word as "firecomm _firecomm"; the spelling must be constructed from the tokens you trained on. So try applying the word-piece model (as we do for data pre-processing and lexicon construction) to "firecomm". I guess it will give you something like "_fire c o m m" (if there are no tokens like "comm", "com", etc., the remainder will be spelled as a letter sequence, while "_fire" exists for sure).
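As a sanity check (not part of the recipes), a sketch like the following could verify a proposed spelling against the trained tokens file before the line is appended to the lexicon, which would catch entries like "_firecomm" that the decoder's token dictionary does not know. The hard-coded spelling is illustrative only:

// Hypothetical helper: make sure every piece of a proposed lexicon spelling
// exists in the trained tokens file, so the token dictionary never sees an
// unknown entry such as "_firecomm".
#include <fstream>
#include <iostream>
#include <set>
#include <string>
#include <vector>

int main() {
  std::set<std::string> tokens;
  std::ifstream tokensFile("librispeech-train-all-unigram-10000.tokens");
  for (std::string line; std::getline(tokensFile, line);) {
    tokens.insert(line);
  }
  // Candidate spelling as produced by the word-piece model (illustrative only).
  const std::vector<std::string> spelling = {"_fire", "c", "o", "m", "m"};
  for (const auto& piece : spelling) {
    if (tokens.count(piece) == 0) {
      std::cerr << "piece not in tokens file: " << piece << std::endl;
      return 1;
    }
  }
  std::cout << "firecomm";
  for (const auto& piece : spelling) {
    std::cout << " " << piece;
  }
  std::cout << std::endl;  // safe to append this line to the lexicon file
  return 0;
}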

davidbelle commented 4 years ago

@tlikhomanenko thanks so much for your patience.

Ah, OK. I misunderstood your earlier comment then (that approach did seem way too easy...).

I have already started with the instructions in recipes/models/sota/2019/lm about data preparation. Running prepare_wp_data.py against the existing LibriSpeech text plus my own text worked great, and I can see the lm_wp_10k.train file contains tokens added as a result of my own text. Following the steps after that, I'm not seeing any difference, but it's possible I missed something. Going back over it again to be sure. Thanks.

tlikhomanenko commented 4 years ago

@davidbelle I didn't get what problem you have now. Could you be more precise?

davidbelle commented 4 years ago

Hi.

I don’t think there is a problem, it’s just not able to transcribe that word and I will have to try retraining the acoustic model.

Can I use "Train fork" on pre-built models? That should work, right?

tlikhomanenko commented 4 years ago

Yep, in the case that you use the same token set and use the word-piece sequences for your new words in the same way as I listed above.

davidbelle commented 4 years ago

Hi @tlikhomanenko

Is there a minimum total amount of data needed for fine-tuning a pre-built model to work? I am slowly building my own dataset with custom words and phrases. So far I only have 3 hours. Is that enough to fine-tune a model?

Currently my loss output graphs are like this:

image

No idea what I'm doing wrong (unless it's simply not a big enough data set to fine tune the model).

For reference, I am trying to fine-tune the sota CTC resnet model. I was able to create the sentencepieces for my custom words, and I'm not seeing any errors. I have copied the flagsfile straight from the recipe and have had to change --batchsize to 1 because of out-of-memory errors. It points to my new lexicon file, which includes both the pre-built lexicon entries and my new ones.

I have 820 samples (638 in training, 182 in test) and 3 GPUs. I have set the iter value to 136667 ((500 * 820) / 3). I am upsampling the 8 kHz audio to 16 kHz; this works when transcribing audio with the pre-built models, so I would expect it to work here too. As a heads up, the audio comes from CB radios transmitted over the air at an 8 kHz sample rate, so the quality is poor; however, the existing pre-built models do a pretty good job as is. Perhaps the problem is related to this? Some of the training audio is possibly too long: the longest file is 90 seconds, the average is 12.

Any help would be greatly appreciated!!

davidbelle commented 4 years ago

Forgot to include the log. Here are the first 10 and last 10 lines, the latter being when I stopped training.

epoch:        1 | nupdates:           10 | lr: 0.400000 | lrcriterion: 0.000000 | runtime: 00:00:45 | bch(ms): 4587.87 | smp(ms): 23.88 | fwd(ms): 351.46 | crit-fwd(ms): 5.15 | bwd(ms): 3859.92 | optim(ms): 207.39 | loss:   35.84993 | train-LER: 73.91 | train-WER: 96.86 | valid.list-loss:   23.11712 | valid.list-LER: 94.13 | valid.list-WER: 98.93 | avg-isz: 1290 | avg-tsz: 152 | max-tsz: 456 | hrs:    0.11 | thrpt(sec/sec): 8.44
epoch:        1 | nupdates:           20 | lr: 0.400000 | lrcriterion: 0.000000 | runtime: 00:00:48 | bch(ms): 4843.78 | smp(ms): 9.58 | fwd(ms): 108.01 | crit-fwd(ms): 4.03 | bwd(ms): 4625.26 | optim(ms): 100.48 | loss:   24.86444 | train-LER: 86.33 | train-WER: 99.23 | valid.list-loss:   23.37962 | valid.list-LER: 99.81 | valid.list-WER: 99.61 | avg-isz: 1123 | avg-tsz: 149 | max-tsz: 314 | hrs:    0.09 | thrpt(sec/sec): 6.96
epoch:        1 | nupdates:           30 | lr: 0.400000 | lrcriterion: 0.000000 | runtime: 00:00:47 | bch(ms): 4799.04 | smp(ms): 5.57 | fwd(ms): 132.50 | crit-fwd(ms): 4.75 | bwd(ms): 4558.73 | optim(ms): 100.50 | loss:   25.93582 | train-LER: 90.46 | train-WER: 99.77 | valid.list-loss:   25.73560 | valid.list-LER: 93.17 | valid.list-WER: 100.00 | avg-isz: 1247 | avg-tsz: 143 | max-tsz: 341 | hrs:    0.10 | thrpt(sec/sec): 7.80
epoch:        1 | nupdates:           40 | lr: 0.399999 | lrcriterion: 0.000000 | runtime: 00:00:49 | bch(ms): 4920.93 | smp(ms): 5.12 | fwd(ms): 166.89 | crit-fwd(ms): 5.35 | bwd(ms): 4647.92 | optim(ms): 98.54 | loss:   28.65364 | train-LER: 87.34 | train-WER: 99.91 | valid.list-loss:   18.53787 | valid.list-LER: 87.59 | valid.list-WER: 100.00 | avg-isz: 1626 | avg-tsz: 185 | max-tsz: 528 | hrs:    0.14 | thrpt(sec/sec): 9.92
epoch:        1 | nupdates:           50 | lr: 0.399999 | lrcriterion: 0.000000 | runtime: 00:00:47 | bch(ms): 4760.10 | smp(ms): 3.22 | fwd(ms): 77.21 | crit-fwd(ms): 2.75 | bwd(ms): 4577.10 | optim(ms): 99.24 | loss:   18.23863 | train-LER: 85.34 | train-WER: 98.72 | valid.list-loss:   17.31584 | valid.list-LER: 89.20 | valid.list-WER: 99.66 | avg-isz: 776 | avg-tsz: 101 | max-tsz: 265 | hrs:    0.06 | thrpt(sec/sec): 4.89
epoch:        1 | nupdates:           60 | lr: 0.399998 | lrcriterion: 0.000000 | runtime: 00:00:47 | bch(ms): 4736.57 | smp(ms): 2.73 | fwd(ms): 78.75 | crit-fwd(ms): 3.53 | bwd(ms): 4551.37 | optim(ms): 99.70 | loss:   26.29023 | train-LER: 87.42 | train-WER: 99.68 | valid.list-loss:   19.84615 | valid.list-LER: 80.93 | valid.list-WER: 99.95 | avg-isz: 845 | avg-tsz: 105 | max-tsz: 197 | hrs:    0.07 | thrpt(sec/sec): 5.36
epoch:        1 | nupdates:           70 | lr: 0.399997 | lrcriterion: 0.000000 | runtime: 00:00:48 | bch(ms): 4809.60 | smp(ms): 0.71 | fwd(ms): 106.07 | crit-fwd(ms): 4.40 | bwd(ms): 4598.09 | optim(ms): 100.70 | loss:   24.12762 | train-LER: 87.03 | train-WER: 99.74 | valid.list-loss:   19.64658 | valid.list-LER: 79.29 | valid.list-WER: 98.77 | avg-isz: 1055 | avg-tsz: 125 | max-tsz: 271 | hrs:    0.09 | thrpt(sec/sec): 6.59
epoch:        1 | nupdates:           80 | lr: 0.399995 | lrcriterion: 0.000000 | runtime: 00:00:57 | bch(ms): 5731.91 | smp(ms): 9.45 | fwd(ms): 162.02 | crit-fwd(ms): 5.74 | bwd(ms): 5462.79 | optim(ms): 98.46 | loss:   34.14181 | train-LER: 82.25 | train-WER: 101.94 | valid.list-loss:   23.08597 | valid.list-LER: 57.66 | valid.list-WER: 86.90 | avg-isz: 1498 | avg-tsz: 171 | max-tsz: 521 | hrs:    0.12 | thrpt(sec/sec): 7.84
epoch:        1 | nupdates:           90 | lr: 0.399993 | lrcriterion: 0.000000 | runtime: 00:00:55 | bch(ms): 5582.12 | smp(ms): 11.12 | fwd(ms): 116.59 | crit-fwd(ms): 3.25 | bwd(ms): 5357.49 | optim(ms): 97.47 | loss:   29.96360 | train-LER: 90.81 | train-WER: 99.37 | valid.list-loss:   41.05462 | valid.list-LER: 56.80 | valid.list-WER: 84.65 | avg-isz: 1100 | avg-tsz: 132 | max-tsz: 529 | hrs:    0.09 | thrpt(sec/sec): 5.91
epoch:        1 | nupdates:          100 | lr: 0.399991 | lrcriterion: 0.000000 | runtime: 00:00:57 | bch(ms): 5766.80 | smp(ms): 6.41 | fwd(ms): 205.87 | crit-fwd(ms): 7.04 | bwd(ms): 5454.22 | optim(ms): 98.02 | loss:   43.39190 | train-LER: 87.01 | train-WER: 99.93 | valid.list-loss:   47.03883 | valid.list-LER: 98.07 | valid.list-WER: 98.17 | avg-isz: 2027 | avg-tsz: 222 | max-tsz: 646 | hrs:    0.17 | thrpt(sec/sec): 10.55

epoch:        6 | nupdates:         1020 | lr: 0.399071 | lrcriterion: 0.000000 | runtime: 00:00:48 | bch(ms): 4801.97 | smp(ms): 7.82 | fwd(ms): 113.66 | crit-fwd(ms): 4.31 | bwd(ms): 4580.00 | optim(ms): 98.33 | loss:   21.12011 | train-LER: 88.68 | train-WER: 100.46 | valid.list-loss:   16.49451 | valid.list-LER: 87.09 | valid.list-WER: 99.58 | avg-isz: 1166 | avg-tsz: 141 | max-tsz: 322 | hrs:    0.10 | thrpt(sec/sec): 7.29
epoch:        6 | nupdates:         1030 | lr: 0.398794 | lrcriterion: 0.000000 | runtime: 00:00:47 | bch(ms): 4776.84 | smp(ms): 7.42 | fwd(ms): 96.00 | crit-fwd(ms): 2.91 | bwd(ms): 4572.96 | optim(ms): 97.89 | loss:   17.46377 | train-LER: 93.17 | train-WER: 99.86 | valid.list-loss:   16.45894 | valid.list-LER: 87.14 | valid.list-WER: 99.90 | avg-isz: 971 | avg-tsz: 113 | max-tsz: 392 | hrs:    0.08 | thrpt(sec/sec): 6.10
epoch:        6 | nupdates:         1040 | lr: 0.398511 | lrcriterion: 0.000000 | runtime: 00:00:50 | bch(ms): 5084.53 | smp(ms): 4.64 | fwd(ms): 233.22 | crit-fwd(ms): 7.43 | bwd(ms): 4746.37 | optim(ms): 99.01 | loss:   24.52860 | train-LER: 90.44 | train-WER: 99.43 | valid.list-loss:   16.66054 | valid.list-LER: 98.48 | valid.list-WER: 100.00 | avg-isz: 2176 | avg-tsz: 229 | max-tsz: 686 | hrs:    0.18 | thrpt(sec/sec): 12.84
epoch:        6 | nupdates:         1050 | lr: 0.398224 | lrcriterion: 0.000000 | runtime: 00:00:47 | bch(ms): 4757.21 | smp(ms): 2.90 | fwd(ms): 88.36 | crit-fwd(ms): 3.16 | bwd(ms): 4562.94 | optim(ms): 97.91 | loss:   16.88405 | train-LER: 92.73 | train-WER: 100.29 | valid.list-loss:   16.70688 | valid.list-LER: 99.04 | valid.list-WER: 99.53 | avg-isz: 908 | avg-tsz: 111 | max-tsz: 240 | hrs:    0.08 | thrpt(sec/sec): 5.73
epoch:        6 | nupdates:         1060 | lr: 0.397931 | lrcriterion: 0.000000 | runtime: 00:00:47 | bch(ms): 4759.71 | smp(ms): 5.21 | fwd(ms): 101.80 | crit-fwd(ms): 3.46 | bwd(ms): 4550.17 | optim(ms): 98.84 | loss:   16.65116 | train-LER: 94.25 | train-WER: 99.44 | valid.list-loss:   17.55288 | valid.list-LER: 91.90 | valid.list-WER: 99.79 | avg-isz: 938 | avg-tsz: 118 | max-tsz: 273 | hrs:    0.08 | thrpt(sec/sec): 5.92
epoch:        6 | nupdates:         1070 | lr: 0.397632 | lrcriterion: 0.000000 | runtime: 00:00:48 | bch(ms): 4864.14 | smp(ms): 3.48 | fwd(ms): 127.17 | crit-fwd(ms): 4.41 | bwd(ms): 4631.69 | optim(ms): 97.72 | loss:   18.29403 | train-LER: 94.81 | train-WER: 99.79 | valid.list-loss:   16.43875 | valid.list-LER: 93.23 | valid.list-WER: 100.03 | avg-isz: 1255 | avg-tsz: 151 | max-tsz: 456 | hrs:    0.10 | thrpt(sec/sec): 7.74
epoch:        6 | nupdates:         1080 | lr: 0.397329 | lrcriterion: 0.000000 | runtime: 00:00:48 | bch(ms): 4867.46 | smp(ms): 7.16 | fwd(ms): 136.72 | crit-fwd(ms): 4.87 | bwd(ms): 4624.53 | optim(ms): 98.08 | loss:   18.66156 | train-LER: 90.88 | train-WER: 99.81 | valid.list-loss:   16.30794 | valid.list-LER: 90.83 | valid.list-WER: 99.71 | avg-isz: 1426 | avg-tsz: 174 | max-tsz: 558 | hrs:    0.12 | thrpt(sec/sec): 8.79
epoch:        6 | nupdates:         1090 | lr: 0.397020 | lrcriterion: 0.000000 | runtime: 00:00:48 | bch(ms): 4872.50 | smp(ms): 6.05 | fwd(ms): 137.99 | crit-fwd(ms): 4.49 | bwd(ms): 4628.75 | optim(ms): 97.72 | loss:   18.53289 | train-LER: 88.25 | train-WER: 99.80 | valid.list-loss:   16.21484 | valid.list-LER: 96.39 | valid.list-WER: 99.90 | avg-isz: 1318 | avg-tsz: 161 | max-tsz: 473 | hrs:    0.11 | thrpt(sec/sec): 8.11
epoch:        6 | nupdates:         1100 | lr: 0.396705 | lrcriterion: 0.000000 | runtime: 00:00:47 | bch(ms): 4786.71 | smp(ms): 7.85 | fwd(ms): 105.63 | crit-fwd(ms): 4.75 | bwd(ms): 4577.28 | optim(ms): 97.39 | loss:   16.04783 | train-LER: 86.90 | train-WER: 99.56 | valid.list-loss:   15.88104 | valid.list-LER: 93.19 | valid.list-WER: 99.79 | avg-isz: 943 | avg-tsz: 110 | max-tsz: 291 | hrs:    0.08 | thrpt(sec/sec): 5.91
epoch:        6 | nupdates:         1110 | lr: 0.396385 | lrcriterion: 0.000000 | runtime: 00:00:49 | bch(ms): 4949.81 | smp(ms): 7.08 | fwd(ms): 169.98 | crit-fwd(ms): 5.98 | bwd(ms): 4674.02 | optim(ms): 97.14 | loss:   21.82407 | train-LER: 88.94 | train-WER: 99.49 | valid.list-loss:   16.28537 | valid.list-LER: 90.90 | valid.list-WER: 99.79 | avg-isz: 1592 | avg-tsz: 196 | max-tsz: 520 | hrs:    0.13 | thrpt(sec/sec): 9.65

And my flagsfile (iter is passed through as a command line argument).

--rundir=/1tb/training/model
--archdir=/1tb/models/arch
--arch=am_resnet_ctc.arch
--tokensdir=/1tb/models/production
--tokens=librispeech-train-all-unigram-10000.tokens
--datadir=/1tb/training/text
--lexicon=/1tb/training/fairseq/decoder.lexicon
--train=test.list,dev.list
--valid=valid.list
--criterion=ctc
--mfsc
--labelsmooth=0.05
--wordseparator=_
--usewordpiece=true
--sampletarget=0.01
--lr=0.4
--linseg=0
--momentum=0.6
--maxgradnorm=1
--onorm=target
--sqnorm
--nthread=4
--batchsize=1
--filterbanks=80
--lrcosine
--minloglevel=0
--mintsz=2
--minisz=200
--reportiters=10
--logtostderr
--enable_distributed

tlikhomanenko commented 4 years ago

Are you fine-tuning only the last linear layer, the one that maps to the token set? With a small dataset I think you should start with that. Also, since you have a smaller batchsize, you need to tune the learning rate; try decreasing it. Are you using specaug (you can check whether it appears in the printed layers in the log)? If so, that would at least explain the variation in the training loss. Can you also post the plot with the Viterbi WER?

davidbelle commented 4 years ago

Thanks @tlikhomanenko

I will try to do as you have said using the example found in #507. In #737 you mentioned the last layer will need to be deleted and re-created again, not sure how to do that programmatically but hopefully I'll figure it out. I played around with lowering the lr earlier, will try again once I have figured out how to recreate the last layer as mentioned above.

Here is the valid WER and train WER plot: image

I did indeed find specaug in the output. See below.

I0716 11:22:56.841650 23719 Train.cpp:114] Parsing command line flags
I0716 11:22:56.841660 23719 Train.cpp:115] Overriding flags should be mutable when using `fork`
I0716 11:22:56.841686 23719 Train.cpp:120] Reading flags from file/1tb/training/train.cfg
I0716 11:22:56.841650 23720 Train.cpp:114] Parsing command line flags
I0716 11:22:56.841660 23720 Train.cpp:115] Overriding flags should be mutable when using `fork`
I0716 11:22:56.841686 23720 Train.cpp:120] Reading flags from file/1tb/training/train.cfg
I0716 11:22:56.841656 23721 Train.cpp:114] Parsing command line flags
I0716 11:22:56.841678 23721 Train.cpp:115] Overriding flags should be mutable when using `fork`
I0716 11:22:56.841689 23721 Train.cpp:120] Reading flags from file/1tb/training/train.cfg
Initialized NCCL 2.4.8 successfully!
I0716 11:22:57.475662 23719 Train.cpp:151] Gflags after parsing 
--flagfile=; --fromenv=; --tryfromenv=; --undefok=; --tab_completion_columns=80; --tab_completion_word=; --help=false; --helpfull=false; --helpmatch=; --helpon=; --helppackage=false; --helpshort=false; --helpxml=false; --version=false; --adambeta1=0.90000000000000002; --adambeta2=0.999; --am=; --am_decoder_tr_dropout=0; --am_decoder_tr_layerdrop=0; --am_decoder_tr_layers=1; --arch=am_resnet_ctc.arch; --archdir=/1tb/models/arch; --attention=content; --attentionthreshold=0; --attnWindow=no; --attnconvchannel=0; --attnconvkernel=0; --attndim=0; --batchsize=1; --beamsize=2500; --beamsizetoken=250000; --beamthreshold=25; --blobdata=false; --channels=1; --criterion=ctc; --critoptim=sgd; --datadir=/1tb/training/text; --dataorder=input; --decoderattnround=1; --decoderdropout=0; --decoderrnnlayer=1; --decodertype=wrd; --devwin=0; --emission_dir=; --emission_queue_size=3000; --enable_distributed=true; --encoderdim=0; --eosscore=0; --eostoken=false; --everstoredb=false; --fftcachesize=1; --filterbanks=80; --flagsfile=/1tb/training/train.cfg; --framesizems=25; --framestridems=10; --gamma=1; --gumbeltemperature=1; --input=flac; --inputbinsize=100; --inputfeeding=false; --isbeamdump=false; --iter=148500; --itersave=false; --labelsmooth=0.050000000000000003; --leftWindowSize=50; --lexicon=/1tb/training/fairseq/decoder.lexicon; --linlr=-1; --linlrcrit=-1; --linseg=0; --lm=; --lm_memory=5000; --lm_vocab=; --lmtype=kenlm; --lmweight=0; --localnrmlleftctx=0; --localnrmlrightctx=0; --logadd=false; --lr=0.40000000000000002; --lr_decay=9223372036854775807; --lr_decay_step=9223372036854775807; --lrcosine=true; --lrcrit=0; --max_devices_per_node=8; --maxdecoderoutputlen=200; --maxgradnorm=1; --maxisz=9223372036854775807; --maxload=-1; --maxrate=10; --maxsil=50; --maxtsz=9223372036854775807; --maxword=-1; --melfloor=1; --memstepsize=10485760; --mfcc=false; --mfcccoeffs=13; --mfsc=true; --minisz=200; --minrate=3; --minsil=0; --mintsz=2; --momentum=0.59999999999999998; --netoptim=sgd; --noresample=false; --nthread=4; --nthread_decoder=1; --nthread_decoder_am_forward=1; --numattnhead=8; --onorm=target; --optimepsilon=1e-08; --optimrho=0.90000000000000002; --outputbinsize=5; --pctteacherforcing=100; --pcttraineval=100; --pow=false; --pretrainWindow=0; --replabel=0; --reportiters=10; --rightWindowSize=50; --rndv_filepath=; --rundir=/1tb/training/model; --runname=tnvdata; --samplerate=16000; --sampletarget=0.01; --samplingstrategy=rand; --saug_fmaskf=27; --saug_fmaskn=2; --saug_start_update=-1; --saug_tmaskn=2; --saug_tmaskp=1; --saug_tmaskt=100; --sclite=; --seed=0; --show=false; --showletters=false; --silscore=0; --smearing=none; --smoothingtemperature=1; --softwoffset=10; --softwrate=5; --softwstd=5; --sqnorm=true; --stepsize=1000000; --surround=; --tag=; --target=ltr; --test=; --tokens=librispeech-train-all-unigram-10000.tokens; --tokensdir=/1tb/models/production; --train=test.list,dev.list; --trainWithWindow=false; --transdiag=0; --unkscore=-inf; --use_memcache=false; --uselexicon=true; --usewordpiece=true; --valid=valid.list; --warmup=1; --weightdecay=0; --wordscore=0; --wordseparator=_; --world_rank=0; --world_size=1; --alsologtoemail=; --alsologtostderr=false; --colorlogtostderr=false; --drop_log_memory=true; --log_backtrace_at=; --log_dir=; --log_link=; --log_prefix=true; --logbuflevel=0; --logbufsecs=30; --logemaillevel=999; --logfile_mode=436; --logmailer=/bin/mail; --logtostderr=true; --max_log_size=1800; --minloglevel=0; --stderrthreshold=2; --stop_logging_if_full_disk=false; --symbolize_stacktrace=true; 
--v=0; --vmodule=; 
I0716 11:22:57.476999 23719 Train.cpp:152] Experiment path: /1tb/training/model/tnvdata
I0716 11:22:57.477013 23719 Train.cpp:153] Experiment runidx: 1
I0716 11:22:57.490942 23720 Train.cpp:199] Number of classes (network): 9998
I0716 11:22:57.492189 23719 Train.cpp:199] Number of classes (network): 9998
I0716 11:22:57.492309 23721 Train.cpp:199] Number of classes (network): 9998
I0716 11:22:58.880997 23719 Train.cpp:206] Number of words: 200356
I0716 11:22:58.883731 23720 Train.cpp:206] Number of words: 200356
I0716 11:22:58.884605 23721 Train.cpp:206] Number of words: 200356
I0716 11:23:00.683171 23721 W2lListFilesDataset.cpp:109] Empty dataset
I0716 11:23:00.683213 23721 W2lListFilesDataset.cpp:109] Empty dataset
I0716 11:23:00.683216 23721 Utils.cpp:102] Filtered 0/0 samples
I0716 11:23:00.683233 23721 W2lListFilesDataset.cpp:62] Total batches (i.e. iters): 0
I0716 11:23:00.727238 23720 W2lListFilesDataset.cpp:109] Empty dataset
I0716 11:23:00.727275 23720 W2lListFilesDataset.cpp:109] Empty dataset
I0716 11:23:00.727278 23720 Utils.cpp:102] Filtered 0/0 samples
I0716 11:23:00.727293 23720 W2lListFilesDataset.cpp:62] Total batches (i.e. iters): 0
I0716 11:23:01.128192 23721 W2lListFilesDataset.cpp:109] Empty dataset
I0716 11:23:01.128227 23721 Utils.cpp:102] Filtered 0/0 samples
I0716 11:23:01.128232 23721 W2lListFilesDataset.cpp:62] Total batches (i.e. iters): 0
I0716 11:23:01.166246 23720 W2lListFilesDataset.cpp:109] Empty dataset
I0716 11:23:01.166280 23720 Utils.cpp:102] Filtered 0/0 samples
I0716 11:23:01.166286 23720 W2lListFilesDataset.cpp:62] Total batches (i.e. iters): 0
I0716 11:23:01.428975 23719 Train.cpp:252] [Network] Sequential [input -> (0) -> (1) -> (2) -> (3) -> (4) -> (5) -> (6) -> (7) -> (8) -> (9) -> (10) -> (11) -> (12) -> (13) -> (14) -> (15) -> (16) -> (17) -> (18) -> (19) -> (20) -> (21) -> (22) -> (23) -> (24) -> (25) -> (26) -> (27) -> (28) -> (29) -> (30) -> (31) -> (32) -> (33) -> (34) -> (35) -> (36) -> (37) -> (38) -> (39) -> (40) -> (41) -> (42) -> (43) -> (44) -> (45) -> (46) -> (47) -> (48) -> (49) -> (50) -> (51) -> (52) -> (53) -> (54) -> (55) -> (56) -> (57) -> (58) -> (59) -> (60) -> (61) -> (62) -> (63) -> (64) -> (65) -> (66) -> (67) -> (68) -> (69) -> (70) -> output]
    (0): SpecAugment ( W: 80, F: 27, mF: 2, T: 100, p: 1, mT: 2 )
    (1): View (-1 1 80 0)
    (2): Conv2D (80->1024, 3x1, 2,1, SAME,0, 1, 1) (with bias)
    (3): ReLU
    (4): Dropout (0.150000)
    (5): LayerNorm ( axis : { 0 1 2 } , size : -1)
    (6): 
    Res(0): Input; skip connection to output 
    Res(1): Conv2D (1024->1024, 3x1, 1,1, SAME,0, 1, 1) (with bias)
    Res(2): ReLU
    Res(3): Dropout (0.150000)
    Res(4): LayerNorm ( axis : { 0 1 2 } , size : -1)
    Res(5): Conv2D (1024->1024, 3x1, 1,1, SAME,0, 1, 1) (with bias)
    Res(6): ReLU
    Res(7): Dropout (0.150000)
    Res(8): LayerNorm ( axis : { 0 1 2 } , size : -1)
    Res(9): Conv2D (1024->1024, 3x1, 1,1, SAME,0, 1, 1) (with bias) with scale (before layer is applied) 0.70711;
    Res(10): Output;
    (7): ReLU
    (8): Dropout (0.150000)
    (9): LayerNorm ( axis : { 0 1 2 } , size : -1)
    (10): 
    Res(0): Input; skip connection to output 
    Res(1): Conv2D (1024->1024, 3x1, 1,1, SAME,0, 1, 1) (with bias)
    Res(2): ReLU
    Res(3): Dropout (0.150000)
    Res(4): LayerNorm ( axis : { 0 1 2 } , size : -1)
    Res(5): Conv2D (1024->1024, 3x1, 1,1, SAME,0, 1, 1) (with bias)
    Res(6): ReLU
    Res(7): Dropout (0.150000)
    Res(8): LayerNorm ( axis : { 0 1 2 } , size : -1)
    Res(9): Conv2D (1024->1024, 3x1, 1,1, SAME,0, 1, 1) (with bias) with scale (before layer is applied) 0.70711;
    Res(10): Output;
    (11): ReLU
    (12): Dropout (0.150000)
    (13): LayerNorm ( axis : { 0 1 2 } , size : -1)
    (14): 
    Res(0): Input; skip connection to output 
    Res(1): Conv2D (1024->1024, 3x1, 1,1, SAME,0, 1, 1) (with bias)
    Res(2): ReLU
    Res(3): Dropout (0.150000)
    Res(4): LayerNorm ( axis : { 0 1 2 } , size : -1)
    Res(5): Conv2D (1024->1024, 3x1, 1,1, SAME,0, 1, 1) (with bias)
    Res(6): ReLU
    Res(7): Dropout (0.150000)
    Res(8): LayerNorm ( axis : { 0 1 2 } , size : -1)
    Res(9): Conv2D (1024->1024, 3x1, 1,1, SAME,0, 1, 1) (with bias) with scale (before layer is applied) 0.70711;
    Res(10): Output;
    (15): ReLU
    (16): Dropout (0.150000)
    (17): LayerNorm ( axis : { 0 1 2 } , size : -1)
    (18): Pool2D-max (2x1, 2,1, 0,0)
    (19): 
    Res(0): Input; skip connection to output 
    Res(1): Conv2D (1024->1024, 3x1, 1,1, SAME,0, 1, 1) (with bias)
    Res(2): ReLU
    Res(3): Dropout (0.150000)
    Res(4): LayerNorm ( axis : { 0 1 2 } , size : -1)
    Res(5): Conv2D (1024->1024, 3x1, 1,1, SAME,0, 1, 1) (with bias)
    Res(6): ReLU
    Res(7): Dropout (0.150000)
    Res(8): LayerNorm ( axis : { 0 1 2 } , size : -1)
    Res(9): Conv2D (1024->1024, 3x1, 1,1, SAME,0, 1, 1) (with bias) with scale (before layer is applied) 0.70711;
    Res(10): Output;
    (20): ReLU
    (21): Dropout (0.150000)
    (22): LayerNorm ( axis : { 0 1 2 } , size : -1)
    (23): 
    Res(0): Input; skip connection to output 
    Res(1): Conv2D (1024->1024, 3x1, 1,1, SAME,0, 1, 1) (with bias)
    Res(2): ReLU
    Res(3): Dropout (0.150000)
    Res(4): LayerNorm ( axis : { 0 1 2 } , size : -1)
    Res(5): Conv2D (1024->1024, 3x1, 1,1, SAME,0, 1, 1) (with bias)
    Res(6): ReLU
    Res(7): Dropout (0.150000)
    Res(8): LayerNorm ( axis : { 0 1 2 } , size : -1)
    Res(9): Conv2D (1024->1024, 3x1, 1,1, SAME,0, 1, 1) (with bias) with scale (before layer is applied) 0.70711;
    Res(10): Output;
    (24): ReLU
    (25): Dropout (0.150000)
    (26): LayerNorm ( axis : { 0 1 2 } , size : -1)
    (27): 
    Res(0): Input; skip connection to output 
    Res(1): Conv2D (1024->1024, 3x1, 1,1, SAME,0, 1, 1) (with bias)
    Res(2): ReLU
    Res(3): Dropout (0.150000)
    Res(4): LayerNorm ( axis : { 0 1 2 } , size : -1)
    Res(5): Conv2D (1024->1024, 3x1, 1,1, SAME,0, 1, 1) (with bias)
    Res(6): ReLU
    Res(7): Dropout (0.150000)
    Res(8): LayerNorm ( axis : { 0 1 2 } , size : -1)
    Res(9): Conv2D (1024->1024, 3x1, 1,1, SAME,0, 1, 1) (with bias) with scale (before layer is applied) 0.70711;
    Res(10): Output;
    (28): ReLU
    (29): Dropout (0.150000)
    (30): LayerNorm ( axis : { 0 1 2 } , size : -1)
    (31): 
    Res(0): Input; skip connection to output 
    Res(1): Conv2D (1024->1024, 3x1, 1,1, SAME,0, 1, 1) (with bias)
    Res(2): ReLU
    Res(3): Dropout (0.150000)
    Res(4): LayerNorm ( axis : { 0 1 2 } , size : -1)
    Res(5): Conv2D (1024->1024, 3x1, 1,1, SAME,0, 1, 1) (with bias)
    Res(6): ReLU
    Res(7): Dropout (0.150000)
    Res(8): LayerNorm ( axis : { 0 1 2 } , size : -1)
    Res(9): Conv2D (1024->1024, 3x1, 1,1, SAME,0, 1, 1) (with bias) with scale (before layer is applied) 0.70711;
    Res(10): Output;
    (32): ReLU
    (33): Dropout (0.150000)
    (34): LayerNorm ( axis : { 0 1 2 } , size : -1)
    (35): Pool2D-max (2x1, 2,1, 0,0)
    (36): 
    Res(0): Input; skip connection to output 
    Res(1): Conv2D (1024->1024, 3x1, 1,1, SAME,0, 1, 1) (with bias)
    Res(2): ReLU
    Res(3): Dropout (0.150000)
    Res(4): LayerNorm ( axis : { 0 1 2 } , size : -1)
    Res(5): Conv2D (1024->1024, 3x1, 1,1, SAME,0, 1, 1) (with bias)
    Res(6): ReLU
    Res(7): Dropout (0.150000)
    Res(8): LayerNorm ( axis : { 0 1 2 } , size : -1)
    Res(9): Conv2D (1024->1024, 3x1, 1,1, SAME,0, 1, 1) (with bias) with scale (before layer is applied) 0.70711;
    Res(10): Output;
    (37): ReLU
    (38): Dropout (0.150000)
    (39): LayerNorm ( axis : { 0 1 2 } , size : -1)
    (40): Conv2D (1024->2048, 3x1, 1,1, SAME,0, 1, 1) (with bias)
    (41): ReLU
    (42): Dropout (0.150000)
    (43): LayerNorm ( axis : { 0 1 2 } , size : -1)
    (44): 
    Res(0): Input; skip connection to output 
    Res(1): Conv2D (2048->2048, 3x1, 1,1, SAME,0, 1, 1) (with bias)
    Res(2): ReLU
    Res(3): Dropout (0.200000)
    Res(4): LayerNorm ( axis : { 0 1 2 } , size : -1)
    Res(5): Conv2D (2048->2048, 3x1, 1,1, SAME,0, 1, 1) (with bias)
    Res(6): ReLU
    Res(7): Dropout (0.200000)
    Res(8): LayerNorm ( axis : { 0 1 2 } , size : -1)
    Res(9): Conv2D (2048->2048, 3x1, 1,1, SAME,0, 1, 1) (with bias) with scale (before layer is applied) 0.70711;
    Res(10): Output;
    (45): ReLU
    (46): Dropout (0.200000)
    (47): LayerNorm ( axis : { 0 1 2 } , size : -1)
    (48): 
    Res(0): Input; skip connection to output 
    Res(1): Conv2D (2048->2048, 3x1, 1,1, SAME,0, 1, 1) (with bias)
    Res(2): ReLU
    Res(3): Dropout (0.200000)
    Res(4): LayerNorm ( axis : { 0 1 2 } , size : -1)
    Res(5): Conv2D (2048->2048, 3x1, 1,1, SAME,0, 1, 1) (with bias)
    Res(6): ReLU
    Res(7): Dropout (0.200000)
    Res(8): LayerNorm ( axis : { 0 1 2 } , size : -1)
    Res(9): Conv2D (2048->2048, 3x1, 1,1, SAME,0, 1, 1) (with bias) with scale (before layer is applied) 0.70711;
    Res(10): Output;
    (49): ReLU
    (50): Dropout (0.200000)
    (51): LayerNorm ( axis : { 0 1 2 } , size : -1)
    (52): Conv2D (2048->2304, 3x1, 1,1, SAME,0, 1, 1) (with bias)
    (53): ReLU
    (54): Dropout (0.200000)
    (55): LayerNorm ( axis : { 0 1 2 } , size : -1)
    (56): Pool2D-max (2x1, 2,1, 0,0)
    (57): 
    Res(0): Input; skip connection to output 
    Res(1): Conv2D (2304->2304, 3x1, 1,1, SAME,0, 1, 1) (with bias)
    Res(2): ReLU
    Res(3): Dropout (0.250000)
    Res(4): LayerNorm ( axis : { 0 1 2 } , size : -1)
    Res(5): Conv2D (2304->2304, 3x1, 1,1, SAME,0, 1, 1) (with bias)
    Res(6): ReLU
    Res(7): Dropout (0.250000)
    Res(8): LayerNorm ( axis : { 0 1 2 } , size : -1)
    Res(9): Conv2D (2304->2304, 3x1, 1,1, SAME,0, 1, 1) (with bias) with scale (before layer is applied) 0.70711;
    Res(10): Output;
    (58): ReLU
    (59): Dropout (0.250000)
    (60): LayerNorm ( axis : { 0 1 2 } , size : -1)
    (61): 
    Res(0): Input; skip connection to output 
    Res(1): Conv2D (2304->2304, 3x1, 1,1, SAME,0, 1, 1) (with bias)
    Res(2): ReLU
    Res(3): Dropout (0.250000)
    Res(4): LayerNorm ( axis : { 0 1 2 } , size : -1)
    Res(5): Conv2D (2304->2304, 3x1, 1,1, SAME,0, 1, 1) (with bias)
    Res(6): ReLU
    Res(7): Dropout (0.250000)
    Res(8): LayerNorm ( axis : { 0 1 2 } , size : -1)
    Res(9): Conv2D (2304->2304, 3x1, 1,1, SAME,0, 1, 1) (with bias) with scale (before layer is applied) 0.70711;
    Res(10): Output;
    (62): ReLU
    (63): Dropout (0.250000)
    (64): LayerNorm ( axis : { 0 1 2 } , size : -1)
    (65): Conv2D (2304->2304, 3x1, 1,1, SAME,0, 1, 1) (with bias)
    (66): ReLU
    (67): LayerNorm ( axis : { 0 1 2 } , size : -1)
    (68): Dropout (0.250000)
    (69): Conv2D (2304->9998, 1x1, 1,1, SAME,0, 1, 1) (with bias)
    (70): Reorder (2,0,3,1)
I0716 11:23:01.429167 23719 Train.cpp:253] [Network Params: 306268510]
I0716 11:23:01.429203 23719 Train.cpp:254] [Criterion] ConnectionistTemporalClassificationCriterion
I0716 11:23:01.566375 23719 Train.cpp:262] [Network Optimizer] SGD (momentum=0.6)
I0716 11:23:01.566393 23719 Train.cpp:263] [Criterion Optimizer] SGD
davidbelle commented 4 years ago

@tlikhomanenko update!

I implemented the code found in #507 and lowered the lr to 0.1; the graph is looking closer to how it should be, although it's still not quite there yet. I'll keep playing around with hyper-parameters. Thanks.

image

tlikhomanenko commented 4 years ago

Ok, now the loss on dev and train looks much better. However, the WER is really weird. Can you run Test.cpp to check that the output on your data is reasonable? It seems like you could have some errors in the lexicon/tokens, but I'm not sure.

About hacking the model: the method model->modules() returns a vector of all layers. So you need to create a Sequential model and add to it all the modules you need. For example, you can skip the specaug layer by inserting only modules()[1:], or change the last layer by taking modules()[:-1] and then creating a new linear layer and adding it to the sequential model. You should do this right after the model is loaded, but make sure to do it only once, on the very first load.
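For reference, a hedged C++ sketch of that surgery (a sketch, not verbatim recipe code), assuming flashlight's fl::Sequential API and the `network` pointer loaded in Train.cpp; `numTokens` is a placeholder for the size of the (unchanged) token set:

// Rebuild the loaded network, keeping all layers except the final token layer,
// then append a freshly initialized output layer. Run this once, right after the
// pretrained model is first loaded (not when resuming training).
auto oldSeq = std::dynamic_pointer_cast<fl::Sequential>(network);
auto newSeq = std::make_shared<fl::Sequential>();
auto mods = oldSeq->modules();
// Start the loop at 1 instead of 0 to also drop the SpecAugment layer.
for (size_t i = 0; i + 1 < mods.size(); ++i) {
  newSeq->add(mods[i]);
}
// The CTC resnet's last layer is the 1x1 Conv2D mapping 2304 -> number of tokens,
// so the replacement keeps the same shape; numTokens stands for that count.
newSeq->add(std::make_shared<fl::Conv2D>(2304, numTokens, 1, 1));
network = newSeq;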

davidbelle commented 4 years ago

@tlikhomanenko

I inspected my lexicon file and found that I was essentially doubling up on entries, so I have now recreated it with unique, sorted entries. However, the reason the WER looked off is that I was plotting the LER by accident, my mistake. I also changed the ratio of training to testing data from 8:2 to 7:3, which I think has helped.

image

Thanks for the tip regarding modifying the model, big help! Working on it now.

tlikhomanenko commented 4 years ago

Still, it is strange that the WER is going up for train and valid rather than decreasing. Both should decrease during training (or, toward the end of training, we may see the loss on valid go up while the WER on valid continues to go down). You need to recheck that the WER is computed correctly. Again, just run Test.cpp on an intermediate snapshot (I guess the output will be random).

davidbelle commented 4 years ago

Hi @tlikhomanenko

I let it train overnight just to see how it would change (I also set --reportiters back to 1000; I had changed it to 10 just so I could see what it was doing quicker).

Here's the graph: image

Here's the output of Test on the generated model:

/root/wav2letter/build/Test --am /1tb/training/model/tnvdata/001_model_train.valid.list.bin --test /1tb/training/text/train.test.list --tokensdir /1tb/models/production/ --tokens librispeech-train-all-unigram-10000.tokens --lexicon /1tb/training/fairseq/decoder.lexicon --datadir '' --maxload 10
I0718 01:46:05.182943 28177 Test.cpp:83] Gflags after parsing 
--flagfile=; --fromenv=; --tryfromenv=; --undefok=; --tab_completion_columns=80; --tab_completion_word=; --help=false; --helpfull=false; --helpmatch=; --helpon=; --helppackage=false; --helpshort=false; --helpxml=false; --version=false; --adambeta1=0.90000000000000002; --adambeta2=0.999; --am=/1tb/training/model/tnvdata/001_model_train.valid.list.bin; --am_decoder_tr_dropout=0; --am_decoder_tr_layerdrop=0; --am_decoder_tr_layers=1; --arch=am_resnet_ctc.arch; --archdir=/1tb/models/arch; --attention=content; --attentionthreshold=0; --attnWindow=no; --attnconvchannel=0; --attnconvkernel=0; --attndim=0; --batchsize=1; --beamsize=2500; --beamsizetoken=250000; --beamthreshold=25; --blobdata=false; --channels=1; --criterion=ctc; --critoptim=sgd; --datadir=; --dataorder=input; --decoderattnround=1; --decoderdropout=0; --decoderrnnlayer=1; --decodertype=wrd; --devwin=0; --emission_dir=; --emission_queue_size=3000; --enable_distributed=true; --encoderdim=0; --eosscore=0; --eostoken=false; --everstoredb=false; --fftcachesize=1; --filterbanks=80; --flagsfile=/1tb/training/train.cfg; --framesizems=25; --framestridems=10; --gamma=1; --gumbeltemperature=1; --input=flac; --inputbinsize=100; --inputfeeding=false; --isbeamdump=false; --iter=87000; --itersave=false; --labelsmooth=0.050000000000000003; --leftWindowSize=50; --lexicon=/1tb/training/fairseq/decoder.lexicon; --linlr=-1; --linlrcrit=-1; --linseg=0; --lm=; --lm_memory=5000; --lm_vocab=; --lmtype=kenlm; --lmweight=0; --localnrmlleftctx=0; --localnrmlrightctx=0; --logadd=false; --lr=0.10000000000000001; --lr_decay=9223372036854775807; --lr_decay_step=9223372036854775807; --lrcosine=true; --lrcrit=0; --max_devices_per_node=8; --maxdecoderoutputlen=200; --maxgradnorm=1; --maxisz=9223372036854775807; --maxload=10; --maxrate=10; --maxsil=50; --maxtsz=9223372036854775807; --maxword=-1; --melfloor=1; --memstepsize=10485760; --mfcc=false; --mfcccoeffs=13; --mfsc=true; --minisz=200; --minrate=3; --minsil=0; --mintsz=2; --momentum=0.59999999999999998; --netoptim=sgd; --noresample=false; --nthread=4; --nthread_decoder=1; --nthread_decoder_am_forward=1; --numattnhead=8; --onorm=target; --optimepsilon=1e-08; --optimrho=0.90000000000000002; --outputbinsize=5; --pctteacherforcing=100; --pcttraineval=100; --pow=false; --pretrainWindow=0; --replabel=0; --reportiters=1000; --rightWindowSize=50; --rndv_filepath=; --rundir=/1tb/training/model; --runname=tnvdata; --samplerate=16000; --sampletarget=0.01; --samplingstrategy=rand; --saug_fmaskf=27; --saug_fmaskn=2; --saug_start_update=-1; --saug_tmaskn=2; --saug_tmaskp=1; --saug_tmaskt=100; --sclite=; --seed=0; --show=false; --showletters=false; --silscore=0; --smearing=none; --smoothingtemperature=1; --softwoffset=10; --softwrate=5; --softwstd=5; --sqnorm=true; --stepsize=1000000; --surround=; --tag=; --target=ltr; --test=/1tb/training/text/train.test.list; --tokens=librispeech-train-all-unigram-10000.tokens; --tokensdir=/1tb/models/production/; --train=train.test.list; --trainWithWindow=false; --transdiag=0; --unkscore=-inf; --use_memcache=false; --uselexicon=true; --usewordpiece=true; --valid=train.valid.list; --warmup=1; --weightdecay=0; --wordscore=0; --wordseparator=_; --world_rank=0; --world_size=1; --alsologtoemail=; --alsologtostderr=false; --colorlogtostderr=false; --drop_log_memory=true; --log_backtrace_at=; --log_dir=; --log_link=; --log_prefix=true; --logbuflevel=0; --logbufsecs=30; --logemaillevel=999; --logfile_mode=436; --logmailer=/bin/mail; --logtostderr=true; --max_log_size=1800; --minloglevel=0; 
--stderrthreshold=2; --stop_logging_if_full_disk=false; --symbolize_stacktrace=true; --v=0; --vmodule=; 
I0718 01:46:05.186667 28177 Test.cpp:104] Number of classes (network): 9998
I0718 01:46:06.506603 28177 Test.cpp:111] Number of words: 200367
I0718 01:46:06.990355 28177 W2lListFilesDataset.cpp:155] 522 files found. 
I0718 01:46:06.990393 28177 Utils.cpp:102] Filtered 0/522 samples
I0718 01:46:06.990453 28177 W2lListFilesDataset.cpp:62] Total batches (i.e. iters): 522
I0718 01:46:06.990523 28177 Test.cpp:131] [Dataset] Dataset loaded.
I0718 01:46:10.880427 28177 Test.cpp:317] ------
I0718 01:46:10.880448 28177 Test.cpp:318] [Test /1tb/training/text/train.test.list (10 samples) in 3.88988s (actual decoding time 0.389s/sample) -- WER: 95.8963, LER: 67.5331]

I believe it might be that the test samples aren't suitable given the training data. I'll investigate further and report back.

tlikhomanenko commented 4 years ago

The problem is not even in test; the problem is your train WER, which shouldn't be around 100%. You need to debug why, with the loss decreasing, you still have a high train WER (so far it isn't training).

gkucsko commented 3 years ago

Would there be interest in integrating pyctcdecode (https://github.com/kensho-technologies/pyctcdecode)? It supports inference-time 'hotwords' as well as a few other things out of the box during decoding.

tlikhomanenko commented 3 years ago

Sounds interesting if we can improve the decoder to support more features. However, be careful with the interface API to preserve its generality. Any PR is welcome.