flashlight / wav2letter

Facebook AI Research's Automatic Speech Recognition Toolkit
https://github.com/facebookresearch/wav2letter/wiki

Fine tune CTC model #947

Closed sciai-ai closed 3 years ago

sciai-ai commented 3 years ago

I want to adapt a wav2letter model to my training dataset. I am following the Colab notebook FineTuneCTC.ipynb, which uses

./flashlight/build/bin/asr/fl_asr_tutorial_finetune_ctc model.bin

I have a few questions around this approach:

  1. My dataset is also English but has different pronunciations for certain words; e.g., "carpark" is pronounced "capak". There are also speakers in my dataset who use a more American English pronunciation and will say "carpark". I want the ASR system to be able to pick up on both variations.
  2. There are also new words in my dataset, all based on the same token set [a-z] that the acoustic model was trained on.

So given 1 and 2, does it make sense for me to use FineTuneCTC.ipynb directly, or are there additional steps I need to do here?

Is this the best model to use in my case, or should I use the SOTA 2019 model?

Thanks!

tlikhomanenko commented 3 years ago

Please use the recently released RASR models, as they are trained on more diverse data and should generalize to your data much better.

About pronunciation: the model is trained on letters, not phonemes, so if you have both variations in the training data the model will learn both pronunciations (also, if one variant was learned by the pretrained model and you finetune on the second, the model will most probably still be able to recognize both).

For new words it is fine; finetuning is exactly for this case too, to adapt to the new words, speakers, pronunciations, and domain shift you have in your data.

No extra steps are needed, just finetune on your data.

sciai-ai commented 3 years ago

Thanks @tlikhomanenko for your detailed reply.

About pronunciation: the model is trained on letters, not phonemes, so if you have both variations in the training data the model will learn both pronunciations (also, if one variant was learned by the pretrained model and you finetune on the second, the model will most probably still be able to recognize both).

My training data has both variants acoustically. Do I need to have two entries in the lexicon then, e.g.,

carpark | c a r p a r k
carpark | c a p a k

  1. How should I generate the lexicon file for my training data? Do I need to manually make two or more entries for words with several pronunciations? My initial plan was to take all unique words in my training set and generate the lexicon file by introducing spacing between the letters, but that would give only one entry in the lexicon for each word.

  2. Will the model still pick up on the pronunciation even if I have just one entry for it in the lexicon for my training data?

tlikhomanenko commented 3 years ago

No, this is one of the main advantages of training letter-based or word-piece models: you don't need a pronunciation dictionary, just the letter or word-piece spelling. So in your case just add `carpark c a r p a r k |` into the lexicon file.

2. Will the model still pick up on the pronunciation even if I have just one entry for it in the lexicon for my training data?

Yep, this is the nice thing: the model is trained to output letters, so it will learn all pronunciations internally.
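For concreteness, a single letter-based entry looks like `carpark c a r p a r k |`, and a lexicon for the whole training set can be generated from the transcripts. Below is a minimal sketch, assuming the standard list format where each line is `id path duration_ms transcript ...` (file names are placeholders):

```bash
# Sketch: build a letter-based lexicon from the transcripts in train.lst.
# Assumes the transcript starts at the 4th space-separated column of each list line.
cut -d' ' -f4- train.lst \
  | tr ' ' '\n' \
  | sort -u \
  | grep -v '^$' \
  | awk '{ s = $1; gsub(/./, "& ", s); print $1 "\t" s "|" }' \
  > lexicon.txt
# Each output line looks like:  carpark<TAB>c a r p a r k |
```

As noted above, there is no need for a second entry for the "capak" variant; the letter spelling covers both pronunciations.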

sciai-ai commented 3 years ago

That’s great to hear.

  1. Am I right to say that in this case no language model is involved, since in the finetune script you only need to specify a --lexicon and --tokens file?
  2. What happens if one pronunciation is more likely to occur in certain contexts than the other? I was wondering if an n-gram model might be useful in such a situation.
  3. Once I have finetuned a model on my own data, can I still add new words to the lexicon without having any acoustic data for them, for example domain-specific words?

tlikhomanenko commented 3 years ago

That’s great to hear.

  1. Am I right to say that in this case no language model is involved, since in the finetune script you only need to specify a --lexicon and --tokens file?

yes

  1. What happens if one pronunciation is more likely to occur in certain contexts than the other? I was wondering if an n-gram model might be useful in such a situation.

After finetuning you can still decode with a language model, so it can fix errors from the context.

  1. Once I have finetuned a model on my own data, can I still add new words to the lexicon without having any acoustic data for them, for example domain-specific words?

Yep, you can add them to the lexicon with which you run beam-search decoding; the decoder can then generate these words (especially if your language model was trained on them).

sciai-ai commented 3 years ago

Thanks @tlikhomanenko

  1. What happens if one pronunciation is more likely to occur in certain contexts than the other? I was wondering if an n-gram model might be useful in such a situation.

After finetuning you can still decode with a language model, so it can fix errors from the context.

  1. Once I have finetuned a model on my own data, can I still add new words to the lexicon without having any acoustic data for them, for example domain-specific words?

Yep, you can add them to the lexicon with which you run beam-search decoding; the decoder can then generate these words (especially if your language model was trained on them).

  1. I have tried running the finetuneCTC script on my data

116 FinetuneCTC.cpp:285] [Network Optimizer] SGD (momentum=0.8)
I0210 10:39:02.771209 2116 FinetuneCTC.cpp:536] Shuffling trainset
I0210 10:39:02.771793 2116 FinetuneCTC.cpp:543] Epoch 1 started!
I0210 10:39:13.805249 2116 FinetuneCTC.cpp:536] Shuffling trainset
I0210 10:39:13.805560 2116 FinetuneCTC.cpp:543] Epoch 2 started!
I0210 10:39:23.783844 2116 FinetuneCTC.cpp:536] Shuffling trainset
I0210 10:39:23.784200 2116 FinetuneCTC.cpp:543] Epoch 3 started!
^C^C

However, it did not create a 001_model_dev.bin file in the checkpoint directory. The only files there are 001_config and 001_log.

tlikhomanenko commented 3 years ago

Thanks @tlikhomanenko

  1. What happens if one pronunciation is more likely to occur in certain contexts than the other? I was wondering if an n-gram model might be useful in such a situation.

After finetuning you can still decode with a language model, so it can fix errors from the context.

  1. Once I have finetuned a model on my own data, can I still add new words to the lexicon without having any acoustic data for them, for example domain-specific words?

Yep, you can add them to the lexicon with which you run beam-search decoding; the decoder can then generate these words (especially if your language model was trained on them).

  • Does that mean I need sentences to build the language model for the new words, or is there a simpler way to just add new words directly to the lexicon file?

Not necessarily; you can first try just adding them to the lexicon (the LM will predict them as unk words, but maybe that is still ok for inference). The second option is to add them to the LM training data and retrain the LM.

  • Also I was wondering whether I can take a prebuilt language model such as lm_common_crawl_small_4gram_prun0-6-15_200kvocab.bin and then expand it by adding new sentences to it? I don't have the sentences on which that language model was trained, so I cannot retrain it from scratch with my own sentences included.

Here you probably need to do interpolation between LMs: take the LM trained on Common Crawl and an LM trained on your data and interpolate between them. Let me add the ARPA file to the same path as the bin file; then with the SRILM toolkit you can interpolate the two LMs in ARPA format.
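As a concrete sketch of that interpolation step (assuming SRILM is installed, both models are available as ARPA files, and the 0.8 mixture weight is just a placeholder to tune on held-out text):

```bash
# Sketch: interpolate the Common Crawl 4-gram with a small in-domain LM using SRILM.
# -lambda is the weight given to the first (-lm) model; tune it on a dev set.
ngram -order 4 \
      -lm lm_common_crawl_small_4gram_prun0-6-15_200kvocab.arpa \
      -mix-lm my_domain_4gram.arpa \
      -lambda 0.8 \
      -write-lm mixed_4gram.arpa

# Optional: convert the interpolated ARPA to a KenLM binary for faster loading
# (assumes KenLM's build_binary tool is available).
build_binary mixed_4gram.arpa mixed_4gram.bin
```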

  1. I have tried running the finetuneCTC script on my data

116 FinetuneCTC.cpp:285] [Network Optimizer] SGD (momentum=0.8)
I0210 10:39:02.771209 2116 FinetuneCTC.cpp:536] Shuffling trainset
I0210 10:39:02.771793 2116 FinetuneCTC.cpp:543] Epoch 1 started!
I0210 10:39:13.805249 2116 FinetuneCTC.cpp:536] Shuffling trainset
I0210 10:39:13.805560 2116 FinetuneCTC.cpp:543] Epoch 2 started!
I0210 10:39:23.783844 2116 FinetuneCTC.cpp:536] Shuffling trainset
I0210 10:39:23.784200 2116 FinetuneCTC.cpp:543] Epoch 3 started!
^C^C

However, it did not create a 001_model_dev.bin file in the checkpoint directory. The only files there are 001_config and 001_log.

Check --reportiters and try to set it to a smaller value; probably you have a large number of updates before eval is called. Models are saved only after evaluation is done, which happens every reportiters updates (=0 means every epoch, so you can set --reportiters=0).
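For reference, this is just a matter of appending the flag to whatever finetune command is being used, e.g. (a sketch; flag names follow the commands used elsewhere in this thread, and all paths are placeholders):

```bash
# Sketch: validate (and therefore checkpoint) once per epoch instead of rarely.
./flashlight/build/bin/asr/fl_asr_tutorial_finetune_ctc model.bin \
    --datadir=/data --train=train.lst --valid=dev:dev.lst \
    --tokens=tokens.txt --lexicon=lexicon.txt --rundir=checkpoint \
    --reportiters=0   # 0 = run validation and save a checkpoint at the end of every epoch
```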

sciai-ai commented 3 years ago

Thanks @tlikhomanenko

  1. What happens if one pronunciation is more likely to occur in certain contexts than the other? I was wondering if an n-gram model might be useful in such a situation.

After finetuning you can still decode with a language model, so it can fix errors from the context.

  1. Once I have finetuned a model on my own data, can I still add new words to the lexicon without having any acoustic data for them, for example domain-specific words?

Yep, you can add them to the lexicon with which you run beam-search decoding; the decoder can then generate these words (especially if your language model was trained on them).

  • Does that mean I need sentences to build the language model for the new words, or is there a simpler way to just add new words directly to the lexicon file?

Not necessarily; you can first try just adding them to the lexicon (the LM will predict them as unk words, but maybe that is still ok for inference). The second option is to add them to the LM training data and retrain the LM.

I ran the RASR CTC model on my speech dataset, which was generated from random sentences modeled around specific names, locations, etc. I added one word, 'kue', manually to the lexicon and used the test script with Viterbi decoding. This is the output for one such example:

|T|: i | l i k e | k u e | l u p i s | v e r y | m u c h
|P|: i | l i k e | q u e l a p e s | v e r y | m u c h

With the language model (lm_common_crawl_small_4gram_prun0-6-15_200kvocab.bin) it gave:

|p|: i | l i k e | q u e l | a p e s | v e r y | m u c h

It seems that the LM is not predicting it as UNK but rather still predicting 'quel' for it. I tried fine-tuning too, but it did not improve the results much.

  1. What can I do to add new words which may not be in train.lst at the fine-tuning step?
  2. I also looked at #737, but I guess this is not applicable to the RASR model as there is no .model file for RASR?
  • Also I was wondering whether I can take a prebuilt language model such as lm_common_crawl_small_4gram_prun0-6-15_200kvocab.bin and then expand it by adding new sentences to it? I don't have the sentences on which that language model was trained, so I cannot retrain it from scratch with my own sentences included.

Here you probably need to do interpolation between LMs: take the LM trained on Common Crawl and an LM trained on your data and interpolate between them. Let me add the ARPA file to the same path as the bin file; then with the SRILM toolkit you can interpolate the two LMs in ARPA format.

  1. I tried downloading the ARPA model from the path lm_common_crawl_small_4gram_prun0-6-15_200kvocab.arpa but did not find it yet.

  2. Is it useful to make a language model from the same sentences that were used to train the acoustic model? My concern is that my dataset's sentences are randomly generated around names etc. and may not be good from a perplexity perspective. Is there any consideration of the perplexity of the two LMs when doing interpolation?

  1. I have tried running the finetuneCTC script on my data

116 FinetuneCTC.cpp:285] [Network Optimizer] SGD (momentum=0.8)
I0210 10:39:02.771209 2116 FinetuneCTC.cpp:536] Shuffling trainset
I0210 10:39:02.771793 2116 FinetuneCTC.cpp:543] Epoch 1 started!
I0210 10:39:13.805249 2116 FinetuneCTC.cpp:536] Shuffling trainset
I0210 10:39:13.805560 2116 FinetuneCTC.cpp:543] Epoch 2 started!
I0210 10:39:23.783844 2116 FinetuneCTC.cpp:536] Shuffling trainset
I0210 10:39:23.784200 2116 FinetuneCTC.cpp:543] Epoch 3 started!
^C^C

However, it did not create a 001_model_dev.bin file in the checkpoint directory. The only files there are 001_config and 001_log.

Check --reportiters and try to set it to a smaller value; probably you have a large number of updates before eval is called. Models are saved only after evaluation is done, which happens every reportiters updates (=0 means every epoch, so you can set --reportiters=0).

Yes, that solved the problem; it had been set to 100. I was finally able to finetune on my dataset, and it did improve the WER after a few iterations, to about 10%, and with a language model to 8%, which is really good. This speech dataset was based on sentences inspired by LibriSpeech and other open-source corpora, and as such did not include any special words missing from the lexicon, so the ASR performance is very good even with no finetuning, meaning that the local accent is covered pretty well.

tlikhomanenko commented 3 years ago

Thanks @tlikhomanenko

  1. What happens if one pronunciation is more likely to occur in certain contexts than the other? I was wondering if an n-gram model might be useful in such a situation.

After finetuning you can still decode with a language model, so it can fix errors from the context.

  1. Once I have finetuned a model on my own data, can I still add new words to the lexicon without having any acoustic data for them, for example domain-specific words?

Yep, you can add them to the lexicon with which you run beam-search decoding; the decoder can then generate these words (especially if your language model was trained on them).

  • Does that mean I need sentences to build the language model for the new words, or is there a simpler way to just add new words directly to the lexicon file?

Not necessarily; you can first try just adding them to the lexicon (the LM will predict them as unk words, but maybe that is still ok for inference). The second option is to add them to the LM training data and retrain the LM.

I ran the RASR CTC model on my speech dataset, which was generated from random sentences modeled around specific names, locations, etc. I also looked at #737, but I guess this is not applicable to the RASR model as there is no .model file for RASR? I added one word, 'kue', manually to the lexicon and used the test script with Viterbi decoding on this dataset. This is the output for one such example:

|T|: i | l i k e | k u e | l u p i s | v e r y | m u c h
|P|: i | l i k e | q u e l a p e s | v e r y | m u c h

With the language model (lm_common_crawl_small_4gram_prun0-6-15_200kvocab.bin) it gave:

|p|: i | l i k e | q u e l | a p e s | v e r y | m u c h

It seems that the LM is not predicting it as UNK but rather still predicting 'quel' for it.

  1. What can I do to add new words which may not be in train.lst at the fine-tuning step?

If they were not used during finetuning, then you can fix this only with beam-search decoding, by including them in the lexicon and in the LM. Another way is to do recognition allowing unk word predictions and then resolve those unk predictions with post-processing.
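A sketch of such a beam-search decoding run is below, assuming the `fl_asr_decode` binary and its usual decoder flags; all paths and weight values are placeholders to be tuned:

```bash
# Sketch: lexicon + KenLM beam-search decoding. New words must be present in
# lexicon.txt (and ideally in the LM) for the decoder to be able to emit them.
./flashlight/build/bin/asr/fl_asr_decode \
    --am=checkpoint/001_model_dev.bin \
    --datadir=/data --test=test.lst \
    --tokens=tokens.txt --lexicon=lexicon.txt \
    --lmtype=kenlm --lm=mixed_4gram.bin --decodertype=wrd \
    --lmweight=2.0 --wordscore=0.5 \
    --beamsize=100 --beamthreshold=20
```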

  1. Is it useful to make a language model from the same sentences that were used to train the acoustic model? My concern is that my dataset's sentences are randomly generated around names etc. and may not be good from a perplexity perspective.

I would expect this to be very bad. One way is to add only unigram predictions: here you can interpolate between the Common Crawl LM and a unigram LM over your entity words.

I would suggest searching for papers on this topic, as I don't have deep expertise with direct methods for this problem. I am sure other teams have plenty of ideas on how to improve entity recognition (one way is to detect the positions of entities and then do separate recognition of them; maybe there is something more advanced and nicer).

  • Also I was wondering whether I can take a prebuilt language model such as lm_common_crawl_small_4gram_prun0-6-15_200kvocab.bin and then expand it by adding new sentences to it? I don't have the sentences on which that language model was trained, so I cannot retrain it from scratch with my own sentences included.

Here you probably need to do interpolation between LMs: take the LM trained on Common Crawl and an LM trained on your data and interpolate between them. Let me add the ARPA file to the same path as the bin file; then with the SRILM toolkit you can interpolate the two LMs in ARPA format.

  1. So must the LM for my data be based on the same sentences used to train the AM, or can it include other sentences too? My understanding is that the LM corpus can be different from the one used to train the acoustic model, and that it is better to use an LM trained on a much larger corpus.

Yep, correct. It is better to train the LM on data other than the AM data (though you can include your AM data in LM training too). And of course, the better the LM (which also means more data to train on), the better the decoding you will get.
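If you do train an in-domain LM, a minimal SRILM sketch (one normalized sentence per line in a hypothetical domain_text.txt) could look like this:

```bash
# Sketch: train a small in-domain 4-gram on your own sentences with SRILM.
# On very small corpora, Kneser-Ney discounting can fail; switch discounting if so.
ngram-count -order 4 -text domain_text.txt \
            -kndiscount -interpolate \
            -lm my_domain_4gram.arpa
```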

  1. Please provide me with the path to the Common Crawl ARPA model, I will try interpolation.

See the link above.

Is there any consideration of the perplexity of the two LMs when doing interpolation?

It depends. It could also be that the interpolated LM is worse than each separate LM on its own in-domain corpus, while the generalization of the interpolated LM is better. It depends on which data you measure on and how.
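One way to quantify that is to score a held-out in-domain text with each model (a sketch with SRILM again; dev_sentences.txt is a placeholder):

```bash
# Sketch: compare perplexities of the individual and interpolated LMs on dev text.
ngram -order 4 -lm lm_common_crawl_small_4gram_prun0-6-15_200kvocab.arpa -ppl dev_sentences.txt
ngram -order 4 -lm my_domain_4gram.arpa -ppl dev_sentences.txt
ngram -order 4 -lm mixed_4gram.arpa -ppl dev_sentences.txt
```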

  1. I have tried running the finetuneCTC script on my data

116 FinetuneCTC.cpp:285] [Network Optimizer] SGD (momentum=0.8)
I0210 10:39:02.771209 2116 FinetuneCTC.cpp:536] Shuffling trainset
I0210 10:39:02.771793 2116 FinetuneCTC.cpp:543] Epoch 1 started!
I0210 10:39:13.805249 2116 FinetuneCTC.cpp:536] Shuffling trainset
I0210 10:39:13.805560 2116 FinetuneCTC.cpp:543] Epoch 2 started!
I0210 10:39:23.783844 2116 FinetuneCTC.cpp:536] Shuffling trainset
I0210 10:39:23.784200 2116 FinetuneCTC.cpp:543] Epoch 3 started!
^C^C

However, it did not create a 001_model_dev.bin file in the checkpoint directory. The only files there are 001_config and 001_log.

Check --reportiters and try to set it to a smaller value; probably you have a large number of updates before eval is called. Models are saved only after evaluation is done, which happens every reportiters updates (=0 means every epoch, so you can set --reportiters=0).

Yes, that solved the problem; it had been set to 100. I was finally able to finetune on my dataset, and it did improve the WER after a few iterations, to about 10%, and with a language model to 8%, which is really good. This speech dataset was based on sentences inspired by LibriSpeech and other open-source corpora, and as such did not include any special words missing from the lexicon, so the ASR performance is very good even with no finetuning, meaning that the local accent is covered pretty well.

Happy to hear!

sciai-ai commented 3 years ago

@tlikhomanenko thank you again for your detailed comments. I will have to try these different approaches to add new vocabulary. The first thing I will do is finetune the model on my training data.

My train set is 450K samples, with the test and validation sets having about 150K samples. I ran:

`mpirun -n 4 --allow-run-as-root ./flashlight/build/bin/asr/fl_asr_train fork model.bin --datadir /data --train tr.lst --valid dev:val.lst --arch arch.txt --tokens tokens.txt --lexicon lexicon.txt --rundir checkpoint --lr 0.025 --netoptim sgd --momentum 0.8 --reportiters 0 --lr_decay 100 --lr_decay_step 50 --iter 25000 --batchsize 4 --warmup 0 --enable_distributed=true > log.txt`

However, after starting epoch 1 it did not finish and produced an error. Then I ran the same command with 10K training samples and it ran fine, finishing several epochs.

  1. Is there a way I can run the code with all my training data? Maybe there are some parameters I need to tweak?

  2. Am I right that after finetuning, RASR can only infer new words if we run it in lexicon-free mode? Given that after finetuning the acoustic model has been updated with the new words, will new words only be decoded if we run it lexicon-free? Or perhaps I should add the new words to the lexicon file and then use lexicon-based decoding.

  3. I was wondering if there is any flag in flashlight/build/bin/asr/fl_asr_test to output the recognized text in the format of an lst file?

tlikhomanenko commented 3 years ago

  • Is there a way I can run the code with all my training data? Maybe there are some parameters I need to tweak?

What is the error you see when running with all the data?

  • Am I right that after finetuning, RASR can only infer new words if we run it in lexicon-free mode? Given that after finetuning the acoustic model has been updated with the new words, will new words only be decoded if we run it lexicon-free? Or perhaps I should add the new words to the lexicon file and then use lexicon-based decoding.

It can infer the new words you added during training if you also add them to the lexicon used during decoding. For purely new words: yes, only lexicon-free decoding. Even if you used some words during training but didn't add them to the lexicon during decoding, it cannot infer them, as the lexicon restricts the search.

  • I was wondering if there is any flag in flashlight/build/bin/asr/fl_asr_test to output the recognized text in the format of an lst file?

fl_asr_test produces Viterbi output, so it is just argmax and can produce any words if you use --uselexicon=false for its run. To save to a file you can use --sclite=[PATH].
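Putting the two flags together, a test run might look like the sketch below (paths are placeholders; the remaining flags follow the commands used elsewhere in this thread):

```bash
# Sketch: Viterbi transcription without the lexicon constraint, with the
# hypotheses written out via --sclite for later post-processing.
./flashlight/build/bin/asr/fl_asr_test \
    --am=checkpoint/001_model_dev.bin \
    --datadir=/data --test=test.lst \
    --tokens=tokens.txt --lexicon=lexicon.txt \
    --uselexicon=false \
    --sclite=/data/viterbi_out
```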

sciai-ai commented 3 years ago

What is the error you see when running with all the data?

This is the error I am getting, about 1 hour after epoch 1 started:

I0303 03:48:32.075875  2943 Train.cpp:1028] Epoch 1 started!
Failed to allocate memory of size 152.00 MiB (Device: 0, Capacity: 14.75 GiB, Allocated: 13.93 GiB, Cached: 721.55 MiB) with error 'ArrayFire Exception (Device out of memory:101):
ArrayFire error: 
In function fl::MemoryManagerInstaller::MemoryManagerInstaller(std::shared_ptr<fl::MemoryManagerAdapter>)::<lambda(size_t)>
In file /home/jupyter/flashlight/flashlight/fl/memory/MemoryManagerInstaller.cpp:180'
terminate called after throwing an instance of 'af::exception'
  what():  ArrayFire Exception (Device out of memory:101):
In function virtual void* MemoryManagerFunctionWrapper::alloc(bool, unsigned int, dim_t*, unsigned int)
In file src/api/c/memory.cpp:706

GPU INFO: (screenshot omitted)

full log below:

0303 03:48:28.152590  2943 Train.cpp:198] Experiment path: checkpoint
I0303 03:48:28.152596  2943 Train.cpp:199] Experiment runidx: 1
I0303 03:48:28.153079  2943 Train.cpp:272] Number of classes (network): 29
I0303 03:48:28.156661  2946 Train.cpp:272] Number of classes (network): 29
I0303 03:48:28.161386  2945 Train.cpp:272] Number of classes (network): 29
I0303 03:48:28.161387  2944 Train.cpp:272] Number of classes (network): 29
I0303 03:48:28.477886  2943 Train.cpp:279] Number of words: 200001
I0303 03:48:28.478466  2946 Train.cpp:279] Number of words: 200001
I0303 03:48:28.518005  2944 Train.cpp:279] Number of words: 200001
I0303 03:48:28.522454  2945 Train.cpp:279] Number of words: 200001
I0303 03:48:31.587702  2943 Train.cpp:454] [Network] Sequential [input -> (0) -> (1) -> (2) -> (3) -> (4) -> (5) -> (6) -> (7) -> (8) -> (9) -> (10) -> (11) -> (12) -> (13) -> (14) -> (15) -> (16) -> (17) -> (18) -> (19) -> (20) -> (21) -> (22) -> (23) -> (24) -> (25) -> (26) -> (27) -> (28) -> (29) -> (30) -> (31) -> (32) -> (33) -> (34) -> (35) -> (36) -> (37) -> (38) -> (39) -> (40) -> (41) -> (42) -> output]
    (0): View (-1 1 80 0)
    (1): LayerNorm ( axis : { 0 1 2 } , size : -1)
    (2): Conv2D (80->768, 7x1, 3,1, SAME,0, 1, 1) (with bias)
    (3): GatedLinearUnit (2)
    (4): Dropout (0.050000)
    (5): Reorder (2,0,3,1)
    (6): Transformer (nHeads: 4), (pDropout: 0.05), (pLayerdrop: 0.05), (bptt: 920), (useMask: 0), (preLayerNorm: 0)
    (7): Transformer (nHeads: 4), (pDropout: 0.05), (pLayerdrop: 0.05), (bptt: 920), (useMask: 0), (preLayerNorm: 0)
    (8): Transformer (nHeads: 4), (pDropout: 0.05), (pLayerdrop: 0.05), (bptt: 920), (useMask: 0), (preLayerNorm: 0)
    (9): Transformer (nHeads: 4), (pDropout: 0.05), (pLayerdrop: 0.05), (bptt: 920), (useMask: 0), (preLayerNorm: 0)
    (10): Transformer (nHeads: 4), (pDropout: 0.05), (pLayerdrop: 0.05), (bptt: 920), (useMask: 0), (preLayerNorm: 0)
    (11): Transformer (nHeads: 4), (pDropout: 0.05), (pLayerdrop: 0.05), (bptt: 920), (useMask: 0), (preLayerNorm: 0)
    (12): Transformer (nHeads: 4), (pDropout: 0.05), (pLayerdrop: 0.05), (bptt: 920), (useMask: 0), (preLayerNorm: 0)
    (13): Transformer (nHeads: 4), (pDropout: 0.05), (pLayerdrop: 0.05), (bptt: 920), (useMask: 0), (preLayerNorm: 0)
    (14): Transformer (nHeads: 4), (pDropout: 0.05), (pLayerdrop: 0.05), (bptt: 920), (useMask: 0), (preLayerNorm: 0)
    (15): Transformer (nHeads: 4), (pDropout: 0.05), (pLayerdrop: 0.05), (bptt: 920), (useMask: 0), (preLayerNorm: 0)
    (16): Transformer (nHeads: 4), (pDropout: 0.05), (pLayerdrop: 0.05), (bptt: 920), (useMask: 0), (preLayerNorm: 0)
    (17): Transformer (nHeads: 4), (pDropout: 0.05), (pLayerdrop: 0.05), (bptt: 920), (useMask: 0), (preLayerNorm: 0)
    (18): Transformer (nHeads: 4), (pDropout: 0.05), (pLayerdrop: 0.05), (bptt: 920), (useMask: 0), (preLayerNorm: 0)
    (19): Transformer (nHeads: 4), (pDropout: 0.05), (pLayerdrop: 0.05), (bptt: 920), (useMask: 0), (preLayerNorm: 0)
    (20): Transformer (nHeads: 4), (pDropout: 0.05), (pLayerdrop: 0.05), (bptt: 920), (useMask: 0), (preLayerNorm: 0)
    (21): Transformer (nHeads: 4), (pDropout: 0.05), (pLayerdrop: 0.05), (bptt: 920), (useMask: 0), (preLayerNorm: 0)
    (22): Transformer (nHeads: 4), (pDropout: 0.05), (pLayerdrop: 0.05), (bptt: 920), (useMask: 0), (preLayerNorm: 0)
    (23): Transformer (nHeads: 4), (pDropout: 0.05), (pLayerdrop: 0.05), (bptt: 920), (useMask: 0), (preLayerNorm: 0)
    (24): Transformer (nHeads: 4), (pDropout: 0.05), (pLayerdrop: 0.05), (bptt: 920), (useMask: 0), (preLayerNorm: 0)
    (25): Transformer (nHeads: 4), (pDropout: 0.05), (pLayerdrop: 0.05), (bptt: 920), (useMask: 0), (preLayerNorm: 0)
    (26): Transformer (nHeads: 4), (pDropout: 0.05), (pLayerdrop: 0.05), (bptt: 920), (useMask: 0), (preLayerNorm: 0)
    (27): Transformer (nHeads: 4), (pDropout: 0.05), (pLayerdrop: 0.05), (bptt: 920), (useMask: 0), (preLayerNorm: 0)
    (28): Transformer (nHeads: 4), (pDropout: 0.05), (pLayerdrop: 0.05), (bptt: 920), (useMask: 0), (preLayerNorm: 0)
    (29): Transformer (nHeads: 4), (pDropout: 0.05), (pLayerdrop: 0.05), (bptt: 920), (useMask: 0), (preLayerNorm: 0)
    (30): Transformer (nHeads: 4), (pDropout: 0.05), (pLayerdrop: 0.05), (bptt: 920), (useMask: 0), (preLayerNorm: 0)
    (31): Transformer (nHeads: 4), (pDropout: 0.05), (pLayerdrop: 0.05), (bptt: 920), (useMask: 0), (preLayerNorm: 0)
    (32): Transformer (nHeads: 4), (pDropout: 0.05), (pLayerdrop: 0.05), (bptt: 920), (useMask: 0), (preLayerNorm: 0)
    (33): Transformer (nHeads: 4), (pDropout: 0.05), (pLayerdrop: 0.05), (bptt: 920), (useMask: 0), (preLayerNorm: 0)
    (34): Transformer (nHeads: 4), (pDropout: 0.05), (pLayerdrop: 0.05), (bptt: 920), (useMask: 0), (preLayerNorm: 0)
    (35): Transformer (nHeads: 4), (pDropout: 0.05), (pLayerdrop: 0.05), (bptt: 920), (useMask: 0), (preLayerNorm: 0)
    (36): Transformer (nHeads: 4), (pDropout: 0.05), (pLayerdrop: 0.05), (bptt: 920), (useMask: 0), (preLayerNorm: 0)
    (37): Transformer (nHeads: 4), (pDropout: 0.05), (pLayerdrop: 0.05), (bptt: 920), (useMask: 0), (preLayerNorm: 0)
    (38): Transformer (nHeads: 4), (pDropout: 0.05), (pLayerdrop: 0.05), (bptt: 920), (useMask: 0), (preLayerNorm: 0)
    (39): Transformer (nHeads: 4), (pDropout: 0.05), (pLayerdrop: 0.05), (bptt: 920), (useMask: 0), (preLayerNorm: 0)
    (40): Transformer (nHeads: 4), (pDropout: 0.05), (pLayerdrop: 0.05), (bptt: 920), (useMask: 0), (preLayerNorm: 0)
    (41): Transformer (nHeads: 4), (pDropout: 0.05), (pLayerdrop: 0.05), (bptt: 920), (useMask: 0), (preLayerNorm: 0)
    (42): Linear (384->29) (with bias)
I0303 03:48:31.587855  2943 Train.cpp:455] [Network Params: 70498735]
I0303 03:48:31.587899  2943 Train.cpp:456] [Criterion] ConnectionistTemporalClassificationCriterion
I0303 03:48:31.618924  2943 Train.cpp:490] [Network Optimizer] SGD (momentum=0.8)
I0303 03:48:31.618974  2943 Train.cpp:491] [Criterion Optimizer] Adagrad (epsilon=1e-08)
I0303 03:48:31.920735  2943 Train.cpp:1021] Shuffling trainset
I0303 03:48:32.075875  2943 Train.cpp:1028] Epoch 1 started!
Failed to allocate memory of size 152.00 MiB (Device: 0, Capacity: 14.75 GiB, Allocated: 13.93 GiB, Cached: 721.55 MiB) with error 'ArrayFire Exception (Device out of memory:101):
ArrayFire error: 
In function fl::MemoryManagerInstaller::MemoryManagerInstaller(std::shared_ptr<fl::MemoryManagerAdapter>)::<lambda(size_t)>
In file /home/jupyter/flashlight/flashlight/fl/memory/MemoryManagerInstaller.cpp:180'
terminate called after throwing an instance of 'af::exception'
  what():  ArrayFire Exception (Device out of memory:101):
In function virtual void* MemoryManagerFunctionWrapper::alloc(bool, unsigned int, dim_t*, unsigned int)
In file src/api/c/memory.cpp:706

 0# 0x00007FAAB1467EE6 in /opt/arrayfire/lib64/libafcuda.so.3
 1# 0x00007FAAB0C1B436 in /opt/arrayfire/lib64/libafcuda.so.3
 2# 0x00007FAAB0691600 in /opt/arrayfire/lib64/libafcuda.so.3
 3# 0x00007FAAB06916C9 in /opt/arrayfire/lib64/libafcuda.so.3
 4# 0x00007FAAB0AE9445 in /opt/arrayfire/lib64/libafcuda.so.3
 5# af_join in /opt/arrayfire/lib64/libafcuda.so.3
 6# af::join(int, af::array const&, af::array const&) in /opt/arrayfire/lib64/libafcuda.so.3
 7# fl::relativePositionEmbeddingRotate(fl::Variable const&) in /home/jupyter/flashlight/build/bin/asr/fl_asr_train
 8# fl::multiheadAttention(fl::Variable const&, fl::Variable const&, fl::Variable const&, fl::Variable const&, fl::Variable const&, fl::Variable const&, int, double, int) in /home/jupyter/flashlight/build/bin/asr/fl_asr_train
 9# fl::Transformer::selfAttention(std:
*** Aborted at 1614749230 (unix time) try "date -d @1614749230" if you are using GNU date ***
PC: @     0x7faaa5e4a7bb gsignal
*** SIGABRT (@0xb7f) received by PID 2943 (TID 0x7faaa2b13300) from PID 2943; stack trace: ***
    @     0x7faaeebf6730 (unknown)
    @     0x7faaa5e4a7bb gsignal
    @     0x7faaa5e35535 abort
    @     0x7faab25ce275 __gnu_cxx::__verbose_terminate_handler()
    @     0x7faab253e396 __cxxabiv1::__terminate()
    @     0x7faab253e3e1 std::terminate()
    @     0x7faab2532143 __cxa_throw
    @     0x7faab1684a4e af::join()
    @     0x5577207bedbd fl::relativePositionEmbeddingRotate()
    @     0x5577207c21d5 fl::multiheadAttention()
    @     0x55772079e9b8 fl::Transformer::selfAttention()
    @     0x55772079f0b5 fl::Transformer::forward()
    @     0x55772083a427 fl::ext::forwardSequentialModuleWithPadMask()
    @     0x5577205ba0b0 _ZZ4mainENKUlSt10shared_ptrIN2fl6ModuleEES_INS0_3app3asr17SequenceCriterionEES_INS0_7DatasetEES_INS0_19FirstOrderOptimizerEESA_ddblE4_clES2_S6_S8_SA_SA_ddbl
    @     0x55772053ab6f main
    @     0x7faaa5e3709b __libc_start_main
    @     0x5577205b3d0a _start
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
*** Aborted at 1614749232 (unix time) try "date -d @1614749232" if you are using GNU date ***
PC: @     0x7fffb51dbb44 ([vdso]+0xb43)
*** SIGTERM (@0xb4f) received by PID 2946 (TID 0x7f25b57e5300) from PID 2895; stack trace: ***
    @     0x7f26018c8730 (unknown)
    @     0x7fffb51dbb44 ([vdso]+0xb43)
    @     0x7f25b8bebff6 __clock_gettime
    @     0x7f25b60534fe (unknown)
    @     0x7f25b61431a4 (unknown)
    @     0x7f25b603c47f (unknown)
    @     0x7f25b603c5e9 (unknown)
    @     0x7f25b5f342fd (unknown)
    @     0x7f25b60ef059 cuStreamSynchronize
    @     0x7f25c43e1f00 cudart::cudaApiStreamSynchronize()
    @     0x7f25c441ee3d cudaStreamSynchronize
    @     0x7f25c390ebdf cuda::sync()
    @     0x7f25c405d70f af_sync
    @     0x7f25c435c502 af::sync()
    @     0x564c5bf7bb42 _ZZ4mainENKUlSt10shared_ptrIN2fl6ModuleEES_INS0_3app3asr17SequenceCriterionEES_INS0_7DatasetEES_INS0_19FirstOrderOptimizerEESA_ddblE4_clES2_S6_S8_SA_SA_ddbl
    @     0x564c5befbb6f main
    @     0x7f25b8b0909b __libc_start_main
    @     0x564c5bf74d0a _start
*** Aborted at 1614749232 (unix time) try "date -d @1614749232" if you are using GNU date ***
PC: @     0x7fff8f570b44 ([vdso]+0xb43)
*** SIGTERM (@0xb4f) received by PID 2945 (TID 0x7f09ac74c300) from PID 2895; stack trace: ***
    @     0x7f09f882f730 (unknown)
    @     0x7fff8f570b44 ([vdso]+0xb43)
    @     0x7f09afb52ff6 __clock_gettime
    @     0x7f09acfba4fe (unknown)
    @     0x7f09ad0aa1a4 (unknown)
    @     0x7f09acfa347f (unknown)
    @     0x7f09acfa35e9 (unknown)
    @     0x7f09ace9b2fd (unknown)
    @     0x7f09ad056059 cuStreamSynchronize
    @     0x7f09bb348f00 cudart::cudaApiStreamSynchronize()
    @     0x7f09bb385e3d cudaStreamSynchronize
    @     0x7f09ba875bdf cuda::sync()
    @     0x7f09bafc470f af_sync
    @     0x7f09bb2c3502 af::sync()
    @     0x560909a3fb42 _ZZ4mainENKUlSt10shared_ptrIN2fl6ModuleEES_INS0_3app3asr17SequenceCriterionEES_INS0_7DatasetEES_INS0_19FirstOrderOptimizerEESA_ddblE4_clES2_S6_S8_SA_SA_ddbl
    @     0x5609099bfb6f main
    @     0x7f09afa7009b __libc_start_main
    @     0x560909a38d0a _start
*** Aborted at 1614749232 (unix time) try "date -d @1614749232" if you are using GNU date ***
PC: @     0x7fff6056ab44 ([vdso]+0xb43)
*** SIGTERM (@0xb4f) received by PID 2944 (TID 0x7f84b655c300) from PID 2895; stack trace: ***
    @     0x7f850263f730 (unknown)
    @     0x7fff6056ab44 ([vdso]+0xb43)
    @     0x7f84b9962ff6 __clock_gettime
    @     0x7f84b6dca4fe (unknown)
    @     0x7f84b6eba1a4 (unknown)
    @     0x7f84b6db347f (unknown)
    @     0x7f84b6db35e9 (unknown)
    @     0x7f84b6cab2fd (unknown)
    @     0x7f84b6e66059 cuStreamSynchronize
    @     0x7f84c5158f00 cudart::cudaApiStreamSynchronize()
    @     0x7f84c5195e3d cudaStreamSynchronize
    @     0x7f84c4685bdf cuda::sync()
    @     0x7f84c4dd470f af_sync
    @     0x7f84c50d3502 af::sync()
    @     0x556ab1abbb42 _ZZ4mainENKUlSt10shared_ptrIN2fl6ModuleEES_INS0_3app3asr17SequenceCriterionEES_INS0_7DatasetEES_INS0_19FirstOrderOptimizerEESA_ddblE4_clES2_S6_S8_SA_SA_ddbl
    @     0x556ab1a3bb6f main
    @     0x7f84b988009b __libc_start_main
    @     0x556ab1ab4d0a _start
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node asr exited on signal 6 (Aborted).
--------------------------------------------------------------------------
tlikhomanenko commented 3 years ago

The error itself says what the problem is: "Failed to allocate memory of size 152.00 MiB (Device: 0, Capacity: 14.75 GiB, Allocated: 13.93 GiB, Cached: 721.55 MiB) with error 'ArrayFire Exception (Device out of memory:101)'" - so you have an OOM on the GPU.

Since this happens on some particular batch, check what the longest audio you have is. You probably need to filter the very long audio out of your list file to fit in memory with your current batch size.
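For reference, a quick way to do that filtering, assuming the standard list format where the third column is the audio duration in milliseconds:

```bash
# Sketch: keep only utterances up to 30 s; adjust the threshold to your GPU memory.
awk '$3 <= 30000' train.lst > train_filtered.lst
```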

sciai-ai commented 3 years ago

@tlikhomanenko your suggestion worked, I filtered out longer audio files.

  1. I used this command to finetune the model on my dataset, which is ~400k training samples: `/flashlight/build/bin/asr/fl_asr_train fork model.bin --datadir=/home/ --train=1_train.lst,2_train.lst --valid=1_validate.lst,2_validate.lst --arch=arch.txt --tokens=tokens.txt --lexicon=lexicon.txt --rundir=checkpoint2 --lr=0.025 --netoptim=sgd --reportiters=4 --iter=25000000 --batchsize=4 --warmup=0 &>log2.txt`

a) Do I need to update the lexicon file so that it includes the new words that are part of the training set, or is this only needed when decoding with the language model?

  1. The training is taking too long (2+ hrs based on the last iteration) and I still see epoch 1 after each iteration, as you can see from the log (pasted below). Do I need to change any parameters? I think it's probably to do with reportiters. Based on the number of training samples, is there a way to decide on reportiters?

  2. For the language model decoder, is there any systematic way to tune the various parameters like lmweight, beam threshold, etc.?

  3. During acoustic model training, is there any need to trim silences before and after each utterance?

LOG: epoch: 1 | nupdates: 4 | lr: 0.025000 | lrcriterion: 0.020000 | scale-factor: 1.000000 | runtime: 00:00:03 | bch(ms): 767.74 | smp(ms): 88.63 | fwd(ms): 479.98 | crit-fwd(ms): 5.66 | bwd(ms): 170.04 | optim(ms): 28.73 | loss: 5.44458 | train-TER: 0.00 | train-WER: 0.00 | 1_validate.lst-loss: 2.67919 | 1_validate.lst-TER: 8.46 | 1_validate.lst-WER: 21.68 | 2_validate.lst-loss: 4.67824 | 2_validate.lst-TER: 19.89 | 2_validate.lst-WER: 49.57 | avg-isz: 000 | avg-tsz: 053 | max-tsz: 096 | avr-batchsz: 4.00 | hrs: 0.00 | thrpt(sec/sec): 0.00 | timestamp: 2021-03-11 12:40:04 epoch: 1 | nupdates: 8 | lr: 0.025000 | lrcriterion: 0.020000 | scale-factor: 1.000000 | runtime: 00:00:01 | bch(ms): 267.30 | smp(ms): 0.05 | fwd(ms): 112.57 | crit-fwd(ms): 3.36 | bwd(ms): 124.86 | optim(ms): 29.49 | loss: 3.58493 | train-TER: 0.00 | train-WER: 0.00 | 1_validate.lst-loss: 2.64116 | 1_validate.lst-TER: 8.38 | 1_validate.lst-WER: 21.51 | 2_validate.lst-loss: 4.61335 | 2_validate.lst-TER: 19.67 | 2_validate.lst-WER: 49.18 | avg-isz: 000 | avg-tsz: 067 | max-tsz: 092 | avr-batchsz: 4.00 | hrs: 0.00 | thrpt(sec/sec): 0.00 | timestamp: 2021-03-11 14:57:22 epoch: 1 | nupdates: 12 | lr: 0.025000 | lrcriterion: 0.020000 | scale-factor: 1.000000 | runtime: 00:00:01 | bch(ms): 266.41 | smp(ms): 0.05 | fwd(ms): 111.87 | crit-fwd(ms): 2.91 | bwd(ms): 122.21 | optim(ms): 31.99 | loss: 4.31629 | train-TER: 0.00 | train-WER: 0.00 | 1_validate.lst-loss: 2.58355 | 1_validate.lst-TER: 8.25 | 1_validate.lst-WER: 21.22 | 2_validate.lst-loss: 4.53135 | 2_validate.lst-TER: 19.40 | 2_validate.lst-WER: 48.76 | avg-isz: 000 | avg-tsz: 058 | max-tsz: 100 | avr-batchsz: 4.00 | hrs: 0.00 | thrpt(sec/sec): 0.00 | timestamp: 2021-03-11 17:14:46 epoch: 1 | nupdates: 16 | lr: 0.025000 | lrcriterion: 0.020000 | scale-factor: 1.000000 | runtime: 00:00:00 | bch(ms): 247.69 | smp(ms): 0.05 | fwd(ms): 105.44 | crit-fwd(ms): 2.70 | bwd(ms): 113.95 | optim(ms): 27.95 | loss: 2.44922 | train-TER: 0.00 | train-WER: 0.00 | 1_validate.lst-loss: 2.51435 | 1_validate.lst-TER: 8.11 | 1_validate.lst-WER: 20.89 | 2_validate.lst-loss: 4.43393 | 2_validate.lst-TER: 19.12 | 2_validate.lst-WER: 48.34 | avg-isz: 000 | avg-tsz: 050 | max-tsz: 079 | avr-batchsz: 4.00 | hrs: 0.00 | thrpt(sec/sec): 0.00 | timestamp: 2021-03-11 19:32:05 epoch: 1 | nupdates: 20 | lr: 0.025000 | lrcriterion: 0.020000 | scale-factor: 1.000000 | runtime: 00:00:01 | bch(ms): 252.07 | smp(ms): 0.06 | fwd(ms): 106.53 | crit-fwd(ms): 2.33 | bwd(ms): 112.26 | optim(ms): 32.92 | loss: 3.53145 | train-TER: 0.00 | train-WER: 0.00 | 1_validate.lst-loss: 2.45454 | 1_validate.lst-TER: 8.04 | 1_validate.lst-WER: 20.73 | 2_validate.lst-loss: 4.34329 | 2_validate.lst-TER: 18.91 | 2_validate.lst-WER: 48.13 | avg-isz: 000 | avg-tsz: 045 | max-tsz: 069 | avr-batchsz: 4.00 | hrs: 0.00 | thrpt(sec/sec): 0.00 | timestamp: 2021-03-11 21:49:21 epoch: 1 | nupdates: 24 | lr: 0.025000 | lrcriterion: 0.020000 | scale-factor: 1.000000 | runtime: 00:00:01 | bch(ms): 462.68 | smp(ms): 0.05 | fwd(ms): 277.28 | crit-fwd(ms): 4.56 | bwd(ms): 156.17 | optim(ms): 28.89 | loss: 5.03519 | train-TER: 0.00 | train-WER: 0.00 | 1_validate.lst-loss: 2.41925 | 1_validate.lst-TER: 8.02 | 1_validate.lst-WER: 20.71 | 2_validate.lst-loss: 4.26934 | 2_validate.lst-TER: 18.68 | 2_validate.lst-WER: 47.72 | avg-isz: 000 | avg-tsz: 067 | max-tsz: 123 | avr-batchsz: 4.00 | hrs: 0.00 | thrpt(sec/sec): 0.00 | timestamp: 2021-03-12 00:06:40 epoch: 1 | nupdates: 28 | lr: 0.025000 | lrcriterion: 0.020000 | scale-factor: 1.000000 | 
runtime: 00:00:01 | bch(ms): 316.14 | smp(ms): 0.07 | fwd(ms): 146.86 | crit-fwd(ms): 3.66 | bwd(ms): 137.72 | optim(ms): 31.16 | loss: 2.87421 | train-TER: 0.00 | train-WER: 0.00 | 1_validate.lst-loss: 2.38867 | 1_validate.lst-TER: 7.98 | 1_validate.lst-WER: 20.66 | 2_validate.lst-loss: 4.21672 | 2_validate.lst-TER: 18.50 | 2_validate.lst-WER: 47.37 | avg-isz: 000 | avg-tsz: 075 | max-tsz: 099 | avr-batchsz: 4.00 | hrs: 0.00 | thrpt(sec/sec): 0.00 | timestamp: 2021-03-12 02:24:05 epoch: 1 | nupdates: 32 | lr: 0.025000 | lrcriterion: 0.020000 | scale-factor: 1.000000 | runtime: 00:00:01 | bch(ms): 297.32 | smp(ms): 0.06 | fwd(ms): 129.30 | crit-fwd(ms): 3.85 | bwd(ms): 135.61 | optim(ms): 31.93 | loss: 2.31740 | train-TER: 4.89 | train-WER: 24.44 | 1_validate.lst-loss: 2.36230 | 1_validate.lst-TER: 7.95 | 1_validate.lst-WER: 20.61 | 2_validate.lst-loss: 4.17862 | 2_validate.lst-TER: 18.36 | 2_validate.lst-WER: 47.08 | avg-isz: 000 | avg-tsz: 061 | max-tsz: 102 | avr-batchsz: 4.00 | hrs: 0.00 | thrpt(sec/sec): 0.00 | timestamp: 2021-03-12 04:41:28 epoch: 1 | nupdates: 36 | lr: 0.025000 | lrcriterion: 0.020000 | scale-factor: 1.000000 | runtime: 00:00:01 | bch(ms): 267.74 | smp(ms): 0.05 | fwd(ms): 112.66 | crit-fwd(ms): 3.09 | bwd(ms): 125.47 | optim(ms): 29.26 | loss: 4.45074 | train-TER: 0.00 | train-WER: 0.00 | 1_validate.lst-loss: 2.34028 | 1_validate.lst-TER: 7.92 | 1_validate.lst-WER: 20.58 | 2_validate.lst-loss: 4.14444 | 2_validate.lst-TER: 18.27 | 2_validate.lst-WER: 46.91 | avg-isz: 000 | avg-tsz: 057 | max-tsz: 086 | avr-batchsz: 4.00 | hrs: 0.00 | thrpt(sec/sec): 0.00 | timestamp: 2021-03-12 06:58:50 epoch: 1 | nupdates: 40 | lr: 0.025000 | lrcriterion: 0.020000 | scale-factor: 1.000000 | runtime: 00:00:00 | bch(ms): 231.65 | smp(ms): 0.06 | fwd(ms): 98.37 | crit-fwd(ms): 2.38 | bwd(ms): 103.77 | optim(ms): 29.11 | loss: 3.69098 | train-TER: 16.26 | train-WER: 52.17 | 1_validate.lst-loss: 2.32298 | 1_validate.lst-TER: 7.90 | 1_validate.lst-WER: 20.57 | 2_validate.lst-loss: 4.10762 | 2_validate.lst-TER: 18.19 | 2_validate.lst-WER: 46.77 | avg-isz: 000 | avg-tsz: 042 | max-tsz: 080 | avr-batchsz: 4.00 | hrs: 0.00 | thrpt(sec/sec): 0.00 | timestamp: 2021-03-12 09:16:04 epoch: 1 | nupdates: 44 | lr: 0.025000 | lrcriterion: 0.020000 | scale-factor: 1.000000 | runtime: 00:00:01 | bch(ms): 368.55 | smp(ms): 0.05 | fwd(ms): 171.69 | crit-fwd(ms): 5.95 | bwd(ms): 168.59 | optim(ms): 27.92 | loss: 3.05313 | train-TER: 0.00 | train-WER: 0.00 | 1_validate.lst-loss: 2.30865 | 1_validate.lst-TER: 7.87 | 1_validate.lst-WER: 20.53 | 2_validate.lst-loss: 4.07866 | 2_validate.lst-TER: 18.12 | 2_validate.lst-WER: 46.63 | avg-isz: 000 | avg-tsz: 084 | max-tsz: 123 | avr-batchsz: 4.00 | hrs: 0.00 | thrpt(sec/sec): 0.00 | timestamp: 2021-03-12 11:33:20 epoch: 1 | nupdates: 48 | lr: 0.025000 | lrcriterion: 0.020000 | scale-factor: 1.000000 | runtime: 00:00:00 | bch(ms): 211.89 | smp(ms): 0.06 | fwd(ms): 90.20 | crit-fwd(ms): 1.63 | bwd(ms): 93.36 | optim(ms): 27.96 | loss: 3.58933 | train-TER: 0.00 | train-WER: 0.00 | 1_validate.lst-loss: 2.30292 | 1_validate.lst-TER: 7.86 | 1_validate.lst-WER: 20.54 | 2_validate.lst-loss: 4.05318 | 2_validate.lst-TER: 18.06 | 2_validate.lst-WER: 46.54 | avg-isz: 000 | avg-tsz: 041 | max-tsz: 090 | avr-batchsz: 4.00 | hrs: 0.00 | thrpt(sec/sec): 0.00 | timestamp: 2021-03-12 13:50:36 epoch: 1 | nupdates: 52 | lr: 0.025000 | lrcriterion: 0.020000 | scale-factor: 1.000000 | runtime: 00:00:01 | bch(ms): 304.38 | smp(ms): 0.06 | fwd(ms): 141.39 | crit-fwd(ms): 3.88 | 
bwd(ms): 131.10 | optim(ms): 31.51 | loss: 4.29644 | train-TER: 0.00 | train-WER: 0.00 | 1_validate.lst-loss: 2.29010 | 1_validate.lst-TER: 7.81 | 1_validate.lst-WER: 20.44 | 2_validate.lst-loss: 4.02052 | 2_validate.lst-TER: 17.90 | 2_validate.lst-WER: 46.21 | avg-isz: 000 | avg-tsz: 058 | max-tsz: 110 | avr-batchsz: 4.00 | hrs: 0.00 | thrpt(sec/sec): 0.00 | timestamp: 2021-03-12 16:07:47 epoch: 1 | nupdates: 56 | lr: 0.025000 | lrcriterion: 0.020000 | scale-factor: 1.000000 | runtime: 00:00:00 | bch(ms): 223.90 | smp(ms): 0.06 | fwd(ms): 91.69 | crit-fwd(ms): 2.13 | bwd(ms): 103.71 | optim(ms): 28.16 | loss: 2.74172 | train-TER: 0.00 | train-WER: 0.00 | 1_validate.lst-loss: 2.27513 | 1_validate.lst-TER: 7.75 | 1_validate.lst-WER: 20.31 | 2_validate.lst-loss: 3.99113 | 2_validate.lst-TER: 17.75 | 2_validate.lst-WER: 45.94 | avg-isz: 000 | avg-tsz: 043 | max-tsz: 069 | avr-batchsz: 4.00 | hrs: 0.00 | thrpt(sec/sec): 0.00 | timestamp: 2021-03-12 18:24:52 epoch: 1 | nupdates: 60 | lr: 0.025000 | lrcriterion: 0.020000 | scale-factor: 1.000000 | runtime: 00:00:00 | bch(ms): 231.56 | smp(ms): 0.05 | fwd(ms): 101.63 | crit-fwd(ms): 2.31 | bwd(ms): 102.10 | optim(ms): 27.50 | loss: 4.87104 | train-TER: 0.00 | train-WER: 0.00 | 1_validate.lst-loss: 2.25372 | 1_validate.lst-TER: 7.62 | 1_validate.lst-WER: 20.02 | 2_validate.lst-loss: 3.94646 | 2_validate.lst-TER: 17.40 | 2_validate.lst-WER: 45.30 | avg-isz: 000 | avg-tsz: 042 | max-tsz: 065 | avr-batchsz: 4.00 | hrs: 0.00 | thrpt(sec/sec): 0.00 | timestamp: 2021-03-12 20:41:56 epoch: 1 | nupdates: 64 | lr: 0.025000 | lrcriterion: 0.020000 | scale-factor: 1.000000 | runtime: 00:00:01 | bch(ms): 308.11 | smp(ms): 0.05 | fwd(ms): 135.13 | crit-fwd(ms): 4.44 | bwd(ms): 143.25 | optim(ms): 29.38 | loss: 4.15069 | train-TER: 0.00 | train-WER: 0.00 | 1_validate.lst-loss: 2.24460 | 1_validate.lst-TER: 7.52 | 1_validate.lst-WER: 19.83 | 2_validate.lst-loss: 3.92652 | 2_validate.lst-TER: 17.08 | 2_validate.lst-WER: 44.79 | avg-isz: 000 | avg-tsz: 064 | max-tsz: 147 | avr-batchsz: 4.00 | hrs: 0.00 | thrpt(sec/sec): 0.00 | timestamp: 2021-03-12 22:59:02 epoch: 1 | nupdates: 68 | lr: 0.025000 | lrcriterion: 0.020000 | scale-factor: 1.000000 | runtime: 00:00:01 | bch(ms): 275.55 | smp(ms): 0.06 | fwd(ms): 116.78 | crit-fwd(ms): 3.26 | bwd(ms): 128.64 | optim(ms): 29.74 | loss: 2.38166 | train-TER: 0.00 | train-WER: 0.00 | 1_validate.lst-loss: 2.24437 | 1_validate.lst-TER: 7.47 | 1_validate.lst-WER: 19.75 | 2_validate.lst-loss: 3.92277 | 2_validate.lst-TER: 16.94 | 2_validate.lst-WER: 44.57 | avg-isz: 000 | avg-tsz: 054 | max-tsz: 074 | avr-batchsz: 4.00 | hrs: 0.00 | thrpt(sec/sec): 0.00 | timestamp: 2021-03-13 01:16:12 epoch: 1 | nupdates: 72 | lr: 0.025000 | lrcriterion: 0.020000 | scale-factor: 1.000000 | runtime: 00:00:01 | bch(ms): 254.72 | smp(ms): 0.05 | fwd(ms): 103.74 | crit-fwd(ms): 2.95 | bwd(ms): 117.75 | optim(ms): 32.89 | loss: 2.26761 | train-TER: 0.00 | train-WER: 0.00 | 1_validate.lst-loss: 2.24409 | 1_validate.lst-TER: 7.46 | 1_validate.lst-WER: 19.72 | 2_validate.lst-loss: 3.91088 | 2_validate.lst-TER: 16.88 | 2_validate.lst-WER: 44.43 | avg-isz: 000 | avg-tsz: 053 | max-tsz: 073 | avr-batchsz: 4.00 | hrs: 0.00 | thrpt(sec/sec): 0.00 | timestamp: 2021-03-13 03:33:18 epoch: 1 | nupdates: 76 | lr: 0.025000 | lrcriterion: 0.020000 | scale-factor: 1.000000 | runtime: 00:00:01 | bch(ms): 279.85 | smp(ms): 0.06 | fwd(ms): 121.70 | crit-fwd(ms): 3.42 | bwd(ms): 130.44 | optim(ms): 27.35 | loss: 3.83998 | train-TER: 0.00 | train-WER: 0.00 | 
1_validate.lst-loss: 2.24709 | 1_validate.lst-TER: 7.47 | 1_validate.lst-WER: 19.73 | 2_validate.lst-loss: 3.90004 | 2_validate.lst-TER: 16.85 | 2_validate.lst-WER: 44.35 | avg-isz: 000 | avg-tsz: 061 | max-tsz: 117 | avr-batchsz: 4.00 | hrs: 0.00 | thrpt(sec/sec): 0.00 | timestamp: 2021-03-13 05:50:20 epoch: 1 | nupdates: 80 | lr: 0.025000 | lrcriterion: 0.020000 | scale-factor: 1.000000 | runtime: 00:00:01 | bch(ms): 335.25 | smp(ms): 0.05 | fwd(ms): 145.44 | crit-fwd(ms): 4.23 | bwd(ms): 159.00 | optim(ms): 30.32 | loss: 2.01023 | train-TER: 4.25 | train-WER: 13.04 | 1_validate.lst-loss: 2.25247 | 1_validate.lst-TER: 7.48 | 1_validate.lst-WER: 19.76 | 2_validate.lst-loss: 3.88504 | 2_validate.lst-TER: 16.79 | 2_validate.lst-WER: 44.23 | avg-isz: 000 | avg-tsz: 067 | max-tsz: 122 | avr-batchsz: 4.00 | hrs: 0.00 | thrpt(sec/sec): 0.00 | timestamp: 2021-03-13 08:07:24 epoch: 1 | nupdates: 84 | lr: 0.025000 | lrcriterion: 0.020000 | scale-factor: 1.000000 | runtime: 00:00:01 | bch(ms): 251.69 | smp(ms): 0.05 | fwd(ms): 113.54 | crit-fwd(ms): 2.40 | bwd(ms): 109.01 | optim(ms): 28.82 | loss: 3.05545 | train-TER: 0.00 | train-WER: 0.00 | 1_validate.lst-loss: 2.25752 | 1_validate.lst-TER: 7.50 | 1_validate.lst-WER: 19.79 | 2_validate.lst-loss: 3.86690 | 2_validate.lst-TER: 16.74 | 2_validate.lst-WER: 44.09 | avg-isz: 000 | avg-tsz: 049 | max-tsz: 107 | avr-batchsz: 4.00 | hrs: 0.00 | thrpt(sec/sec): 0.00 | timestamp: 2021-03-13 10:24:27 epoch: 1 | nupdates: 88 | lr: 0.025000 | lrcriterion: 0.020000 | scale-factor: 1.000000 | runtime: 00:00:01 | bch(ms): 265.48 | smp(ms): 0.05 | fwd(ms): 113.47 | crit-fwd(ms): 2.66 | bwd(ms): 119.04 | optim(ms): 32.62 | loss: 3.98803 | train-TER: 0.00 | train-WER: 0.00 | 1_validate.lst-loss: 2.26528 | 1_validate.lst-TER: 7.53 | 1_validate.lst-WER: 19.85 | 2_validate.lst-loss: 3.84856 | 2_validate.lst-TER: 16.72 | 2_validate.lst-WER: 43.98 | avg-isz: 000 | avg-tsz: 049 | max-tsz: 099 | avr-batchsz: 4.00 | hrs: 0.00 | thrpt(sec/sec): 0.00 | timestamp: 2021-03-13 12:41:26 epoch: 1 | nupdates: 92 | lr: 0.025000 | lrcriterion: 0.020000 | scale-factor: 1.000000 | runtime: 00:00:01 | bch(ms): 271.12 | smp(ms): 0.05 | fwd(ms): 111.59 | crit-fwd(ms): 2.99 | bwd(ms): 128.63 | optim(ms): 30.54 | loss: 2.52905 | train-TER: 0.00 | train-WER: 0.00 | 1_validate.lst-loss: 2.26941 | 1_validate.lst-TER: 7.57 | 1_validate.lst-WER: 19.92 | 2_validate.lst-loss: 3.83679 | 2_validate.lst-TER: 16.74 | 2_validate.lst-WER: 43.98 | avg-isz: 000 | avg-tsz: 057 | max-tsz: 106 | avr-batchsz: 4.00 | hrs: 0.00 | thrpt(sec/sec): 0.00 | timestamp: 2021-03-13 14:58:33 epoch: 1 | nupdates: 96 | lr: 0.025000 | lrcriterion: 0.020000 | scale-factor: 1.000000 | runtime: 00:00:01 | bch(ms): 266.27 | smp(ms): 0.05 | fwd(ms): 114.84 | crit-fwd(ms): 2.55 | bwd(ms): 119.89 | optim(ms): 31.20 | loss: 4.68972 | train-TER: 0.00 | train-WER: 0.00 | 1_validate.lst-loss: 2.26676 | 1_validate.lst-TER: 7.57 | 1_validate.lst-WER: 19.92 | 2_validate.lst-loss: 3.82330 | 2_validate.lst-TER: 16.72 | 2_validate.lst-WER: 43.93 | avg-isz: 000 | avg-tsz: 050 | max-tsz: 072 | avr-batchsz: 4.00 | hrs: 0.00 | thrpt(sec/sec): 0.00 | timestamp: 2021-03-13 17:15:29 epoch: 1 | nupdates: 100 | lr: 0.025000 | lrcriterion: 0.020000 | scale-factor: 1.000000 | runtime: 00:00:01 | bch(ms): 459.80 | smp(ms): 0.05 | fwd(ms): 256.05 | crit-fwd(ms): 4.46 | bwd(ms): 175.62 | optim(ms): 27.80 | loss: 3.46648 | train-TER: 0.00 | train-WER: 0.00 | 1_validate.lst-loss: 2.26124 | 1_validate.lst-TER: 7.57 | 1_validate.lst-WER: 19.93 | 
2_validate.lst-loss: 3.80877 | 2_validate.lst-TER: 16.72 | 2_validate.lst-WER: 43.92 | avg-isz: 000 | avg-tsz: 075 | max-tsz: 120 | avr-batchsz: 4.00 | hrs: 0.00 | thrpt(sec/sec): 0.00 | timestamp: 2021-03-13 19:32:29 epoch: 1 | nupdates: 104 | lr: 0.025000 | lrcriterion: 0.020000 | scale-factor: 1.000000 | runtime: 00:00:01 | bch(ms): 273.58 | smp(ms): 0.05 | fwd(ms): 151.74 | crit-fwd(ms): 1.73 | bwd(ms): 92.70 | optim(ms): 28.79 | loss: 3.18890 | train-TER: 0.00 | train-WER: 0.00 | 1_validate.lst-loss: 2.25637 | 1_validate.lst-TER: 7.57 | 1_validate.lst-WER: 19.92 | 2_validate.lst-loss: 3.79254 | 2_validate.lst-TER: 16.66 | 2_validate.lst-WER: 43.82 | avg-isz: 000 | avg-tsz: 037 | max-tsz: 053 | avr-batchsz: 4.00 | hrs: 0.00 | thrpt(sec/sec): 0.00 | timestamp: 2021-03-13 21:49:30 epoch: 1 | nupdates: 108 | lr: 0.025000 | lrcriterion: 0.020000 | scale-factor: 1.000000 | runtime: 00:00:00 | bch(ms): 224.37 | smp(ms): 0.05 | fwd(ms): 94.09 | crit-fwd(ms): 1.80 | bwd(ms): 101.13 | optim(ms): 28.79 | loss: 3.33711 | train-TER: 0.00 | train-WER: 0.00 | 1_validate.lst-loss: 2.25010 | 1_validate.lst-TER: 7.56 | 1_validate.lst-WER: 19.91 | 2_validate.lst-loss: 3.77968 | 2_validate.lst-TER: 16.62 | 2_validate.lst-WER: 43.76 | avg-isz: 000 | avg-tsz: 033 | max-tsz: 051 | avr-batchsz: 4.00 | hrs: 0.00 | thrpt(sec/sec): 0.00 | timestamp: 2021-03-14 00:06:35 epoch: 1 | nupdates: 112 | lr: 0.025000 | lrcriterion: 0.020000 | scale-factor: 1.000000 | runtime: 00:00:01 | bch(ms): 313.11 | smp(ms): 0.05 | fwd(ms): 143.12 | crit-fwd(ms): 4.18 | bwd(ms): 139.75 | optim(ms): 29.90 | loss: 4.90623 | train-TER: 0.00 | train-WER: 0.00 | 1_validate.lst-loss: 2.24672 | 1_validate.lst-TER: 7.57 | 1_validate.lst-WER: 19.93 | 2_validate.lst-loss: 3.76660 | 2_validate.lst-TER: 16.56 | 2_validate.lst-WER: 43.67 | avg-isz: 000 | avg-tsz: 054 | max-tsz: 086 | avr-batchsz: 4.00 | hrs: 0.00 | thrpt(sec/sec): 0.00 | timestamp: 2021-03-14 02:23:34 epoch: 1 | nupdates: 116 | lr: 0.025000 | lrcriterion: 0.020000 | scale-factor: 1.000000 | runtime: 00:00:01 | bch(ms): 277.48 | smp(ms): 0.05 | fwd(ms): 122.41 | crit-fwd(ms): 3.09 | bwd(ms): 124.83 | optim(ms): 29.92 | loss: 3.47979 | train-TER: 0.00 | train-WER: 0.00 | 1_validate.lst-loss: 2.24914 | 1_validate.lst-TER: 7.55 | 1_validate.lst-WER: 19.91 | 2_validate.lst-loss: 3.74887 | 2_validate.lst-TER: 16.35 | 2_validate.lst-WER: 43.39 | avg-isz: 000 | avg-tsz: 054 | max-tsz: 095 | avr-batchsz: 4.00 | hrs: 0.00 | thrpt(sec/sec): 0.00 | timestamp: 2021-03-14 04:41:14 epoch: 1 | nupdates: 120 | lr: 0.025000 | lrcriterion: 0.020000 | scale-factor: 1.000000 | runtime: 00:00:02 | bch(ms): 554.08 | smp(ms): 20.43 | fwd(ms): 187.21 | crit-fwd(ms): 4.55 | bwd(ms): 292.14 | optim(ms): 29.52 | loss: 2.30118 | train-TER: 0.00 | train-WER: 0.00 | 1_validate.lst-loss: 2.24642 | 1_validate.lst-TER: 7.54 | 1_validate.lst-WER: 19.88 | 2_validate.lst-loss: 3.74103 | 2_validate.lst-TER: 16.27 | 2_validate.lst-WER: 43.31 | avg-isz: 000 | avg-tsz: 059 | max-tsz: 092 | avr-batchsz: 4.00 | hrs: 0.00 | thrpt(sec/sec): 0.00 | timestamp: 2021-03-14 06:58:31 epoch: 1 | nupdates: 124 | lr: 0.025000 | lrcriterion: 0.020000 | scale-factor: 1.000000 | runtime: 00:00:02 | bch(ms): 547.52 | smp(ms): 0.09 | fwd(ms): 272.84 | crit-fwd(ms): 4.48 | bwd(ms): 236.02 | optim(ms): 30.17 | loss: 4.75666 | train-TER: 0.00 | train-WER: 0.00 | 1_validate.lst-loss: 2.24026 | 1_validate.lst-TER: 7.51 | 1_validate.lst-WER: 19.82 | 2_validate.lst-loss: 3.73794 | 2_validate.lst-TER: 16.23 | 2_validate.lst-WER: 43.25 | 
avg-isz: 000 | avg-tsz: 069 | max-tsz: 105 | avr-batchsz: 4.00 | hrs: 0.00 | thrpt(sec/sec): 0.00 | timestamp: 2021-03-14 09:15:43 epoch: 1 | nupdates: 128 | lr: 0.025000 | lrcriterion: 0.020000 | scale-factor: 1.000000 | runtime: 00:00:01 | bch(ms): 287.03 | smp(ms): 0.09 | fwd(ms): 123.13 | crit-fwd(ms): 2.95 | bwd(ms): 135.07 | optim(ms): 28.39 | loss: 4.74566 | train-TER: 0.00 | train-WER: 0.00 | 1_validate.lst-loss: 2.22924 | 1_validate.lst-TER: 7.45 | 1_validate.lst-WER: 19.68 | 2_validate.lst-loss: 3.73832 | 2_validate.lst-TER: 16.17 | 2_validate.lst-WER: 43.20 | avg-isz: 000 | avg-tsz: 050 | max-tsz: 079 | avr-batchsz: 4.00 | hrs: 0.00 | thrpt(sec/sec): 0.00 | timestamp: 2021-03-14 11:32:45 epoch: 1 | nupdates: 132 | lr: 0.025000 | lrcriterion: 0.020000 | scale-factor: 1.000000 | runtime: 00:00:01 | bch(ms): 286.77 | smp(ms): 0.09 | fwd(ms): 132.94 | crit-fwd(ms): 3.32 | bwd(ms): 125.62 | optim(ms): 27.64 | loss: 3.01798 | train-TER: 10.45 | train-WER: 34.48 | 1_validate.lst-loss: 2.21500 | 1_validate.lst-TER: 7.39 | 1_validate.lst-WER: 19.52 | 2_validate.lst-loss: 3.74376 | 2_validate.lst-TER: 16.16 | 2_validate.lst-WER: 43.20 | avg-isz: 000 | avg-tsz: 064 | max-tsz: 117 | avr-batchsz: 4.00 | hrs: 0.00 | thrpt(sec/sec): 0.00 | timestamp: 2021-03-14 13:50:11 epoch: 1 | nupdates: 136 | lr: 0.025000 | lrcriterion: 0.020000 | scale-factor: 1.000000 | runtime: 00:00:01 | bch(ms): 259.46 | smp(ms): 0.09 | fwd(ms): 111.36 | crit-fwd(ms): 2.92 | bwd(ms): 114.13 | optim(ms): 33.48 | loss: 3.58643 | train-TER: 0.00 | train-WER: 0.00 | 1_validate.lst-loss: 2.20175 | 1_validate.lst-TER: 7.33 | 1_validate.lst-WER: 19.34 | 2_validate.lst-loss: 3.75308 | 2_validate.lst-TER: 16.19 | 2_validate.lst-WER: 43.17 | avg-isz: 000 | avg-tsz: 051 | max-tsz: 080 | avr-batchsz: 4.00 | hrs: 0.00 | thrpt(sec/sec): 0.00 | timestamp: 2021-03-14 16:07:29 epoch: 1 | nupdates: 140 | lr: 0.025000 | lrcriterion: 0.020000 | scale-factor: 1.000000 | runtime: 00:00:01 | bch(ms): 270.61 | smp(ms): 0.08 | fwd(ms): 119.74 | crit-fwd(ms): 2.99 | bwd(ms): 121.65 | optim(ms): 28.76 | loss: 2.22904 | train-TER: 0.00 | train-WER: 0.00 | 1_validate.lst-loss: 2.18414 | 1_validate.lst-TER: 7.27 | 1_validate.lst-WER: 19.19 | 2_validate.lst-loss: 3.76250 | 2_validate.lst-TER: 16.26 | 2_validate.lst-WER: 43.17 | avg-isz: 000 | avg-tsz: 061 | max-tsz: 076 | avr-batchsz: 4.00 | hrs: 0.00 | thrpt(sec/sec): 0.00 | timestamp: 2021-03-14 18:24:34 epoch: 1 | nupdates: 144 | lr: 0.025000 | lrcriterion: 0.020000 | scale-factor: 1.000000 | runtime: 00:00:01 | bch(ms): 262.31 | smp(ms): 0.09 | fwd(ms): 114.51 | crit-fwd(ms): 2.59 | bwd(ms): 116.05 | optim(ms): 31.13 | loss: 2.91033 | train-TER: 14.04 | train-WER: 34.88 | 1_validate.lst-loss: 2.16660 | 1_validate.lst-TER: 7.21 | 1_validate.lst-WER: 19.03 | 2_validate.lst-loss: 3.77681 | 2_validate.lst-TER: 16.36 | 2_validate.lst-WER: 43.17 | avg-isz: 000 | avg-tsz: 053 | max-tsz: 074 | avr-batchsz: 4.00 | hrs: 0.00 | thrpt(sec/sec): 0.00 | timestamp: 2021-03-14 20:41:40 epoch: 1 | nupdates: 148 | lr: 0.025000 | lrcriterion: 0.020000 | scale-factor: 1.000000 | runtime: 00:00:01 | bch(ms): 285.76 | smp(ms): 0.10 | fwd(ms): 117.49 | crit-fwd(ms): 3.06 | bwd(ms): 136.79 | optim(ms): 31.04 | loss: 4.59196 | train-TER: 0.00 | train-WER: 0.00 | 1_validate.lst-loss: 2.15192 | 1_validate.lst-TER: 7.17 | 1_validate.lst-WER: 18.92 | 2_validate.lst-loss: 3.78732 | 2_validate.lst-TER: 16.45 | 2_validate.lst-WER: 43.20 | avg-isz: 000 | avg-tsz: 049 | max-tsz: 119 | avr-batchsz: 4.00 | hrs: 0.00 | 
thrpt(sec/sec): 0.00 | timestamp: 2021-03-14 22:58:47 epoch: 1 | nupdates: 152 | lr: 0.025000 | lrcriterion: 0.020000 | scale-factor: 1.000000 | runtime: 00:00:00 | bch(ms): 241.73 | smp(ms): 0.11 | fwd(ms): 104.81 | crit-fwd(ms): 2.53 | bwd(ms): 107.50 | optim(ms): 28.86 | loss: 4.07885 | train-TER: 4.63 | train-WER: 19.05 | 1_validate.lst-loss: 2.13969 | 1_validate.lst-TER: 7.18 | 1_validate.lst-WER: 18.93 | 2_validate.lst-loss: 3.75785 | 2_validate.lst-TER: 16.46 | 2_validate.lst-WER: 43.20 | avg-isz: 000 | avg-tsz: 040 | max-tsz: 070 | avr-batchsz: 4.00 | hrs: 0.00 | thrpt(sec/sec): 0.00 | timestamp: 2021-03-15 01:15:51

tlikhomanenko commented 3 years ago

@tlikhomanenko your suggestion worked, I filtered out longer audio files.

1. I used this command to finetune the model on my dataset, which has ~400k training samples:
   `/flashlight/build/bin/asr/fl_asr_train fork model.bin --datadir=/home/   --train=1_train.lst,2_train.lst   --valid=1_validate.lst,2_validate.lst   --arch=arch.txt   --tokens=tokens.txt   --lexicon=lexicon.txt   --rundir=checkpoint2   --lr=0.025   --netoptim=sgd  --reportiters=4  --iter=25000000   --batchsize=4   --warmup=0  &>log2.txt`

a) Do I need to update the lexicon file so that it includes the new words that are part of the training set, or is this only needed when decoding with the language model?

If your tokens are letters - no need, it will fall back to the letter sequence for the unknown words. If it is word-pieces, then you should add them.

1. The training is taking too long (2 hrs plus based on the last iteration) and I still see epoch 1 after each iteration, as you can see from the log (pasted below). Do I need to change any parameters? I think it's probably to do with the reportiters. Based on the number of training samples, is there a way to decide on reportiters?

reportiters means after how many updates you run validation. You run it after every 4 updates, and I suppose your validation set is large, so you waste a lot of time on validation checks. Just set it to a value in the 500-3000 range; this often works well for me. Mostly I set validation to run every epoch or every 15-30 min of training.
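For example, taking your fork command from above, the only change would be the validation interval (1000 here is just an illustration inside that range, not a tuned value):

```bash
# Same fine-tuning command as before, but validating every 1000 updates
# instead of every 4; pick a value so that validation runs roughly every
# 15-30 min of training on your hardware.
/flashlight/build/bin/asr/fl_asr_train fork model.bin \
  --datadir=/home/ \
  --train=1_train.lst,2_train.lst \
  --valid=1_validate.lst,2_validate.lst \
  --arch=arch.txt --tokens=tokens.txt --lexicon=lexicon.txt \
  --rundir=checkpoint2 \
  --lr=0.025 --netoptim=sgd \
  --reportiters=1000 \
  --iter=25000000 --batchsize=4 --warmup=0 &>log2.txt
```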

2. For the language model decoder, is there any systematic way to tune the various parameters like lmweight, beam threshold, etc.?

I use random search, it works fine. You can do Bayesian optimization too, to find the best params with fewer tries, but so far random search is a really good option.
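For illustration, a random search can be just a shell loop around the decoder. The `fl_asr_decode` binary, its flags, the paths, and the sampling ranges below are assumptions to adapt to your own setup, not a prescription:

```bash
# Hypothetical random-search sketch over decoder hyperparameters; afterwards
# pick the run with the lowest dev-set WER. Ranges are arbitrary starting points.
mkdir -p decode_runs
for i in $(seq 1 64); do
  lmweight=$(awk -v seed=$RANDOM 'BEGIN{srand(seed); printf "%.2f", rand()*4.0}')
  wordscore=$(awk -v seed=$RANDOM 'BEGIN{srand(seed); printf "%.2f", 6.0*rand()-3.0}')
  /flashlight/build/bin/asr/fl_asr_decode \
    --am=checkpoint2/001_model_last.bin \
    --datadir=/home/ --test=1_validate.lst \
    --tokens=tokens.txt --lexicon=lexicon.txt --lm=lm.bin \
    --lmweight=$lmweight --wordscore=$wordscore \
    --beamsize=500 --beamthreshold=100 \
    > decode_runs/run_${i}_lmw${lmweight}_ws${wordscore}.log 2>&1
done
# Inspect the logs and keep the (lmweight, wordscore) pair with the best WER.
grep -H "WER" decode_runs/*.log
```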

3. During acoustic model training is there any need to trim silences before and after each utterance?

No need, as we also have a special token for silence, "|", and you can use "--surround=|" to add it at the beginning and end. Ideally, with training on silence your model should be more robust.
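Concretely, that is one extra flag on the fine-tuning command. The sketch below just reuses your command from above (quote the pipe character so the shell does not interpret it):

```bash
# Check that the silence / word-boundary token "|" is present in the token set
# (it should print 1 for the letter token set), then fine-tune with it
# surrounding each target transcription.
grep -c '^|$' tokens.txt

/flashlight/build/bin/asr/fl_asr_train fork model.bin \
  --datadir=/home/ --train=1_train.lst,2_train.lst --valid=1_validate.lst,2_validate.lst \
  --arch=arch.txt --tokens=tokens.txt --lexicon=lexicon.txt --rundir=checkpoint2 \
  --lr=0.025 --netoptim=sgd --reportiters=1000 --iter=25000000 --batchsize=4 --warmup=0 \
  --surround="|" &>log2.txt
```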


sciai-ai commented 3 years ago

@tlikhomanenko thanks for your reply again, it's really helpful.

I am able to fork the model with a more reasonable number for reportiters now. Some more questions around training:

  1. I forked the model with two training lst files and two validation lst files. I noticed that in the checkpoint directory there are 001_model_last.bin and two other bin files, one for each of the validation lst files, e.g. 001_val1.lst.bin, 001_val2.lst.bin. When I do decoding with these 3 bin files I get exactly the same results. Is there supposed to be any difference between these bin files?

  2. This is related to 1. also: when we do training and validation on several lst files vs one lst file with all datasets combined, would the optimisation run differently in a way that would affect the WER?

  3. Is there a command line flag that enables saving the model at each iteration in the checkpoint directory? I think the default behaviour is overwriting the model after each validation pass during training.

  4. This is more of a scientific question. I am wondering whether finetuning for too long might make the network forget most of what it was trained on? In that respect is there any proper way to control this. Ideally we expect the model should be able to generalize both to the original training data as well as the new finetuning data.

  5. To speed up training, is there any flag to enable AMP?

  6. I am using am_transformer_ctc_stride3_letters_70Mparams.bin as the base model,

a) am I right to say that this model cannot be used for streaming inference, and only streaming convnets are applicable in this regard? b) Will the performance improve if I used the bigger model am_transformer_ctc_stride3_letters_300Mparams.bin

b) will the performance improve if I used a bigger model, such as am_transformer_ctc_stride3_letters_300Mparams.bin, or switched to a conformer model instead?

tlikhomanenko commented 3 years ago

@tlikhomanenko thanks for your reply again, it's really helpful.

I am able to fork the model with a more reasonable number for reportiters now. Some more questions around training:

1. I forked the model with two training lst files and two validation lst files. I noticed that in the checkpoint directory there are 001_model_last.bin and two other bin files, one for each of the validation lst files, e.g. 001_val1.lst.bin, 001_val2.lst.bin. When I do decoding with these 3 bin files I get exactly the same results. Is there supposed to be any difference between these bin files?

If you have the same state performing best on all validation sets and at the same time this is the last checkpoint, then yes, this could happen.

2. This is related to 1. also: when we do training and validation on several lst files vs one lst file with all datasets combined, would the optimisation run differently in a way that would affect the WER?

Yep, this affects the best checkpoint selection. Best on average is not best on a subset. It is up to you to define how you want to evaluate or pick the best.

3. Is there a command line flag that enables to save the model for each iteration in the checkpoint directory. I think the default behaviour is overwriting the model after each validation pass during training.

We don't support that; we rewrite only if WER improves, otherwise the old state is preserved. You can tweak the code to save at every iteration or implement whatever logic you want.

4. This is more of a scientific question. I am wondering whether finetuning for too long might make the network forget most of what it was trained on? In that respect is there any proper way to control this. Ideally we expect the model should be able to generalize both to the original training data as well as the new finetuning data.

Depends how you train. If your lr is small, then probably it will not forget. Also an option is to include the lists you pretrained on into training from time to time.

5. To speedup training, is there any flag to enable amp.

Set the flag at https://github.com/facebookresearch/flashlight/blob/master/flashlight/app/asr/common/Flags.cpp#L332 to true.
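If the exact flag name differs across versions, you can list the AMP-related flags your build actually defines; the fl_amp_* name mentioned in the comment below is an assumption, not a confirmed name:

```bash
# List mixed-precision / AMP flags known to the train binary, then pass the
# relevant one as --<flag>=true on the fine-tuning command (in recent builds
# it is reportedly something like fl_amp_use_mixed_precision; verify first).
/flashlight/build/bin/asr/fl_asr_train --helpfull 2>&1 | grep -i "amp\|mixed"
```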

6. I am using am_transformer_ctc_stride3_letters_70Mparams.bin as the base model,

a) am I right to say that this model cannot be used for streaming inference, and only streaming convnets are applicable in this regard? b) Will the performance improve if I used the bigger model am_transformer_ctc_stride3_letters_300Mparams.bin

a) yes (for our current inference implementation, the code supports only TDS). b) probably yes, it is better; see https://github.com/facebookresearch/wav2letter/tree/master/recipes/rasr#wer

b) will the performance improve if I used a bigger model, such as, the am_transformer_ctc_stride3_letters_300Mparams.bin or switched to a conformer model instead?

probably yes

sciai-ai commented 3 years ago
  1. This is more of a scientific question. I am wondering whether finetuning for too long might make the network forget most of what it was trained on? In that respect is there any proper way to control this. Ideally we expect the model should be able to generalize both to the original training data as well as the new finetuning data.

Depends how you train. If your lr is small, then probably it will not forget. Also an option is to include the lists you pretrained on into training from time to time.

a) I am using --lr=0.025; is this small enough?

b) I don't have the pretraining lists; do I have to download all the datasets that were used to train the RASR model? If so, any data preparation code for the RASR recipes would be very helpful.

  1. I am using am_transformer_ctc_stride3_letters_70Mparams.bin as the base model, a) am I right to say that this model can not be used for streaming inference, and only streaming covnets is applicable in this regard? b) Will the performance improve if I used the bigger model am_transformer_ctc_stride3_letters_300Mparams.bin a) yes (for our current implementation for inference, code supports only TDS). b) probably yes, it is better, see https://github.com/facebookresearch/wav2letter/tree/master/recipes/rasr#wer

Is there any comparison of WER between the RASR models and streaming convnets?

b) will the performance improve if I used a bigger model, such as, the am_transformer_ctc_stride3_letters_300Mparams.bin or switched to a conformer model instead?

Is it possible to share the big LM arpa file corresponding to https://dl.fbaipublicfiles.com/wav2letter/rasr/tutorial/lm_common_crawl_large_4gram_prun0-0-5_200kvocab.bin

  1. During acoustic model training is there any need to trim silences before and after each utterance? No need as we have special token for silence too "|" and you can use "--surround=|" to use it at the beginning and end. Ideally with training on silence your model should be more robust.

a) Just want to confirm regarding this, as the tutorial here https://github.com/facebookresearch/flashlight/tree/master/flashlight/app/asr/tutorial mentions doing force alignment prior to fine-tuning.

b) Currently I just finetune with the start and stop in the list files, which includes silences at the beginning and end of each utterance. I am using the 70M transformer RASR model and not using any special tokens; my lexicon is {a-z}.

tlikhomanenko commented 3 years ago
  1. This is more of a scientific question. I am wondering whether finetuning for too long might make the network forget most of what it was trained on? In that respect is there any proper way to control this. Ideally we expect the model should be able to generalize both to the original training data as well as the new finetuning data.

Depends how you train. If your lr is small, then probably it will not forget. Also an option is to include the lists you pretrained on into training from time to time.

a) I am using --lr=0.025, is this small enough.

Depends on the optimizer; just check the original lr of the pretrained model.

b) I dont have the pretrained list, do I have to download all the datasets that were used to train the RASR model. If so if you have any data preparation codes for the RASR recipes that would be very helpful.

Yep, you need to download and prepare the data. We are planning to release the data processing soon. Right now you can use the Librispeech validation data to check how the model performs on it, to see that it is not forgetting. However, if the goal is to have the best performance on your own data, it is better to look only at that (if you are sure your test data will have a similar distribution and properties).

  1. I am using am_transformer_ctc_stride3_letters_70Mparams.bin as the base model, a) am I right to say that this model can not be used for streaming inference, and only streaming covnets is applicable in this regard? b) Will the performance improve if I used the bigger model am_transformer_ctc_stride3_letters_300Mparams.bin a) yes (for our current implementation for inference, code supports only TDS). b) probably yes, it is better, see https://github.com/facebookresearch/wav2letter/tree/master/recipes/rasr#wer

Is there any comparison of WER between RASR models vs streaming convnets.

You can compare them on Librispeech (see the papers where we report numbers on Librispeech). And transformers perform better than convs.

b) will the performance improve if I used a bigger model, such as, the am_transformer_ctc_stride3_letters_300Mparams.bin or switched to a conformer model instead?

Is it possible to share the big LM arpa file corresponding to https://dl.fbaipublicfiles.com/wav2letter/rasr/tutorial/lm_common_crawl_large_4gram_prun0-0-5_200kvocab.bin

The arpa file is 63 GB; do you still want me to share it? (For the small model I have published the arpa, check the rasr recipe readme.)

  1. During acoustic model training is there any need to trim silences before and after each utterance? No need as we have special token for silence too "|" and you can use "--surround=|" to use it at the beginning and end. Ideally with training on silence your model should be more robust.

a) Just want to confirm regarding this as the tutorial here https://github.com/facebookresearch/flashlight/tree/master/flashlight/app/asr/tutorial mentions to do force alignment prior to fine tuning.

Force alignment is needed if you have samples of, say, several minutes, because you cannot process such long sequences (you will get OOM), so you can use force alignment to detect silences and do proper segmentation of the whole audio into pieces.

b) Currently I just finetune with the start and stop in list files which includes silences at the beginning and end of each utterance. I am using the 70m transformer RASR model and not using any special tokens, my lexicon is {a-z}

Do you use the same token set as we used for RASR? Otherwise you need to change the last linear layer, which maps the embedding into output tokens.

For silence you can add it into the token set, provide it as the --wordseparator and set --surround to this silence token.

sciai-ai commented 3 years ago

Hi @tlikhomanenko, I am happy to report that I am getting very good results with the RASR recipe finetuned on my data; a similar approach on NEMO showed disappointing results, with the finetuned model doing worse than the base model, so thank you for this amazing work.

Is it possible to share the big LM arpa file corresponding to https://dl.fbaipublicfiles.com/wav2letter/rasr/tutorial/lm_common_crawl_large_4gram_prun0-0-5_200kvocab.bin

It is 63 GB for arpa file, do you still want me to share it? (For small model I have published the arpa, check rasr recipe readme).

Yes, can you please share it? I want to see how I can push this even further.

  1. During acoustic model training is there any need to trim silences before and after each utterance? No need as we have special token for silence too "|" and you can use "--surround=|" to use it at the beginning and end. Ideally with training on silence your model should be more robust.

a) Just want to confirm regarding this as the tutorial here https://github.com/facebookresearch/flashlight/tree/master/flashlight/app/asr/tutorial mentions to do force alignment prior to fine tuning.

Force alignment is needed if you have samples say several minutes, because you cannot process such long sequences (OOM will be), so you can use force alignment to detect silences and do proper segmentation of the whole audio into pieces.

b) Currently I just finetune with the start and stop in list files which includes silences at the beginning and end of each utterance. I am using the 70m transformer RASR model and not using any special tokens, my lexicon is {a-z}

Do you use the same token set as we used for RASR? Otherwise you need to change last linear layer which maps the embedding into output tokens.

Yes, I am using {a-z}.

For silence you can add it into the token set, provide it as the --wordseparator and set --surround to this silence token.

Sorry, I didn't get you completely on this; can you give an example please? I have not done this step but am still getting good results.

I have a couple of questions regarding limiting the decoder to output only words in the lexicon. However, I noticed some unexpected behavior.

  1. When I use the greedy decoder with uselexicon=true or false, it makes no difference to the output. I expected that when uselexicon=true the output would be restricted to words defined in the lexicon.

  2. Changing the acoustic model from the finetuned to the base model makes a huge difference to the output regardless of the value of the uselexicon flag. Is the acoustic model learning some kind of internal language model?

  3. I was trying the decoding with long audio files and the output seemed truncated; is there any integration available with a VAD, or do you suggest using the forced aligner? I found this [https://github.com/facebookresearch/libri-light/tree/master/data_preparation#running-voice-activity-detection-and-snr-computation] but the links inside seem broken.

tlikhomanenko commented 3 years ago

Hi @tlikhomanenko, I am happy to report that I am getting very good results with the RASR recipe finetuned on my data; a similar approach on NEMO showed disappointing results, with the finetuned model doing worse than the base model, so thank you for this amazing work.

Happy to hear, and glad that our models/work are useful =). If you can share numbers from our model compared to other frameworks' models, please post them (curious to know how it looks, but this is for sure up to you).

Yes, can you please share. I want to see how I can push this even further.

Uploading, will be available here https://dl.fbaipublicfiles.com/wav2letter/rasr/tutorial/lm_common_crawl_large_4gram_prun0-0-5_200kvocab.arpa

Sorry I didn't get you completely on this, can you give an example please. I have not done this step but still getting good results.

If you see some problems with silence prediction, or you want special silence prediction with your model, you can add, say, <SIL> to the {a-z} set and change the last linear layer to map not to 28 tokens (a-z ' |) but to 29 tokens. But your training data needs to contain this additional token for the model to be able to learn it.
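A rough sketch of the token-set side of that change (assuming the usual one-token-per-line tokens file; the final linear layer and your training targets still need matching updates):

```bash
# Append an explicit silence token to a copy of the token set.
cp tokens.txt tokens_sil.txt
echo "<SIL>" >> tokens_sil.txt
wc -l tokens.txt tokens_sil.txt   # token count should go from 28 to 29
# The acoustic model's last linear layer must then be resized from 28 to 29
# outputs (and re-initialized) before fine-tuning, and the training targets
# must actually contain <SIL> wherever silence should be predicted.
```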

I have a couple of questions regarding limiting the decoder to output words only in the lexicon. However i noticed some unexpected behavior.

1. When i use the greedy decoder with `uselexicon=true or false`, it makes no difference to the output. I expected that when uselexicon=true the output will be restricted to words defined in the lexicon.

Can you check whether you have OOV (out of vocabulary) words? Because if you don't have them, then it will be the same: https://github.com/flashlight/flashlight/blob/master/flashlight/app/asr/Test.cpp#L317. Also, if you have mistakes at the OOV positions when you use uselexicon=false, then you will get the same WER with and without the lexicon.
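A quick way to check for OOVs against your lexicon; the list-file and lexicon column layout assumed below may differ from yours, so adjust the field separators:

```bash
# Words in the validation transcripts that are missing from the lexicon.
# Assumes "<id> <path> <duration> <transcript...>" list lines and a lexicon
# whose first (tab- or space-separated) column is the word.
cut -d' ' -f4- 1_validate.lst | tr ' ' '\n' | sort -u > /tmp/test_words.txt
awk '{print $1}' lexicon.txt | sort -u > /tmp/lexicon_words.txt
comm -23 /tmp/test_words.txt /tmp/lexicon_words.txt | head -20
```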

2. Changing the acoustic model from the finetuned to the base model makes a huge difference to the output regardless of the value of the uselexicon flag. Is the acoustic model learning some kind of internal language model?

See the above comment. And yes, the model is learning an LM inside, so you can observe at some point that viterbi improves by, say, 1% absolute, but after decoding you see only a 0.5% improvement.

3. I was trying the decoding with long audio files and the output seemed truncated, is there any integration available with a VAD or do you suggest using the forced aligner? I found this [https://github.com/facebookresearch/libri-light/tree/master/data_preparation#running-voice-activity-detection-and-snr-computation] but the links inside seem broken

Several people reported, and we also noticed, that the model performs poorly on long sequences. So yep, do segmentation before. For segmentation you can check the tutorial we did: https://github.com/flashlight/flashlight/blob/master/flashlight/app/asr/tutorial/notebooks/InferenceAndAlignmentCTC.ipynb (check the alignment section; even with any current model you can do segmentation, using some blank/silence duration threshold to make sure it is really a very confident place to cut the audio).
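If you want a quick external alternative to the notebook's blank-based segmentation, an energy-based split with sox can work as a rough first pass; the sox invocation and the thresholds below are assumptions to tune, not part of the flashlight tooling:

```bash
# Split a long recording at silences of at least 0.5 s below 1% amplitude,
# producing chunk001.wav, chunk002.wav, ... which can then be decoded separately.
sox long_input.wav chunk.wav silence 1 0.2 1% 1 0.5 1% : newfile : restart
soxi -D chunk*.wav   # durations of the resulting segments, useful for the list file
```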

I am then closing the issue, as it seems everything works for you now, but feel free to post other questions and comments here or as new issues in flashlight directly. Will be happy to help further!

sciai-ai commented 3 years ago

Hi @tlikhomanenko, I am happy to report that I am getting very good results with the RASR recipe finetuned on my data; a similar approach on NEMO showed disappointing results, with the finetuned model doing worse than the base model, so thank you for this amazing work.

Happy to hear and glad that our models/work is useful =). If you can share numbers from our model compared to other frameworks models please post (curious to know how it looks like, but this is for sure up to you).

Sure, I thought you would be interested to know, and it might benefit others too.

On a large dataset (2000 hrs) the finetuned model reached a dev WER of ~5% after 100 iterations and a test WER of 15%.

On an independent in-house dataset (30 mins) with standard English vocabulary but varying accents, WER dropped from 24% (KALDI) to 14% (RSR-25M) to 12% (RSR-25M fine-tuned from the larger dataset). This dataset was not used in the fine-tuning step and is totally independent in terms of recording conditions, speakers, transcripts, etc.

If you want to know about NEMO: I did not do a very comprehensive test, but even after several iterations the model was not able to show improvements over the baseline model on the test sequences.

I have a couple of questions regarding limiting the decoder to output words only in the lexicon. However i noticed some unexpected behavior.

1. When i use the greedy decoder with `uselexicon=true or false`, it makes no difference to the output. I expected that when uselexicon=true the output will be restricted to words defined in the lexicon.

Can you check that you have OOV (out of vocabulary) words, because if you don't have them - then it will be the same https://github.com/flashlight/flashlight/blob/master/flashlight/app/asr/Test.cpp#L317. Also if you have mistakes on OOV positions when use uselexicon=false then you will get the same WER with and without lexicon.

Can you please share a bit more on this, as I am not sure how to find OOV words from the code itself. Where are the OOV words specified? The testing sequences include some words which were not seen in training by the acoustic model. I added these words to a new lexicon file hoping that they could then be decoded, but did not notice any difference. I also reduced the lexicon file to just a couple of words and yet it shows the same output as with a bigger lexicon. The only difference I noticed is that the WER changes, but the actual output is the same when I change the uselexicon flag value (lower WER when uselexicon=true and vice versa).

3. I was trying the decoding with long audio files and the output seemed truncated, is there any integration available with a VAD or do you suggest using the forced aligner? I found this [https://github.com/facebookresearch/libri-light/tree/master/data_preparation#running-voice-activity-detection-and-snr-computation] but the links inside seem broken

Several people reported and we also noticed that that model performs poorly on long sequences. So yep, do segmentation before. For segmentation you can check tutorial we did https://github.com/flashlight/flashlight/blob/master/flashlight/app/asr/tutorial/notebooks/InferenceAndAlignmentCTC.ipynb (check alignment section, so even with current any model you can do segmentation, use some blank/silence duration time to make sure that it is really very confident place to cut the audio).

  1. Did you mean VAD, as alignment means you already have a transcript at hand and you want to get the time-stamps of it?

  2. For my use case, I expect users to send audio files from a few seconds to several minutes for the model to decode. I tried the VAD from the notebook with my model. The VAD output seems to be the output of the viterbi decoder; I thought that, since we specified an LM, it should be the output from the language model decoder in the *.tsc files.

  3. For a longer audio file (30 seconds) I am getting this .tsc output, which is totally off:

lcetaorisolate|uooi

It would be very helpful if I could improve the performance with longer sentences. I am wondering whether there is any inherent limitation in wav2letter which makes it only work with shorter sentences?

tlikhomanenko commented 3 years ago

Hey @sciai-ai, would you mind fixing your above comment to make sure the replies have the correct formatting? I cannot distinguish my responses from your new questions now =)

sciai-ai commented 3 years ago

@tlikhomanenko Sorry about the formatting earlier, I have fixed it.

tlikhomanenko commented 3 years ago

Thanks!

Can you please share a bit more on this, as I am not sure how to find OOV words from the code itself. Where are the OOV words specified, the testing sequences includes some words which were not seen in training by the acoustic model. I added these words in a new lexicon file hoping that these can then be decoded but did not notice any difference. I also reduced the lexicon file to a couple of words only and yet it shows the same output as having a bigger lexicon. The only difference i Noticed is that the WER changes but the actual output is the same when change the uselexicon flag value (lower WER when uselexicon=True and vice versa).

The output from Test.cpp is your predictions at the token level, so it doesn't matter which lexicon you use or whether you use a lexicon at all; the output is just your exact predictions. The only thing which can change is WER, because it is computed at the word level: either we use the lexicon to map the letter sequence to words, or we don't use the lexicon and just treat the word separator as a space while the rest of the tokens are joined together. In the case of using the lexicon, OOVs are defined as words which are not in the lexicon.
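As a toy illustration of that collapsing rule (just the idea, not the actual Test.cpp code):

```bash
# Collapse a letter-level hypothesis into words: drop the spaces between
# tokens, then turn the "|" word separator into a real space.
echo "h e l l o | w o r l d" | sed -e 's/ //g' -e 's/|/ /g'
# prints: hello world
```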

1. Did you mean VAD, as alignment means you already have a transcript at hand and you want to get the time-stamps of it?

Sorry, yep, VAD if you have no transcripts. You can use blank/silence predictions to segment.

2. For my use case, I expect users to send audio files from a few seconds to several minutes for the model to decode.
   I tried the VAD from the notebook with my model. The vad output seems to be output of viterbi decoder , i thought since we specified lm should be the output from language model decoder in the *.tsc files

I think VAD on viterbi should work fine (and it is quick), as mostly you need the blank predictions and they are predicted very strongly even without the LM. Otherwise you can run Decode.cpp and just add printing of all frames from the best hypothesis into files.

3. For a longer audio file (30 seconds) I am getting this .tsc which is totally off

lcetaorisolate|uooi

This is strange; for 30s it should work, as Librispeech includes long sequences. Do you have 30s audio in the data on which you fine-tune? What is the output if you apply the RASR model without fine-tuning on this sample?

It would be very helpful if I can improve the performance with longer sentences. I am wondering is there any inherent limitation in wav2letter which makes it only work with shorter sentences?

The problem is the data on which we train and what positional embedding / transformer variant we use and how. As soon as we have any update on this for long sequences, I will post. One thing you can do is finetune on longer sequences.

sciai-ai commented 3 years ago

Thanks!

You are welcome :)

Can you please share a bit more on this, as I am not sure how to find OOV words from the code itself. Where are the OOV words specified, the testing sequences includes some words which were not seen in training by the acoustic model. I added these words in a new lexicon file hoping that these can then be decoded but did not notice any difference. I also reduced the lexicon file to a couple of words only and yet it shows the same output as having a bigger lexicon. The only difference i Noticed is that the WER changes but the actual output is the same when change the uselexicon flag value (lower WER when uselexicon=True and vice versa).

The output from test.cpp is your predictions on token level, so it doesn't matter what lexicon, use/not use lexicon you use, output is just your exact predictions. The only thing which can change is WER because it is computed on word level and we use lexicon to map letter sequence to words or we don't use lexicon and just treat wordseparator as speca the rest tokens just join together. In case of use lexicon OOVs are defined as words which are not in lexicon.

That makes perfect sense to me now.

1. Did you mean VAD, as alignment means you already have a transcript at hand and you want to get the time-stamps of it?

Sorry, yep, VAD if you have no transcripts. You can use blank/silence predictions to segment.

2. For my use case, I expect users to send audio files from a few seconds to several minutes for the model to decode.
   I tried the VAD from the notebook with my model. The vad output seems to be output of viterbi decoder , i thought since we specified lm should be the output from language model decoder in the *.tsc files

I think VAD on viterbi should work fine (even it is quick) as mostly you need blank predictions and they are predicted very strongly even without LM. Otherwise you can run decode.cpp and just add printing into files all frames from the best hyp.

3. For a longer audio file (30 seconds) I am getting this .tsc which is totally off

lcetaorisolate|uooi

This is strange, for 30s it should work as Librispeech include long sequences. Do you have 30s audio in your data on which you ft? What is the output if you apply RASR model without ft on this sample?

It would be very helpful if I can improve the performance with longer sentences. I am wondering is there any inherent limitation in wav2letter which makes it only work with shorter sentences?

The problem is the data on which we train and what / how positional embedding / transformer we use. As soon as we have any update on this for long sequences, will post. One thing you can do it finetune on longer sequences.

A few points on this:

  1. You are right; when I used the RASR model, the decoder can output reasonable results.
  2. My finetuned model was limited to audio at most 15 seconds long. The reason was OOM (quoting our previous conversation on this):

The error says itself what is the problem "Failed to allocate memory of size 152.00 MiB (Device: 0, Capacity: 14.75 GiB, Allocated: 13.93 GiB, Cached: 721.55 MiB) with error 'ArrayFire Exception (Device out of memory:101):" - so you have OOM on GPU.

As soon as this happens on some particular batch - check what is the longest audio you have. Probably you need to filter your list file from the very long audio to be ok with memory with your current batch size.

  1. As you pointed out, the ability of the model to properly recognise longer sentences seems to be directly affected by the length distribution of the training samples. What if the training data only has shorter sentences, or one is constrained by GPU memory? What GPUs would you recommend that would help to alleviate this issue to some degree?
tlikhomanenko commented 3 years ago

My finetuned model was limited to 15 seconds long audio. The reason was OOM (quoting our previous conversation on this):

So if you are restricted by memory during training, then at test time one solution is to do segmentation and apply the model on the same 15s segments. If I have some other solution in the meantime, I will share it.

As you pointed out, the ability of the model to properly recognize longer sentences seems to be directly affected by the length distribution of the training samples. What if the training data only has shorter sentences and or constrained by the GPU memory? What GPUs would you recommend that would help to alleviate this use to some degree?

Use smaller batches and longer sequences. Right now per GPU we can pack 240s on a 32 GB GPU. One obvious thing is to just stack samples together to form a long sequence and use batch=1. Other solutions are pure research. Also you can try the streaming approaches people use for transformers, or conv-based models where only local dependencies are used.
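A sketch of the sample-stacking idea (the file names, the per-utterance transcript files, the list-file layout, and the duration unit below are all assumptions; transcripts must be concatenated in the same order as the audio):

```bash
# Concatenate several short utterances into one long training sample and
# emit a single list-file line for it ("<id> <path> <duration> <transcript>").
sox utt1.wav utt2.wav utt3.wav stacked_0001.wav
dur=$(soxi -D stacked_0001.wav)   # seconds; check whether your .lst files use seconds or ms
echo "stacked_0001 $(pwd)/stacked_0001.wav $dur $(cat utt1.txt utt2.txt utt3.txt | tr '\n' ' ')" >> long_train.lst
# Then fine-tune on long_train.lst with --batchsize=1 so one long sequence fits in GPU memory.
```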

sciai-ai commented 3 years ago

My finetuned model was limited to 15 seconds long audio. The reason was OOM (quoting our previous conversation on this):

So if you are restricted with memory on training, then at test time one of the solution is do segmentation and apply on the same 15s. If I have some solution meantime will share.

As you pointed out, the ability of the model to properly recognize longer sentences seems to be directly affected by the length distribution of the training samples. What if the training data only has shorter sentences and or constrained by the GPU memory? What GPUs would you recommend that would help to alleviate this use to some degree?

Use smaller batches and longer seq. Right now per GPU we can pack 240s for 32Gb gpu. One obvious thing is just stack samples together to form long seq and use batch=1. Other solutions are pure research. Also you can try to use streaming models people do for transformers or conv based models where locality dependencies only used.

@tlikhomanenko

  1. For concatenating longer sequences, will the model be able to recognise the full stop between sentences and output it in a similar way during decoding?
  2. What would be possible research topics? I was wondering whether it is related to the model architecture / attention mechanism.
  3. I wanted to know if any augmentation (e.g. SpecAugment) is applied to the baseline RASR model. Is there any way I can apply augmentation to the baseline model too, to make it noise robust, for example? Or does this happen by default when we finetune?
  4. For the streaming ASR model, can I finetune the base model in the same way as the RASR model? It would be nice to have a RASR streaming counterpart in which the baseline model has been pre-trained on several datasets. I think the current streaming base model has only been trained on Librispeech?
  5. I noticed that the decoder takes a while to load up the language model, and that takes the bulk of the time in inference. Is there a way to do a one-time model load so that subsequent decoding/inference is fast?
tlikhomanenko commented 3 years ago

Hey!

1. For concatenating longer sequences, will the model we able to recognise the full stop between sentences? and output  in a similar way during decoding.

What do you mean by full stop?

2. What would be possible research topics, I was wondering is it related to the model architecture/ attention mechanism ?

Open question :) One example in my mind: for sure streaming models will not have this problem, as they are streaming and restricted to a particular context.

3. I wanted to know if any augmentation (e.eg specaugment) is applied to the baseline RASR model. Is there any way i can apply augmentation in the baseline model too, to make it noise robust, for example? or this is happening by default when we finetune?

specaug is used in the baseline models. If you fine-tune, it will be used after a certain number of updates. To tweak it, check the params here: https://github.com/flashlight/flashlight/blob/master/flashlight/app/asr/common/Flags.cpp#L151-L156, https://github.com/flashlight/flashlight/blob/master/flashlight/app/asr/common/Flags.cpp#L244-L266 (check in the log what the values for them are when you run fine-tuning).
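For example, since the fine-tuning command above redirects its output to log2.txt, you can check which SpecAugment values are actually in effect (assuming the run dumps its gflags to the log, which it normally does):

```bash
# The SpecAugment settings are the saug_* flags defined in the Flags.cpp
# lines linked above; look them up in the training log.
grep -i "saug" log2.txt | head
```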

4. For streaming ASR model, can i finetune the base model in the same way as the RASR model. Would be nice if we have RASR streaming counterpart in which the baseline model has been pre-trained on several datasets. I think the current streaming base model has only been trained on librispeech?

Yep, agree. For sure if we have something we will release it :)

5. I noticed that the decoder takes a while to load up the language model and that takes the bulk of the time in inference. Is there a way to do a one time model loading so that subsequent decoding/ inference is fast?

You need to tweak the code to have an infinite loop fetching inputs. For an example of this, check https://github.com/flashlight/flashlight/blob/master/flashlight/app/asr/tutorial/InferenceCTC.cpp (I did it there, so you can provide new inputs and it will process them without reloading the models).