AFAIK the current version does not include a learnable front-end or a convolutional language model. The n-gram language model is trained using KenLM; the Train binary trains the acoustic model. (Note: I'm not one of the authors, so please correct me if any of this is wrong.)
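For reference, training the n-gram LM with KenLM looks roughly like this (a sketch; the corpus file and the n-gram order are placeholders, not something wav2letter prescribes):

```
# Train a 4-gram LM on a normalized text corpus with KenLM,
# then convert the ARPA file to KenLM's binary format for faster loading.
bin/lmplz -o 4 < corpus.txt > lm.arpa
bin/build_binary lm.arpa lm.bin
```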
Could you please share what led you to think that? If that is the case, there is much more work to do...
The reason I assumed it provided everything but the language model is the data preparation guide: it mentions that we need a pre-trained language model, but says nothing about the front end.
The current code extracts MFCCs (or MFSCs), i.e., it does not use (or train) a "learnable front end" in the sense of Zeghidour et al. (2018) (see https://github.com/facebookresearch/wav2letter/blob/master/src/data/Featurize.cpp).
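As far as I can tell, the feature type is selected by flags rather than learned. Something like the following should work (flag names taken from the recipes; I haven't checked every version):

```
# Select log-mel filterbank (MFSC) features instead of MFCC:
./Train train --flagsfile train.cfg --mfsc=true --mfcc=false
```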
@isaacleeai — your understanding is correct. The acoustic model generates emissions of units (generally letters) and computes an LER figure with respect to some criterion. The decoding step takes these emissions and, together with a lexicon and a language model, produces transcriptions. The Train binary that wav2letter produces, for now, only trains the acoustic model, and only requires a collection of tokens (a "dictionary").
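To make that concrete, a minimal acoustic-model run might look like the sketch below (file names and paths are illustrative; see the recipes for real configurations):

```
# tokens.txt lists one acoustic unit per line, e.g. the letters a-z
# plus "|" as the word separator; the exact set depends on the recipe.
./Train train \
  --flagsfile train.cfg \
  --tokensdir data \
  --tokens tokens.txt
```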
@entn-at is correct — we haven't yet open-sourced the learnable front-end or the decoder that supports convolutional language models, but this will be happening soon!
@entn-at Thanks for the clarification. So instead of the learnable front-end, it uses either MFCC or MFSC features? Does that mean Train.cpp computes one of MFCC or MFSC and trains the acoustic model?
@jacobkahn I'm a little confused after reading the responses from you and @entn-at. Just to make sure: Train.cpp only trains the acoustic model; it does not train a learnable front-end or a convolutional language model. I successfully followed the steps provided in the tutorial. So does that mean that, in the tutorial, there is no learnable front-end in the code at all, or are we using a pretrained one, like we are for the convolutional language model?
When training for a different language, my understanding is that I need to provide: 1) a lexicon, 2) tokens, 3) a training dataset, and 4) a pretrained language model. And when I run Train.cpp, it will train only two of the three parts: the learnable front-end and the acoustic model, but not the language model. Am I correct? Is there any other input we need to provide in order to train the model for a different language?
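For concreteness, here is what I'm preparing; the formats below are my reading of the data preparation guide, so please correct me if they're off:

```
# train.lst: one sample per line: [id] [audio path] [duration in ms] [transcript]
utt_0001 /data/audio/utt_0001.wav 4120 hello world

# lexicon.txt: a word followed by its token spelling, ending in the word separator
hello h e l l o |
world w o r l d |
```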
Also, it says we can use a pretrained n-gram model instead of the convolutional language model. However, in the paper "Fully Convolutional Speech Recognition", the difference in WER between the two LMs is about 1%. So is there no way to train the convolutional language model?
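Right now I am pointing the decoder at the n-gram model roughly like this (flag names are my guess from the docs, so they may not match the released decoder exactly):

```
# Decode with the pretrained KenLM n-gram model:
./Decoder \
  --am model.bin \
  --lexicon lexicon.txt \
  --lm lm.bin \
  --lmweight 2.5 \
  --test test.lst
```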
Thanks