flashlight / wav2letter

Facebook AI Research's Automatic Speech Recognition Toolkit
https://github.com/facebookresearch/wav2letter/wiki

Training for a different language #181

Closed: isaacleeai closed this issue 5 years ago

isaacleeai commented 5 years ago

When training for a different language, my understanding is that I need to provide: 1) a lexicon, 2) tokens, 3) a training dataset, and 4) a pretrained language model. And when I run train.cpp, it will train only two of the three components (the learnable front-end and the acoustic model), not the language model. Am I correct? Is there any other input we need to provide in order to train the model for a different language?

Also, it says we can use a pretrained n-gram model instead of the convolutional language model. However, in the paper "Fully Convolutional Speech Recognition", the difference in WER between the two LMs is 1%. So is there no way to train the convolutional language model?

Thanks

entn-at commented 5 years ago

AFAIK the current version does not include a learnable front-end or a convolutional language model. The n-gram language model is trained using KenLM. The Train binary trains the acoustic model. (Note: I'm not one of the authors; please correct me if what I wrote is not the case.)
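
For reference, a minimal KenLM recipe looks roughly like this (a sketch: the corpus path and n-gram order here are placeholders, and you need the `lmplz` and `build_binary` tools from a KenLM build):

```sh
# Train a 4-gram LM on a normalized text corpus (one sentence per line)
lmplz -o 4 < corpus.txt > lm.arpa

# Optionally convert to KenLM's binary format for faster loading
build_binary lm.arpa lm.bin
```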

isaacleeai commented 5 years ago

Could you please share what led you to think that? If that is the case, there is much more work to do...

The reason I assumed it provided everything but the language model is that the data preparation guide mentions we need a pre-trained language model, but says nothing about the front-end.

entn-at commented 5 years ago

The current code extracts MFCCs (or MFSCs), i.e., it does not use (or train) a "learnable front end" in the sense of Zeghidour et al. (2018) (see https://github.com/facebookresearch/wav2letter/blob/master/src/data/Featurize.cpp).
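
To make the MFSC/MFCC distinction concrete, here is a rough sketch of the relationship between the two (this is not wav2letter's actual code, which lives in Featurize.cpp; it uses librosa purely for illustration, and the filter/coefficient counts are illustrative choices, not wav2letter's exact defaults):

```python
import librosa

# Load audio at 16 kHz (a typical ASR sampling rate)
y, sr = librosa.load("sample.wav", sr=16000)

# MFSC: log mel-filterbank energies (no DCT applied)
melspec = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=40)
mfsc = librosa.power_to_db(melspec)

# MFCC: a DCT applied on top of the log mel energies,
# keeping only the first few coefficients
mfcc = librosa.feature.mfcc(S=mfsc, n_mfcc=13)
```

Either way, these are fixed signal-processing transforms with no learnable parameters, which is why they are not a "learnable front-end".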

jacobkahn commented 5 years ago

@isaacleeai — your understanding is correct. The acoustic model generates emissions of units (generally letters) and computes a letter error rate (LER) with respect to some criterion. The decoding step takes these emissions and, with a lexicon and a language model, produces transcriptions. The Train binary that wav2letter produces currently trains only the acoustic model, and only requires a collection of tokens (a "dictionary").
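
As a rough sketch of those two inputs (my reading of the data preparation docs; check the wiki for the exact format conventions), a tokens file lists one acoustic unit per line, and the lexicon maps each word to its spelling as a sequence of those tokens:

```
# tokens.txt: one acoustic unit per line ("|" is the word separator)
|
'
a
b
c
# (one line per remaining letter)

# lexicon.txt: a word, then its spelling as a token sequence
able    a b l e |
about   a b o u t |
```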

@entn-at is correct — we haven't yet open sourced the learnable frontend or the decoder that supports convolutional language models, but this will be happening soon!

isaacleeai commented 5 years ago

@entn-at Thanks for the clarification. So instead of the learnable front-end, it uses either MFCCs or MFSCs? Does that mean Train.cpp trains one of MFCC or MFSC together with the acoustic model?

@jacobkahn I'm a little confused after reading the responses from you and @entn-at. Just to make sure:

  1. Train.cpp only trains the acoustic model and does not train the learnable front-end or the convolutional language model. I was able to successfully follow the steps provided in the tutorial. So does that mean that, in the tutorial, there is no learnable front-end in the code, or are we using a pretrained one, like we are for the convolutional language model?
  2. Could you give us any window for when to expect the release (very vague is fine; weeks vs. months vs. years is really all I need, so I know whether we should develop it ourselves or wait)? And will the release contain both the trainable front-end and the convolutional language model?

jacobkahn commented 5 years ago

  1. wav2letter++ does not currently support a learnable front-end. There is no pretrained front-end either: features are raw mel-filterbanks (or power spectra, or whatever else you specify). The LM is only involved in the decoding stage, and we have not open-sourced decoding with the convolutional LM.
  2. My guess is a few weeks for the learnable front-end and a few months for the conv LM.
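
Concretely, the feature type is selected via gflags in the training flags file; a rough sketch (these flag names are my recollection of the recipes at the time, so verify them against the current flag definitions in the source):

```
# Use log mel-filterbank (MFSC) features
--mfsc=true
--mfcc=false
--pow=false
# Number of mel filters (assumed default; verify)
--filterbanks=40
```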