Updated to PyTorch 1.2 and Python 3.
This is a CTC-based speech recognition system built with PyTorch.
At present, the system only supports phoneme recognition.
Word-level recognition is also possible, but it may yield a high error rate.
An alternative is to decode with a lexicon and a word-level language model using a WFST, which is not included in this system.
English Corpus: Timit
Chinese Corpus: 863 Corpus
Speaker | UtterId | Utterances |
---|---|---|
M50, F50 | A1-A521, AW1-AW129 | 650 sentences |
M54, F54 | B522-B1040, BW130-BW259 | 649 sentences |
M60, F60 | C1041-C1560, CW260-CW388 | 649 sentences |
M64, F64 | D1-D625 | 625 sentences |
All | | 5146 sentences |
Speaker | UtterId | Utterances |
---|---|---|
M51, F51 | A1-A100 | 100 sentences |
M55, F55 | B522-B621 | 100 sentences |
M61, F61 | C1041-C1140 | 100 sentences |
M63, F63 | D1-D100 | 100 sentences |
All | | 800 sentences |
pip3 install visdom
python -m visdom.server
pip install -r requirements.txt
bash run.sh      # data_prepare + AM training + LM training + testing
bash run.sh 1    # AM training + LM training + testing
bash run.sh 2    # LM training + testing
bash run.sh 3    # testing
RNN LM training is not implemented yet; it has been added to the to-do list.
The learning rate is adjusted once the dev loss has stayed around the same value ten times.
The number of learning rate adjustments is 8, which can be changed in steps/train_ctc.py (line 367).
The optimizer is torch.optim.Adam with weight decay 0.005.
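The adjustment rule can be sketched in plain Python. This is a minimal illustration, not the code in steps/train_ctc.py: the class name `PlateauSchedule`, the tolerance `tol`, the decay factor of 0.5, and the reading of "ten times" as ten consecutive evaluations near the same loss are all assumptions.

```python
class PlateauSchedule:
    """Hypothetical plateau-based learning-rate schedule (illustration only)."""

    def __init__(self, lr, decay=0.5, patience=10, max_adjust=8, tol=0.01):
        self.lr = lr
        self.decay = decay            # assumed decay factor
        self.patience = patience      # "ten times" from the text
        self.max_adjust = max_adjust  # 8 adjustments from the text
        self.tol = tol                # assumed width of "around the same value"
        self.ref_loss = None          # loss the current plateau is measured against
        self.stalled = 0              # evaluations spent near ref_loss
        self.adjusted = 0             # adjustments made so far

    def step(self, dev_loss):
        """Record one dev loss and return the (possibly reduced) learning rate."""
        if self.ref_loss is not None and abs(dev_loss - self.ref_loss) < self.tol:
            self.stalled += 1
        else:
            # The loss moved away from the plateau: start a new reference.
            self.ref_loss = dev_loss
            self.stalled = 0
        if self.stalled >= self.patience and self.adjusted < self.max_adjust:
            self.lr *= self.decay
            self.adjusted += 1
            self.stalled = 0
        return self.lr
```

In a PyTorch training loop, the returned value would typically be written into the `lr` entry of each group in `optimizer.param_groups` after every dev evaluation.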
Take the most probable output at each frame as the result and read off the path.
Calculate the WER and CER using the functions of the class.
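The two steps above can be sketched as follows. This is a minimal pure-Python illustration rather than the repository's decoder class: the function names are hypothetical, the blank label is assumed to be index 0, and the error rate is assumed to be the Levenshtein distance divided by the reference length.

```python
def greedy_decode(frame_probs, blank=0):
    """Pick the arg-max label per frame, merge repeats, then drop blanks."""
    best_path = [max(range(len(p)), key=p.__getitem__) for p in frame_probs]
    decoded, prev = [], None
    for label in best_path:
        if label != prev and label != blank:
            decoded.append(label)
        prev = label
    return decoded


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences, using a single DP row."""
    row = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        diag, row[0] = row[0], i
        for j, h in enumerate(hyp, start=1):
            substitute = diag + (r != h)
            # Right-hand side is evaluated first, so row[j] is still the old value.
            diag, row[j] = row[j], min(row[j] + 1, row[j - 1] + 1, substitute)
    return row[-1]


def error_rate(ref, hyp):
    """Edit distance normalized by the reference length (ref must be non-empty)."""
    return edit_distance(ref, hyp) / len(ref)
```

Passing lists of words gives a WER-style score, while passing strings or lists of characters gives a CER-style score.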
Implemented in Python. Original Code
I modified it to support phonemes in batch decoding.
Beam search improves phoneme accuracy by about 0.2%.
A phoneme-level language model is now integrated into the beam search decoder.
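To illustrate how a language model can enter a CTC beam search, here is a minimal prefix beam search in plain Python. It is a sketch under stated assumptions, not the decoder used in this system: the blank label is assumed to be index 0, `lm(prefix, label)` is a hypothetical callable returning P(label | prefix), and `alpha` is an assumed LM weight. Probabilities are kept in the linear domain for readability; a practical decoder would normally work in log space.

```python
from collections import defaultdict


def prefix_beam_search(frame_probs, beam_size=10, blank=0, lm=None, alpha=1.0):
    """CTC prefix beam search over per-frame probability rows."""
    # Each prefix stores (prob of paths ending in blank, prob ending in non-blank).
    beams = {(): (1.0, 0.0)}
    for probs in frame_probs:
        nxt = defaultdict(lambda: [0.0, 0.0])
        for prefix, (p_b, p_nb) in beams.items():
            for label, p in enumerate(probs):
                if label == blank:
                    # A blank keeps the prefix unchanged.
                    nxt[prefix][0] += (p_b + p_nb) * p
                    continue
                extended = prefix + (label,)
                # The LM score is applied only when a new label is appended.
                weight = lm(prefix, label) ** alpha if lm is not None else 1.0
                if prefix and prefix[-1] == label:
                    # A repeat with no blank in between collapses into the prefix.
                    nxt[prefix][1] += p_nb * p
                    # Only paths ending in blank can append the same label again.
                    nxt[extended][1] += p_b * p * weight
                else:
                    nxt[extended][1] += (p_b + p_nb) * p * weight
        ranked = sorted(nxt.items(), key=lambda kv: sum(kv[1]), reverse=True)
        beams = {prefix: tuple(score) for prefix, score in ranked[:beam_size]}
    best_prefix, best_score = max(beams.items(), key=lambda kv: sum(kv[1]))
    return list(best_prefix), sum(best_score)
```

Unlike greedy decoding, this sums the probability of all alignments that collapse to the same label sequence, which is one reason a beam search can recover sequences that the per-frame arg-max misses. Once an LM weight is applied, the returned score is no longer a normalized probability.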