NVIDIA / NeMo

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech).
https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html
Apache License 2.0

KenLM with ASR #3221

Closed dungnguyen98 closed 2 years ago

dungnguyen98 commented 2 years ago

Hi, I'm fine-tuning QuartzNet (a character model) and CTC-Conformer (a BPE model) on a new language, and the results are good. Then I used KenLM with both. I have some questions:

  1. When I downloaded and inspected the "3-gram.pruned.1e-7.arpa" file from this tutorial, I noticed that it is a word-level LM while QuartzNet is character-level, yet it works well. Can you explain how this works?
  2. I trained KenLM for CTC-Conformer (the BPE model) following the ASR language modeling tutorial, and the training succeeded. But when I apply it with beam search to CTC-Conformer, it gives irrelevant results (WER up to 40%), while greedy search gives a WER of ~9%. I don't know why it doesn't work. Can you give me some advice? Thank you!
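On question 1, here is a toy sketch, independent of NeMo's actual decoder, of how a character-level beam search can use a word-level LM: hypotheses are plain character strings, and the LM is consulted each time a space completes a word, with scores combined as acoustic + alpha * LM + beta * word count. The `lm_logprob` function below is a hypothetical stand-in for a real KenLM query (in practice the `kenlm` package's model would be scored with the full word history).

```python
import math

def lm_logprob(word: str) -> float:
    # Hypothetical stand-in for a word-level n-gram lookup; a real
    # decoder would query KenLM here with the preceding word history.
    toy_unigrams = {"the": 0.1, "cat": 0.05, "sat": 0.05}
    return math.log(toy_unigrams.get(word, 1e-6))

def score_hypothesis(chars: str, acoustic_logprob: float,
                     alpha: float = 1.0, beta: float = 0.5) -> float:
    # Combine scores the way CTC beam search rescoring typically does:
    # acoustic log-prob + alpha * LM log-prob + beta * word count.
    words = [w for w in chars.strip().split(" ") if w]
    lm_score = sum(lm_logprob(w) for w in words)
    return acoustic_logprob + alpha * lm_score + beta * len(words)

# Two character hypotheses with similar acoustic scores: the
# word-level LM breaks the tie in favor of in-vocabulary words.
print(score_hypothesis("the cat sat", acoustic_logprob=-12.0))  # higher
print(score_hypothesis("the kat sat", acoustic_logprob=-11.8))  # lower
```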
VahidooX commented 2 years ago

  1. This LM ("3-gram.pruned.1e-7.arpa") is intended for char-level models and does not work with BPE models. KenLM itself only supports word-level LMs. The beam search decoder we use handles this: it takes a word-level KenLM model and incorporates its scores during decoding, much as in the sketch above.

  2. How about first trying that script with decoding_mode=beamsearch or greedy to make sure everything else is OK? Have you played with the alpha and beta parameters? Was encoding_level in train_kenlm.py set to "subword" when you ran it?
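To act on those suggestions, here is a hedged sketch of a small grid search over `beam_alpha`/`beam_beta` with the tutorial's `eval_beamsearch_ngram.py` script; the script path, flag names, and file names below are assumptions and may differ across NeMo versions.

```python
# Hypothetical alpha/beta sweep; adjust paths and flags to match
# the eval_beamsearch_ngram.py in your NeMo version.
import itertools
import subprocess

for alpha, beta in itertools.product([0.5, 1.0, 1.5, 2.0], [0.0, 0.5, 1.0, 1.5]):
    subprocess.run(
        [
            "python", "scripts/asr_language_modeling/ngram_lm/eval_beamsearch_ngram.py",
            "--nemo_model_file", "conformer_ctc_bpe.nemo",  # assumed checkpoint name
            "--input_manifest", "dev_manifest.json",        # assumed eval manifest
            "--kenlm_model_file", "bpe_3gram.binary",       # LM built by train_kenlm.py
            "--decoding_mode", "beamsearch_ngram",
            "--beam_width", "128",
            "--beam_alpha", str(alpha),
            "--beam_beta", str(beta),
        ],
        check=True,
    )
```

Running the same script once with greedy decoding and once with plain beam search (no n-gram LM) isolates whether the LM itself is what drives the WER up, which is what the suggestion above is getting at.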

leminhnguyen commented 5 months ago

Hi @dungnguyen98, have you successfully trained KenLM with BPE?
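For context, a minimal sketch of the "subword" encoding idea behind training KenLM for BPE models: each token id is shifted by an offset (100 matches the TOKEN_OFFSET default documented for NeMo's n-gram LM scripts) and written as a single unicode character, so the word-oriented KenLM toolchain can model token sequences. The stub tokenizer and the space-separated layout are assumptions for illustration; the real pipeline uses the .nemo model's own tokenizer.

```python
TOKEN_OFFSET = 100  # documented default in NeMo's n-gram LM scripts

class StubTokenizer:
    """Hypothetical stand-in for the model's SentencePiece/BPE tokenizer."""
    def text_to_ids(self, text: str) -> list:
        return [ord(c) % 20 for c in text]  # fake token ids, demo only

def encode_for_kenlm(text: str, tokenizer) -> str:
    ids = tokenizer.text_to_ids(text)
    # One printable unicode char per BPE token; KenLM then treats each
    # token as a "word" and an utterance as a sequence of tokens.
    return " ".join(chr(t + TOKEN_OFFSET) for t in ids)

print(encode_for_kenlm("the cat", StubTokenizer()))
```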