NVIDIA / NeMo

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)
https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html
Apache License 2.0

Add Word Level Language Model to ASR Models #3382

Closed: ahkarami closed this issue 2 years ago

ahkarami commented 2 years ago

Hi, I have a question. How can one add a word-level pre-trained language model (e.g., BERT or DistilBERT from HuggingFace) to an ASR model (e.g., character-based like QuartzNet or token-based like Citrinet) at inference time? Best

okuchaiev commented 2 years ago

@titu1994 any suggestions here?

VahidooX commented 2 years ago

You may use the following script to use pretrained HuggingFace models as a neural rescorer with ASR models:

https://github.com/NVIDIA/NeMo/blob/main/scripts/asr_language_modeling/neural_rescorer/eval_neural_rescorer.py#L179

You can find more info here in the docs:

https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/asr/asr_language_modeling.html#neural-rescoring
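
For reference, the documented invocation looks roughly like this (argument names are taken from the linked docs; verify them against the script's argparse before running, as they may change between releases):

```bash
# --lm_model: a HuggingFace model name (e.g. gpt2) or a .nemo LM checkpoint
# --beams_file: candidates produced by beam search decoding
# --alpha/--beta: neural LM weight and length penalty (rescorer_alpha/rescorer_beta)
python eval_neural_rescorer.py \
    --lm_model=gpt2 \
    --beams_file=beams.tsv \
    --beam_size=128 \
    --eval_manifest=test_manifest.json \
    --batch_size=256 \
    --alpha=0.05 \
    --beta=1.1
```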

This script does not support MLM-based models like BERT, as they are not efficient to use as LMs. You may find more detail on why in this discussion thread:

https://github.com/NVIDIA/NeMo/discussions/2572

I suggest trying auto-regressive models like gpt2 or transfo-xl-wt103 instead of BERT.
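
To make the distinction concrete, here is a minimal sketch of scoring a candidate transcript with an auto-regressive LM via the HuggingFace transformers API (gpt2 is just an example; a masked LM like BERT would instead need one forward pass per masked position, which is what makes it inefficient as a rescorer):

```python
# Minimal sketch: sentence log-likelihood under an auto-regressive LM.
# One forward pass scores the whole sentence; an MLM like BERT would
# need a separate pass for every masked position.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def sentence_log_likelihood(text: str) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels=input_ids, HF returns the mean cross-entropy
        # over the seq_len - 1 predicted tokens.
        loss = model(input_ids=ids, labels=ids).loss
    return -loss.item() * (ids.size(1) - 1)  # summed log-probability

print(sentence_log_likelihood("the cat sat on the mat"))
```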

ahkarami commented 2 years ago

Thank you very much for your complete explanation. I have just two more questions: 1- Can one use LSTM-based language models (e.g., AWD-LSTM or ULMFiT) for ASR language modeling in NeMo? 2- I think that, for a specific domain, the transfo-xl-wt103 LM generally has the best accuracy (compared to N-gram LMs and GPT-2). Am I correct? Also, from the viewpoint of generality, which one is better? For example, if one wants to prepare a semi-general ASR model with an LM that has appropriate accuracy across several domains, which is better? Best

VahidooX commented 2 years ago

1- Can one use LSTM-based language models (e.g., AWD-LSTM or ULMFiT) for ASR language modeling in NeMo? You can use any LM that is capable of estimating the likelihood of a sentence, but you may need to update eval_neural_rescorer.py to call it properly.
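
As a sketch of what "estimating the likelihood of a sentence" has to provide, consider the following interface, which is illustrative only (SentenceScorer, score, and rescore are hypothetical names, not NeMo APIs); the alpha/beta combination mirrors the rescoring formula described in the docs linked above:

```python
# Illustrative only: the minimal contract a custom LM (e.g. an LSTM LM)
# must satisfy to be plugged into a rescoring script. Names here are
# hypothetical, not NeMo APIs.
from typing import Protocol

class SentenceScorer(Protocol):
    def score(self, text: str) -> float:
        """Return log P(text) under the language model."""
        ...

def rescore(candidates: list[tuple[str, float]],
            lm: SentenceScorer, alpha: float, beta: float) -> str:
    """Pick the best candidate: beam-search score + weighted LM score
    + a length penalty, as in NeMo's neural rescoring setup."""
    def combined(cand: tuple[str, float]) -> float:
        text, am_score = cand
        return am_score + alpha * lm.score(text) + beta * len(text.split())
    return max(candidates, key=combined)[0]
```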

2- I think that, for a specific domain, the transfo-xl-wt103 LM generally has the best accuracy (compared to N-gram LMs and GPT-2). Am I correct? Also, from the viewpoint of generality, which one is better? For example, if one wants to prepare a semi-general ASR model with an LM that has appropriate accuracy across several domains, which is better?

We already have a pretrained Transformer LM (GPT-style) trained on English here: https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/asrlm_en_transformer_large_ls . This model can be better for ASR compared to transfo-xl-wt103; it depends on your domain and evaluation set.

Rescoring can be slightly better than an N-gram LM, but not necessarily; the best results are achieved when the two are used together. I suggest starting with an N-gram LM, as it is very fast and easy to train and does not increase inference time significantly. If you want to use rescoring, you need to perform beam search decoding anyway, and adding an N-gram LM to that beam search would not increase the inference time significantly either. In our experiments on LibriSpeech, pretrained models like transfo-xl-wt103 were worse than our own pretrained model trained on the LibriSpeech LM text corpus. LMs trained on general text may not show very promising results when evaluated on a specific domain.
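
For the suggested N-gram starting point, NeMo ships a training script under scripts/asr_language_modeling/ngram_lm/; a typical run looks roughly like the following (argument names from the ASR language modeling docs of that era; check the script's argparse before running):

```bash
# Train a KenLM N-gram on the ASR model's tokenization, for use
# during beam search decoding.
python train_kenlm.py \
    --nemo_model_file asr_model.nemo \
    --train_file lm_corpus.txt \
    --kenlm_bin_path /path/to/kenlm/build/bin \
    --kenlm_model_file lm.binary \
    --ngram_length 6
```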

ahkarami commented 2 years ago

Dear @VahidooX, thank you for your complete answers. Best