NVIDIA / NeMo

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)
https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html
Apache License 2.0

[Question] Is it possible to implement BERT in ASR? #1733

Closed lodm94 closed 3 years ago

lodm94 commented 3 years ago

Hi all, I am working on a QuartzNet model with a 5-gram beam search LM for decoding the CTC matrix. I was wondering whether it would be possible to implement a BERT-based LM on top of my ASR model.

If I simply mask the incorrectly transcribed word and then run BERT, the word would of course be replaced with a new one, but not necessarily the correct one. BERT could replace a misspelled word with a totally different word that fits the context just as well. I guess BERT would need to be fed the audio waveform information to replace the word with the correct one rather than with a different one.

Is this possible? Any suggestions on related work?
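To illustrate the pitfall described above, here is a minimal toy sketch (all vocabulary and scores are invented for illustration, not from any real model): a context-only masked LM picks the most fluent word for the masked slot without any access to the audio, so a misrecognized word can be "corrected" into a plausible but acoustically wrong word.

```python
# Toy stand-in for a masked LM: plausibility of candidates in the
# context "the weather is ___ today". Scores are invented.
CONTEXT_SCORES = {
    "nice": 0.5, "cold": 0.3, "mild": 0.15, "mice": 0.001,
}

def fill_mask(candidates):
    """Return the candidate a context-only LM would pick for the slot."""
    return max(candidates, key=lambda w: CONTEXT_SCORES.get(w, 0.0))

# ASR output: "the weather is mice today" (acoustically close to "mild").
# A context-only LM picks the most fluent word for the slot...
best = fill_mask(["nice", "cold", "mild", "mice"])
print(best)  # "nice" -- fluent, but need not match what was said
```

Because the LM never sees the waveform, it cannot prefer the acoustically closer "mild" over the more probable "nice", which is exactly the failure mode raised in the question.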

okuchaiev commented 3 years ago

Doing this for every single step of the beam search process would be too computationally expensive. Instead, you can re-score the candidate beams produced by beam search, as we did with Transformer-XL in this paper: https://arxiv.org/pdf/1904.03288.pdf
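A minimal sketch of the n-best rescoring idea suggested above: combine each hypothesis's acoustic (beam search) score with an external LM score via a log-linear interpolation. The `alpha`/`beta` weights and the toy unigram LM here are invented placeholders; in practice the LM score would come from a neural LM such as Transformer-XL or BERT, and the weights would be tuned on a dev set.

```python
import math

# Toy unigram LM standing in for a neural LM (probabilities invented).
TOY_UNIGRAM = {"the": 0.2, "cat": 0.1, "sat": 0.1, "on": 0.1,
               "mat": 0.05, "sad": 0.01}

def lm_logprob(sentence, unk=1e-4):
    """Log-probability of a transcript under the toy LM."""
    return sum(math.log(TOY_UNIGRAM.get(w, unk)) for w in sentence.split())

def rescore(nbest, alpha=0.5, beta=0.1):
    """Pick the best hypothesis from an n-best list.

    nbest: list of (transcript, acoustic_logprob) pairs from beam search.
    Final score = acoustic + alpha * LM + beta * word count (length bonus).
    """
    scored = [(text, am + alpha * lm_logprob(text) + beta * len(text.split()))
              for text, am in nbest]
    return max(scored, key=lambda s: s[1])[0]

# The acoustically top hypothesis contains "sad"; the LM prefers "sat".
nbest = [("the cat sad on the mat", -4.0),
         ("the cat sat on the mat", -4.2)]
print(rescore(nbest))  # "the cat sat on the mat"
```

Because rescoring touches only the final n-best list rather than every partial beam, the LM runs once per hypothesis, which keeps the cost independent of the number of beam search steps.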