NVIDIA / NeMo

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)
https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html
Apache License 2.0

ASR accuracy on single words #541

Closed bill-kalog closed 3 years ago

bill-kalog commented 4 years ago

Hi,

Have you noticed reduced accuracy when running inference on single-word audio files with QuartzNet or Jasper, compared to longer sentences? Do you think the kernel size might be affecting things?

okuchaiev commented 4 years ago

Context helps with recognition even with "greedy" decoding. Jasper and QuartzNet implicitly learn pretty good language models, so it is much easier for them to correctly spell the word "constitution" when it appears as part of the phrase "founding fathers wrote constitution" than when it is spoken in isolation.
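
For readers unfamiliar with the term, here is a minimal sketch of what "greedy" decoding means for a CTC-trained model such as Jasper or QuartzNet. It is not NeMo's implementation; the function name `greedy_ctc_decode`, the toy vocabulary, and the blank index are all assumptions made for illustration. The decoder only takes the most likely symbol per frame, collapses repeats, and drops blanks, so any benefit from context has to come from the per-frame scores the acoustic model produces.

```python
from typing import List, Sequence


def greedy_ctc_decode(
    frame_scores: Sequence[Sequence[float]],
    vocab: Sequence[str],
    blank_id: int,
) -> str:
    """Greedy (best-path) CTC decoding over per-frame scores.

    frame_scores: one row of scores per audio frame, one column per symbol.
    vocab: symbol for each column (the entry at blank_id is never emitted).
    blank_id: column index of the CTC blank symbol (hypothetical layout).
    """
    # Step 1: pick the highest-scoring symbol independently for each frame.
    best_ids: List[int] = [
        max(range(len(frame)), key=lambda i: frame[i]) for frame in frame_scores
    ]

    # Step 2: collapse consecutive repeats, then drop blanks.
    chars: List[str] = []
    prev = None
    for idx in best_ids:
        if idx != prev and idx != blank_id:
            chars.append(vocab[idx])
        prev = idx
    return "".join(chars)


# Toy example: 3-symbol vocabulary with the blank in the last column.
vocab = ["a", "b", "_"]
frames = [
    [0.8, 0.1, 0.1],  # a
    [0.7, 0.2, 0.1],  # a (repeat, collapsed)
    [0.1, 0.1, 0.8],  # blank (separates the two a's)
    [0.6, 0.3, 0.1],  # a
    [0.2, 0.7, 0.1],  # b
    [0.1, 0.8, 0.1],  # b (repeat, collapsed)
]
print(greedy_ctc_decode(frames, vocab, blank_id=2))  # prints "aab"
```

Note that nothing in this decoding step looks at neighbouring words. With a single-word clip the network simply has less surrounding audio to inform each frame's scores, which is consistent with the explanation above.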