NVIDIA / NeMo

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)
https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html
Apache License 2.0
12.01k stars 2.5k forks

How to reduce the vocabulary for a QuartzNet model? #4245

Closed leonardltk closed 2 years ago

leonardltk commented 2 years ago

For a simple script like this:


#!/usr/bin/env python
import nemo.collections.asr as nemo_asr
import ctc_decoders

import numpy as np
import librosa

import os, sys, pdb

## Instantiate model
DetailsDict = nemo_asr.models.EncDecCTCModel.list_available_models()
asr_model = nemo_asr.models.EncDecCTCModel.from_pretrained(model_name='QuartzNet15x5Base-En', strict=False)

## Test audio clip
AUDIO_FILENAME = 'test.wav'
signal, sample_rate = librosa.load(AUDIO_FILENAME, sr=None)

## Convert our audio sample to text
files = [AUDIO_FILENAME]
transcript = asr_model.transcribe(paths2audio_files=files)[0]
print(f'Transcript: "{transcript}"')

It performs well. However, if I change the problem to predicting only a fixed set of words, for example "yes" and "no", how should I adjust the language model part of it?

titu1994 commented 2 years ago

If you only need to predict a few words, you should train a much more lightweight speech classification model such as MatchboxNet. If you absolutely want to use QuartzNet, which is a speech recognition model, you can try several things: fine-tuning on those specific words, using an LM trained primarily on variations of those words, or using a newer technique such as adapters to train on a very small dataset of those words.

If you're using Riva, you can also use word boosting to try to force your model to predict those words. But know that, by design, speech recognition will never be as efficient or as accurate as specialized speech classification models.

leonardltk commented 2 years ago

Thanks @titu1994 for the feedback!

Yes, I'm definitely thinking of trying a lighter-weight model next time. For now, I'm interested in how to perform the word boosting method that you mentioned. Is there a tutorial or resource you could point me to for that?

leonardltk commented 2 years ago

The reason is that the sentences I want to predict might be "yes it's me" or "no it's not me", so the overall dictionary might be dynamic throughout my experimentation.

titu1994 commented 2 years ago

NeMo does not support word boosting. The Facebook Flashlight decoder supports it in the Riva framework; you can export your NeMo QuartzNet model for use in that framework.

leonardltk commented 2 years ago

Oh, I see, thanks! In that case, perhaps we can have word boosting as a feature request?

titu1994 commented 2 years ago

We won't be supporting that. It has a very messy C++ codebase dependency, which we don't want to introduce to NeMo.

VahidooX commented 2 years ago

For word boosting, your other option is to use this decoder with NeMo models: https://github.com/kensho-technologies/pyctcdecode
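
As a rough illustration of that suggestion, here is a minimal sketch of hotword boosting with pyctcdecode on top of the QuartzNet model from the original script. It assumes a NeMo version from around the time of this thread, where `transcribe` accepts `paths2audio_files` and `logprobs=True`; newer releases may use different arguments. The helper names `normalize_hotwords` and `transcribe_with_hotwords` are made up for this example, and the `hotword_weight` value of 10.0 is only a starting point to tune, not a recommended setting.

```python
from typing import List, Sequence


def normalize_hotwords(phrases: Sequence[str], vocabulary: Sequence[str]) -> List[str]:
    """Lowercase each phrase, drop characters the model cannot emit,
    collapse whitespace, and remove duplicates while keeping order.
    (Hypothetical helper, not part of NeMo or pyctcdecode.)"""
    allowed = set(vocabulary)
    cleaned: List[str] = []
    for phrase in phrases:
        text = "".join(ch for ch in phrase.lower() if ch in allowed)
        text = " ".join(text.split())
        if text and text not in cleaned:
            cleaned.append(text)
    return cleaned


def transcribe_with_hotwords(audio_path: str,
                             phrases: Sequence[str],
                             hotword_weight: float = 10.0) -> str:
    """Decode one audio file with beam search that favors the given phrases.
    (Hypothetical helper; heavy imports are kept local so the helper above
    can be used without NeMo installed.)"""
    import nemo.collections.asr as nemo_asr
    from pyctcdecode import build_ctcdecoder

    asr_model = nemo_asr.models.EncDecCTCModel.from_pretrained(
        model_name="QuartzNet15x5Base-En"
    )
    # Character labels of the CTC output layer (the blank is handled by pyctcdecode).
    labels = list(asr_model.decoder.vocabulary)

    # Assumption: this NeMo version returns per-frame log probabilities
    # when logprobs=True is passed.
    logits = asr_model.transcribe(paths2audio_files=[audio_path], logprobs=True)[0]

    # No KenLM model is passed here, so only the hotwords bias the search.
    decoder = build_ctcdecoder(labels)
    return decoder.decode(
        logits,
        hotwords=normalize_hotwords(phrases, labels),
        hotword_weight=hotword_weight,
    )
```

Because the hotword list is passed at decode time, it can change between calls without retraining, for example `transcribe_with_hotwords("test.wav", ["yes it's me", "no it's not me"])`. Note that boosting only biases the beam search toward those phrases; it does not restrict the output to them.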