Closed DieguJota closed 2 years ago
Most NeMo models are not going to do well with significant background music or noise; they are not trained to be robust to such environments. You may try augmenting your training data with music clips, but it will most likely degrade ASR WER significantly.
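If you do want to try the augmentation route, the core step is mixing a music clip into each speech clip at a chosen signal-to-noise ratio. Below is a minimal, dependency-free sketch of that arithmetic; it is not NeMo's own augmentation code, and the names `mean_power` and `mix_at_snr` are made up for illustration. It assumes both signals are plain sequences of float samples at the same sample rate.

```python
import math


def mean_power(samples):
    """Mean power of a signal: the average of its squared samples."""
    return sum(s * s for s in samples) / len(samples)


def mix_at_snr(speech, music, snr_db):
    """Add music to speech, scaled so the speech-to-music power ratio is snr_db.

    Hypothetical helper for illustration; assumes equal sample rates.
    """
    if not speech or not music:
        raise ValueError("speech and music must be non-empty")

    # Loop the music clip if it is shorter than the speech, then trim it.
    repeats = -(-len(speech) // len(music))  # ceiling division
    music = (list(music) * repeats)[: len(speech)]

    p_speech = mean_power(speech)
    p_music = mean_power(music)
    if p_music == 0:
        raise ValueError("music clip is silent")

    # Solve snr_db = 10 * log10(p_speech / (gain**2 * p_music)) for gain.
    gain = math.sqrt(p_speech / (p_music * 10 ** (snr_db / 10)))
    return [s + gain * m for s, m in zip(speech, music)]


# Synthetic stand-ins for real audio: two sine tones.
speech = [math.sin(0.1 * i) for i in range(1000)]
music = [math.sin(0.37 * i) for i in range(300)]
noisy = mix_at_snr(speech, music, snr_db=10.0)
```

With real audio you would do the same thing on NumPy arrays, and probably sample `snr_db` from a range per clip rather than fixing one value.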
You could probably get a better ASR result by applying an unmixing technique (source separation of the voice from the mixed vocal/music signal) before transcription. Training or fine-tuning an ASR model with background music can easily reduce accuracy on clean speech, so it should be done with careful performance monitoring across many different SNRs. There is a good open-source project for source separation and unmixing called Asteroid.
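For the monitoring part, one way to do it is to keep a separate test set per condition (clean, 20 dB, 10 dB, and so on) and compare WER across them after each fine-tuning run. Here is a rough sketch of that bookkeeping in plain Python. The function names and the condition labels are assumptions for illustration, and the WER here is the usual word-level edit distance divided by the number of reference words, without any text normalization.

```python
def edit_distance(ref_words, hyp_words):
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer_by_condition(results):
    """Corpus-level WER per condition.

    `results` maps a condition label (hypothetical, e.g. "clean" or
    "snr_10db") to a list of (reference, hypothesis) transcript pairs.
    """
    report = {}
    for condition, pairs in results.items():
        errors = 0
        ref_len = 0
        for reference, hypothesis in pairs:
            ref_words = reference.split()
            errors += edit_distance(ref_words, hypothesis.split())
            ref_len += len(ref_words)
        if ref_len == 0:
            raise ValueError(f"no reference words for condition {condition!r}")
        report[condition] = errors / ref_len
    return report
```

If the "clean" number goes up while the low-SNR numbers go down, the fine-tuning is trading clean-speech accuracy for noise robustness, which is the regression to watch for.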
Hi,
I've been testing the NeMo transcription engine for some time now and I have a question. How would NeMo behave on speech with background music? Would I need to train with audio that has background music? Or would that be a problem?
For example, a person narrating some text with music playing in the background. I would like to get a transcript of what the person is saying.