NVIDIA / NeMo

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)
https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html
Apache License 2.0

Speech transcription with background music #3923

Closed DieguJota closed 2 years ago

DieguJota commented 2 years ago

Hi,

I've been testing the NeMo transcription engine for some time now and I have a question. How does NeMo behave on speech with background music? Would I need to train on audio that contains background music, or would that cause problems?

For example: a person narrating some text with music or other audio playing in the background. I would like to get a transcript of what the person is saying.

titu1994 commented 2 years ago

Most NeMo models will not do well with significant background music or noise; they are not trained to be robust to such conditions. You can try augmenting your training data with music clips, but that will most likely hurt the ASR word error rate (WER) significantly.
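To make the augmentation idea concrete, here is a minimal sketch of mixing a music clip under a speech clip at a chosen signal-to-noise ratio. It uses plain NumPy rather than any NeMo API; the function name `mix_at_snr` and the assumption that both signals are mono arrays at the same sample rate are illustrative, not something from this thread.

```python
import numpy as np


def mix_at_snr(speech, music, snr_db):
    """Return speech + scaled music so the speech-to-music power ratio is snr_db.

    Hypothetical helper: assumes mono signals at the same sample rate.
    """
    speech = np.asarray(speech, dtype=np.float64)
    music = np.asarray(music, dtype=np.float64)
    if speech.size == 0 or music.size == 0:
        return speech.copy()

    # Loop the music if it is shorter than the speech, then trim to length.
    if music.size < speech.size:
        reps = int(np.ceil(speech.size / music.size))
        music = np.tile(music, reps)
    music = music[: speech.size]

    speech_power = np.mean(speech ** 2)
    music_power = np.mean(music ** 2)
    if speech_power == 0.0 or music_power == 0.0:
        # Nothing meaningful to scale against; leave the speech untouched.
        return speech.copy()

    # Solve speech_power / (gain**2 * music_power) = 10 ** (snr_db / 10).
    gain = np.sqrt(speech_power / (music_power * 10.0 ** (snr_db / 10.0)))
    return speech + gain * music
```

Lower `snr_db` values mean louder music relative to the voice. The output is not clipped or normalized, so it may need rescaling before being written to a fixed-point audio file.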

tango4j commented 2 years ago

You would likely get better ASR results by first applying a source separation ("unmix") technique, which separates the vocals from the music in the mixed audio. Training or fine-tuning an ASR model on speech with background music can easily reduce accuracy on clean speech, so it should be done with careful performance monitoring across many different signal-to-noise ratios (SNRs). Asteroid is a good open-source project for source separation.
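As a rough sketch of the monitoring step, the snippet below computes a corpus-level word error rate for each evaluation condition (for example clean speech and a few SNRs) so that a regression on clean audio shows up next to any gain on noisy audio. The helper names and the condition labels are hypothetical, and the transcripts are assumed to come from whatever ASR model is being evaluated.

```python
def word_edit_distance(reference, hypothesis):
    """Levenshtein distance between two transcripts, counted in words."""
    ref = reference.split()
    hyp = hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


def wer_by_condition(references, hypotheses_by_condition):
    """Map each condition name to its corpus-level WER.

    hypotheses_by_condition: dict of condition name -> list of transcripts,
    aligned one-to-one with references.
    """
    total_words = sum(len(r.split()) for r in references)
    if total_words == 0:
        raise ValueError("references contain no words")
    results = {}
    for condition, hypotheses in hypotheses_by_condition.items():
        if len(hypotheses) != len(references):
            raise ValueError(f"length mismatch for condition {condition!r}")
        errors = sum(
            word_edit_distance(r, h) for r, h in zip(references, hypotheses)
        )
        results[condition] = errors / total_words
    return results
```

Running this on the same test set before and after fine-tuning, with one entry per SNR plus a clean-speech entry, gives a simple table for spotting where accuracy improved or degraded.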