NVIDIA / NeMo

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech).
https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html
Apache License 2.0

[Question] How do you transcribe audio from mic input #1589

Closed: aryamansriram closed this issue 3 years ago

aryamansriram commented 3 years ago

Describe your question

I want to transcribe audio input from a microphone without saving it as an audio file.

Environment overview (please complete the following information)

Environment details:
- OS version: Ubuntu 18.04
- Python version: 3.7
- PyTorch version: 1.7.1

(If an NVIDIA Docker image is used, you don't need to specify these.)

rbracco commented 3 years ago

This is discussed in #1486. There is no built-in way to do it; instead you need to write your own code to convert the mic input into a tensor that you pass directly to the model (numpy.frombuffer might be helpful). You then need to pass that tensor as input to your saved NeMo model (you won't be able to use the transcribe function, but will have to decode manually, as is done in validate in the asr_tutorial). Don't forget to put your model into eval mode (just quartznet.eval() if using QuartzNet): the asr_tutorial has a bug and doesn't do this, which causes training augmentations to be applied during inference.
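A minimal sketch of that flow, assuming pyaudio for capture, the pretrained QuartzNet15x5Base-En checkpoint, and a greedy CTC decode like the one in validate in the asr_tutorial. The exact forward() signature and return values can differ between NeMo versions, so treat this as a starting point rather than a tested recipe:

```python
# Sketch only: NeMo's forward() return values may vary by version.
import numpy as np
import pyaudio
import torch
import nemo.collections.asr as nemo_asr

SAMPLE_RATE = 16000   # QuartzNet expects 16 kHz mono audio
CHUNK = 1024
SECONDS = 5           # length of the recording window

# Load a pretrained model and switch to eval mode so training-time
# augmentations (e.g. SpecAugment) are disabled during inference.
quartznet = nemo_asr.models.EncDecCTCModel.from_pretrained("QuartzNet15x5Base-En")
quartznet.eval()

# Record raw 16-bit PCM from the default microphone.
pa = pyaudio.PyAudio()
stream = pa.open(format=pyaudio.paInt16, channels=1,
                 rate=SAMPLE_RATE, input=True, frames_per_buffer=CHUNK)
frames = [stream.read(CHUNK) for _ in range(SAMPLE_RATE * SECONDS // CHUNK)]
stream.stop_stream()
stream.close()
pa.terminate()

# numpy.frombuffer, as suggested above: bytes -> int16 -> float32 in [-1, 1].
audio = np.frombuffer(b"".join(frames), dtype=np.int16).astype(np.float32) / 32768.0
signal = torch.tensor(audio).unsqueeze(0)        # shape [1, T]
signal_len = torch.tensor([signal.shape[1]])

# Call the model directly instead of transcribe().
with torch.no_grad():
    log_probs, encoded_len, _ = quartznet(
        input_signal=signal, input_signal_length=signal_len)

# Greedy CTC decode: argmax per frame, collapse repeats,
# drop blanks (blank id == len(vocab)).
vocab = quartznet.decoder.vocabulary
blank_id = len(vocab)
pred_ids = log_probs.argmax(dim=-1)[0][: int(encoded_len[0])].tolist()
text, prev = [], blank_id
for idx in pred_ids:
    if idx != blank_id and idx != prev:
        text.append(vocab[idx])
    prev = idx
print("".join(text))
```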

I'm happy to try to help and answer any more questions if you have them. Good luck.

aryamansriram commented 3 years ago

Thanks @rbracco for the reply. Closing this issue now; I'll try to implement it myself.

ridasaleem0 commented 3 years ago

> This is discussed in #1486. There is no built-in way to do it; instead you need to write your own code to convert the mic input into a tensor that you pass directly to the model (numpy.frombuffer might be helpful). [...]

Can you elaborate a little more and share your notebook, if you've already implemented it?

homevk15 commented 1 year ago

> Thanks @rbracco for the reply. Closing this issue now; I'll try to implement it myself.

Hi. Have you implemented the conversion of audio from mic input into a tensor? I need it very much. Can we discuss it?
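The conversion step on its own is small. A sketch, where pcm16_to_tensor is a hypothetical helper name and the input is assumed to be raw 16-bit mono PCM bytes from whatever audio library you use:

```python
import numpy as np
import torch

def pcm16_to_tensor(pcm_bytes: bytes):
    """Convert raw 16-bit mono PCM bytes to a [1, T] float32 tensor in [-1, 1]."""
    audio = np.frombuffer(pcm_bytes, dtype=np.int16).astype(np.float32) / 32768.0
    signal = torch.tensor(audio).unsqueeze(0)    # add a batch dimension
    return signal, torch.tensor([signal.shape[1]])
```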

titu1994 commented 1 year ago

Please do not reply to long-closed threads; open a new one instead.

Also, we have not yet implemented transcription from mic input.