NVIDIA / NeMo

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)
https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html
Apache License 2.0

GPU utilization is zero #3283

Closed awsomecod closed 2 years ago

awsomecod commented 2 years ago

I ran the following code:

import nemo
import nemo.collections.asr as nemo_asr
import nemo.collections.nlp as nemo_nlp
import nemo.collections.tts as nemo_tts

quartznet = nemo_asr.models.EncDecCTCModel.from_pretrained(model_name="stt_en_quartznet15x5").cuda()
English_text = quartznet.transcribe(['~/audio1.wav'])

But GPU utilization is zero according to the output of nvidia-smi.

titu1994 commented 2 years ago

Transcribing a single audio file will probably finish in a second or so, and models don't take that much memory during inference.

awsomecod commented 2 years ago

This audio file is 50 megabytes and 21 minutes long.

awsomecod commented 2 years ago

I guess the inference is running on the CPU instead of the GPU, even though I call cuda() in my code. The inference takes about 70 seconds.
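One way to test that guess is to look at where the model's parameters actually live. The helper below is a minimal sketch, assuming the model behaves like a PyTorch module (a parameters() method whose items have a .device attribute); model_devices is a hypothetical name, not part of NeMo.

```python
def model_devices(model):
    """Return the sorted device names that hold the model's parameters.

    Assumes a PyTorch-style object: model.parameters() yields tensors,
    and each tensor exposes a .device attribute.
    """
    return sorted({str(p.device) for p in model.parameters()})


def is_on_gpu(model):
    """True only if there is at least one parameter and all are on CUDA."""
    devices = model_devices(model)
    return bool(devices) and all(d.startswith("cuda") for d in devices)
```

With the code from the first post, print(model_devices(quartznet)) would be expected to list only CUDA devices if the cuda() call took effect; a "cpu" entry would support the guess above.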

awsomecod commented 2 years ago

I am running on Colab and I have installed Apex, but I receive a warning saying that the Kaldi root is not found. Is this the source of the issue?

titu1994 commented 2 years ago

Kaldi is not used here. As far as I know, Colab can't run nvidia-smi in parallel while your code is executing. Even if it could, you would need to use watch -n 1 nvidia-smi to see the utilization change over time.
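If a terminal with watch is not available, a single reading can also be taken from Python. This is a rough sketch, assuming nvidia-smi is on the PATH; the function names are made up for illustration, and the query fields are the standard ones accepted by nvidia-smi --query-gpu.

```python
import shutil
import subprocess

QUERY = "utilization.gpu,memory.used,memory.total"


def _to_int(field):
    """Convert one CSV field to int; return None for values like [N/A]."""
    try:
        return int(field.strip())
    except ValueError:
        return None


def parse_gpu_csv(text):
    """Parse `--format=csv,noheader,nounits` output, one dict per GPU."""
    rows = []
    for line in text.strip().splitlines():
        util, used, total = (_to_int(f) for f in line.split(","))
        rows.append({"util_percent": util, "used_mib": used, "total_mib": total})
    return rows


def query_gpus():
    """Run nvidia-smi once; return [] if the tool is not installed."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_gpu_csv(result.stdout)
```

A single call only captures one instant, so a reading taken before or after transcribe() can still show zero utilization.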

awsomecod commented 2 years ago

I used to keep typing nvidia-smi over and over, but using watch -n 1 nvidia-smi is definitely more convenient.

I want to know how much GPU RAM I need to run inference on the GPU for a single 50 MB audio file. Can you let me know? I don't know any command besides nvidia-smi that I can run on Colab.

titu1994 commented 2 years ago

It depends on too many factors, such as the length of the audio in seconds and whether you use fp16 or fp32. You'll need to find out some other way. QuartzNet with AMP and 32 GB of GPU memory can easily handle 2-hour-long audio clips.
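One "other way" is to measure it empirically: poll a memory reading in a background thread while the transcription runs and keep the maximum. The class below is a minimal sketch with a hypothetical name (PeakSampler); the reader function is supplied by the caller, so nothing here is specific to NeMo or Colab.

```python
import threading


class PeakSampler:
    """Poll read_fn in a background thread and remember the largest value."""

    def __init__(self, read_fn, interval_s=0.2):
        self.read_fn = read_fn          # callable returning a number or None
        self.interval_s = interval_s    # seconds between samples
        self.peak = None
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        while True:
            value = self.read_fn()
            if value is not None and (self.peak is None or value > self.peak):
                self.peak = value
            # wait() returns True once stop is set, which ends the loop
            if self._stop.wait(self.interval_s):
                break

    def __enter__(self):
        self._thread.start()
        return self

    def __exit__(self, *exc_info):
        self._stop.set()
        self._thread.join()
        return False
```

Wrapping the call as "with PeakSampler(read_fn) as s: quartznet.transcribe([...])" and then reading s.peak gives an approximate peak. For read_fn, one option is a function that shells out to nvidia-smi --query-gpu=memory.used; another, assuming PyTorch is doing the allocating, is torch.cuda.memory_allocated. Sampling can miss short spikes, so treat the result as a lower bound.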