NVIDIA / NeMo

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)
https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html
Apache License 2.0

How to train/Finetune for Custom AD(Activity Detection), like laughing, etc #5082

Closed m-ali-awan closed 1 year ago

m-ali-awan commented 2 years ago

Hi all!

I am developing detection pipelines for sounds such as laughing and crying, and I want them to work like the VAD in the pyannote/NeMo/Silero pipelines. How can I use NeMo for training? I also want to run the pipeline on-device (Android/iOS), so kindly guide me in that direction as well.

Thanks

github-actions[bot] commented 2 years ago

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

m-ali-awan commented 1 year ago

Hi there, kindly point me to any available resources. Thanks.

titu1994 commented 1 year ago

@fayejf

m-ali-awan commented 1 year ago

Hi @fayejf, hope you are fine. Kindly help me with this,

Thanks a lot...

fayejf commented 1 year ago

Hi @m-ali-awan, sorry I missed this issue. Thanks for your question and patience. Yes, you can definitely do this.

Laughing, Crying etc. I want to achieve them, like VAD

Laughing and crying are classified as background (i.e., non-speech) in NeMo VAD models. You can change the decoder labels in the model YAML file to "background", "laughter", and "crying", then fine-tune the decoder on your data (or also fine-tune the encoder, which may give better results). Have a look at the tutorial on fine-tuning the model with changed labels and a frozen (or unfrozen) encoder.
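
For the fine-tuning data, a multi-class manifest could look roughly like the sketch below. It assumes the JSON-lines layout used in the NeMo speech classification tutorials (`audio_filepath`, `offset`, `duration`, `label`); the helper name `build_manifest`, the clip paths, and the durations are hypothetical, so please check the field names against the tutorial before relying on it.

```python
import json
from pathlib import Path

# Label set from the reply above; it should match the labels in the model YAML.
LABELS = ["background", "laughter", "crying"]


def build_manifest(samples, manifest_path):
    """Write one JSON object per line and return the number of entries.

    `samples` is an iterable of (audio_filepath, duration_seconds, label).
    The field names are an assumption based on the NeMo tutorials.
    """
    lines = []
    for audio_filepath, duration, label in samples:
        if label not in LABELS:
            raise ValueError(f"unknown label: {label!r}")
        entry = {
            "audio_filepath": audio_filepath,
            "offset": 0.0,  # start of the segment inside the file, in seconds
            "duration": duration,
            "label": label,
        }
        lines.append(json.dumps(entry))
    Path(manifest_path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return len(lines)


# Hypothetical clips, one per class.
samples = [
    ("clips/room_noise_001.wav", 0.63, "background"),
    ("clips/laugh_001.wav", 0.63, "laughter"),
    ("clips/cry_001.wav", 0.63, "crying"),
]
n_written = build_manifest(samples, "train_manifest.json")
```

The only difference from a binary VAD manifest in this sketch is that `label` takes one of three values instead of two.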

How can I use NeMo for training purposes? Moreover, I want to use the pipeline on device: Android/iOS, so kindly also guide me in that direction as well..

NeMo VAD models support ONNX export; you can find it in the tutorial. The whole VAD pipeline (including post-processing) is also scriptable with torch.jit.

m-ali-awan commented 1 year ago

Hi @fayejf, thanks a lot for sharing the resources. I have a few questions. Right now VAD is a binary classification task, which is why the inference tutorials cover binarization. It would be great if you could share an example or tutorial for multi-class classification, since I am looking at the following labels: Background/BabyLaughing/BabyCrying. Also, a GPU is currently required for training MarbleNet-based models; can I later deploy them on CPU-only hardware? I am planning a mobile deployment later. To start, I will go for a serverless deployment on AWS Lambda, where there is no GPU option and the maximum memory is 10 GB. Will the model work within these constraints?

Thanks for all your help

m-ali-awan commented 1 year ago

Hi @fayejf, here you can see that we are building the manifest file for binary classification: [image]

How would we handle multi-class classification here?

Thanks for your help..

fayejf commented 1 year ago

@m-ali-awan Binarization is used in VAD post-processing to convert a list of speech probabilities such as [0.6, 0.8, 0.1, 0...] into speech timestamps [[start0, end0], [start1, end1]]. You would not need binarization for multi-class classification. Say you have three classes, [speech, background, cry], and the output probabilities after softmax are [0.1, 0.1, 0.8]; you can simply treat this as a cry segment. After model inference, you would see something like [cry, cry, cry, background, background, background, cry, cry, cry].
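
The per-frame decision described above can be sketched in plain Python as an argmax over each softmax output, followed by merging runs of identical labels into timestamped segments. This is only an illustration, not NeMo's post-processing code: the function names `frame_labels` and `to_segments`, the class order, and the 0.5 s frame shift are assumptions.

```python
# Class order from the example in this comment; it must match the model's labels.
LABELS = ["speech", "background", "cry"]


def frame_labels(probs, labels=LABELS):
    """Pick the highest-probability class for each frame (argmax)."""
    return [labels[max(range(len(p)), key=p.__getitem__)] for p in probs]


def to_segments(frame_label_seq, frame_shift=0.01):
    """Merge runs of identical labels into (label, start_sec, end_sec) tuples."""
    segments = []
    start = 0
    n = len(frame_label_seq)
    for i in range(1, n + 1):
        # Close the current run at the end of the sequence or on a label change.
        if i == n or frame_label_seq[i] != frame_label_seq[start]:
            segments.append(
                (frame_label_seq[start],
                 round(start * frame_shift, 3),
                 round(i * frame_shift, 3))
            )
            start = i
    return segments


# Hypothetical softmax outputs: three cry frames, three background, three cry.
probs = [[0.1, 0.1, 0.8]] * 3 + [[0.1, 0.8, 0.1]] * 3 + [[0.1, 0.1, 0.8]] * 3
labels_per_frame = frame_labels(probs)
segments = to_segments(labels_per_frame, frame_shift=0.5)
# segments == [("cry", 0.0, 1.5), ("background", 1.5, 3.0), ("cry", 3.0, 4.5)]
```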

You could consider extending the binarization/post-processing to multi-class, but it's not required.
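
One possible form of such an extension, purely as a sketch, is a sliding-window majority vote over the per-frame labels, which removes isolated one-frame flips before segments are formed. This is a generic smoothing filter and not what NeMo's VAD post-processing does; `smooth_labels` and the window size are hypothetical.

```python
from collections import Counter


def smooth_labels(frame_label_seq, window=5):
    """Replace each frame label with the most common label in a centred window."""
    half = window // 2
    n = len(frame_label_seq)
    smoothed = []
    for i in range(n):
        lo = max(0, i - half)
        hi = min(n, i + half + 1)
        counts = Counter(frame_label_seq[lo:hi])
        best = max(counts.values())
        current = frame_label_seq[i]
        # On a tie, keep the original label so boundaries are not shifted.
        if counts[current] == best:
            smoothed.append(current)
        else:
            smoothed.append(counts.most_common(1)[0][0])
    return smoothed


# A single spurious "background" frame inside a cry segment is smoothed away.
noisy = ["cry", "cry", "background", "cry", "cry"]
cleaned = smooth_labels(noisy, window=3)
# cleaned == ["cry", "cry", "cry", "cry", "cry"]
```

A larger window removes longer glitches but also blurs short genuine events, so the window size would need tuning on real data.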

I ran the inference tutorial (load model, microphone input) on a machine without a GPU and it worked. However, I'm not familiar with AWS Lambda or your deployment environment. Maybe check ONNX?

github-actions[bot] commented 1 year ago

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions[bot] commented 1 year ago

This issue was closed because it has been inactive for 7 days since being marked as stale.