Closed m-ali-awan closed 1 year ago
This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.
Hi there, could you kindly point me to any relevant resources? Thanks.
@fayejf
Hi @fayejf, hope you are doing well. Could you kindly help me with this?
Thanks a lot!
Hi @m-ali-awan, sorry I missed this issue. Thanks for your question and patience. Yes, you could definitely do so.
I want to detect classes like laughing and crying, similar to how VAD detects speech.
Laughing and crying are classified as background (i.e., non-speech) in NeMo VAD models. You can change the labels in the decoder to "background", "laughter", and "crying" in the model YAML file, then fine-tune the decoder (or also the encoder, for better results) on your data. Have a look at the tutorial on how to fine-tune the model after changing the labels, with the encoder frozen (or not).
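As a rough illustration of the steps above, here is a minimal sketch of swapping the binary VAD label set for a three-class one before fine-tuning. This is not actual NeMo code; the dictionary keys are simplified stand-ins for the real YAML config fields.

```python
# Illustrative sketch (not actual NeMo code): adjusting the label set in a
# MarbleNet-style classification config before fine-tuning.
config = {
    "model": {
        "labels": ["background", "speech"],  # original binary VAD labels
        "decoder": {"num_classes": 2},
    }
}

# Replace the binary labels with the three target classes and keep the
# decoder's output size in sync.
new_labels = ["background", "laughter", "crying"]
config["model"]["labels"] = new_labels
config["model"]["decoder"]["num_classes"] = len(new_labels)

# In NeMo you would then fine-tune the decoder on your data, optionally
# keeping the encoder fixed (NeMo modules expose a .freeze() method).
```

The key point is that the decoder's output dimension must match the new number of labels; the encoder can be reused as-is.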
How can I use NeMo for training purposes? Moreover, I want to run the pipeline on device (Android/iOS), so kindly guide me in that direction as well.
NeMo VAD models support ONNX export; you can find it in the tutorial. The whole VAD pipeline (including post-processing) is also torch-jit-scriptable.
Hi @fayejf, thanks a lot for sharing the resources. I have a few considerations. Right now VAD is a binary classification, so the inference tutorials cover binarization. It would be great if you could share an example or tutorial for multi-class classification, since I am looking for the following labels: Background/BabyLaughing/BabyCrying. Moreover, as a GPU is required for training MarbleNet-based models, can I later deploy them on CPU-based hardware? I am planning a mobile deployment later, and initially I will go for serverless deployment using AWS Lambda, which has no GPU option and a maximum of 10 GB memory. Will that be sufficient for these requirements?
Thanks for all your help
Hi @fayejf, here you can see we are building the manifest file for binary classification.
How would we handle multi-class classification here?
Thanks for your help.
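For reference, a multi-class manifest can be sketched as follows. File names and durations are illustrative; each line is one JSON object with `audio_filepath`, `duration`, and `label` keys, as in NeMo's speech-classification manifests, with the `label` field carrying the new class names.

```python
import json

# Hypothetical multi-class manifest entries (one JSON object per line).
entries = [
    {"audio_filepath": "clips/bg_001.wav", "duration": 0.63, "label": "background"},
    {"audio_filepath": "clips/laugh_001.wav", "duration": 0.63, "label": "laughter"},
    {"audio_filepath": "clips/cry_001.wav", "duration": 0.63, "label": "crying"},
]

with open("train_manifest.json", "w") as f:
    for e in entries:
        f.write(json.dumps(e) + "\n")
```

The only change from the binary case is that `label` now takes one of several class names instead of just speech/background.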
@m-ali-awan Binarization is used in VAD post-processing to convert a list of speech probabilities such as [0.6, 0.8, 0.1, ...] into speech timestamps [[start0, end0], [start1, end1]].
You would not need binarization for multi-class classification. Say you have three classes, [speech, background, cry];
the output probabilities will look like [0.1, 0.1, 0.8] after softmax, and you can simply treat that frame as a cry segment. After model inference, you would see something like [cry, cry, cry, background, background, background, cry, cry, cry].
You could consider extending binarization/post-processing to multi-class, but it is not required.
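The per-frame argmax and segment-collapsing described above can be sketched in plain Python (not NeMo code; the class names and frame length are illustrative):

```python
# Sketch: take the argmax class per frame, then collapse consecutive
# identical labels into (label, start, end) segments.
CLASSES = ["speech", "background", "cry"]

def frames_to_segments(probs, frame_sec=0.5):
    # Per-frame argmax over softmax probabilities.
    labels = [CLASSES[max(range(len(p)), key=p.__getitem__)] for p in probs]
    segments = []  # list of (label, start_sec, end_sec)
    for i, lab in enumerate(labels):
        if segments and segments[-1][0] == lab:
            # Extend the current segment to cover this frame.
            segments[-1] = (lab, segments[-1][1], (i + 1) * frame_sec)
        else:
            segments.append((lab, i * frame_sec, (i + 1) * frame_sec))
    return labels, segments

probs = [
    [0.1, 0.1, 0.8],
    [0.2, 0.1, 0.7],
    [0.1, 0.8, 0.1],
    [0.05, 0.05, 0.9],
]
labels, segments = frames_to_segments(probs)
```

This mirrors the [cry, cry, ..., background, ..., cry] sequence described above, with the optional collapsing step producing timestamped segments if you do want VAD-style output.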
I have checked running the inference tutorial (loading the model, microphone input) on a machine without a GPU, and it worked. But I'm not familiar with AWS Lambda or that deployment environment. Maybe look into ONNX?
This issue was closed because it has been inactive for 7 days since being marked as stale.
Hi all!
I am working on developing pipelines for detecting laughing, crying, etc., similar to the VAD pipelines in pyannote/NeMo/Silero. How can I use NeMo for training purposes? Moreover, I want to run the pipeline on device (Android/iOS), so kindly guide me in that direction as well.
Thanks