NVIDIA / NeMo

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)
https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html
Apache License 2.0

How to do scratch training or finetuning for Speaker diarization on my own dataset #4080

Closed saumyaborwankar closed 2 years ago

saumyaborwankar commented 2 years ago

Please help. I was able to find documentation on creating my own dataset in the NeMo format, but I couldn't find any material on using that dataset to train a model or fine-tune a pretrained model.

nithinraok commented 2 years ago

Are you planning to train a speaker model or VAD model or both for speaker diarization?

saumyaborwankar commented 2 years ago

Ideally I want to train both.

nithinraok commented 2 years ago

For speaker embedding: Prepare your manifest using these steps: https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/asr/speaker_recognition/datasets.html#all-other-datasets
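As a rough sketch of what the linked page describes, the manifest is a JSON-lines file with one entry per utterance (audio path, offset/duration, speaker label); the paths and speaker IDs below are placeholders for your own data:

```python
import json

# Minimal speaker-recognition manifest sketch: one JSON object per line.
# duration=None means "use the whole file"; replace paths/labels with your data.
entries = [
    {"audio_filepath": "/data/spk1/utt1.wav", "offset": 0, "duration": None, "label": "speaker_1"},
    {"audio_filepath": "/data/spk2/utt7.wav", "offset": 0, "duration": None, "label": "speaker_2"},
]

with open("train_manifest.json", "w") as f:
    for entry in entries:
        f.write(json.dumps(entry) + "\n")
```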

Then prepare the Hydra configuration file as detailed here: https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/asr/speaker_recognition/configs.html#dataset-configuration

You may also refer to this section for how to use the training script with the Hydra configuration file.
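If it helps, here is a rough programmatic sketch (not the exact commands from the docs) of fine-tuning a pretrained speaker embedding model on your own manifests. The model name, file paths, and config values are placeholders, and the dataset keys mirror the dataset-configuration section linked above, so adjust them to your setup:

```python
import pytorch_lightning as pl
from omegaconf import OmegaConf
from nemo.collections.asr.models import EncDecSpeakerLabelModel

# Start from a pretrained speaker embedding model (example model name from NGC).
speaker_model = EncDecSpeakerLabelModel.from_pretrained(model_name="titanet_large")

# Point the model at your own manifests; keys follow the train_ds/validation_ds
# sections of the speaker recognition config (values here are placeholders).
train_cfg = OmegaConf.create({
    "manifest_filepath": "train_manifest.json",
    "sample_rate": 16000,
    "labels": None,
    "batch_size": 32,
    "shuffle": True,
})
val_cfg = OmegaConf.create({
    "manifest_filepath": "dev_manifest.json",
    "sample_rate": 16000,
    "labels": None,
    "batch_size": 32,
    "shuffle": False,
})
speaker_model.setup_training_data(train_cfg)
speaker_model.setup_validation_data(val_cfg)

# Note: if your speaker set differs from the pretrained one, the decoder /
# number of classes has to be adapted as described in the docs before training.
trainer = pl.Trainer(max_epochs=10)
speaker_model.set_trainer(trainer)
trainer.fit(speaker_model)
```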

For VAD: Prepare your manifest using these steps: https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/asr/speech_classification/datasets.html#speech-command-freesound-for-vad
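The VAD manifest uses the same JSON-lines layout, but the label is a segment class ("speech" or "background") rather than a speaker ID; again, the paths and durations below are placeholders:

```python
import json

# Minimal VAD manifest sketch: short segments labeled "speech" or "background".
entries = [
    {"audio_filepath": "/data/vad/speech_0001.wav", "offset": 0, "duration": 0.63, "label": "speech"},
    {"audio_filepath": "/data/vad/noise_0001.wav", "offset": 0, "duration": 0.63, "label": "background"},
]

with open("vad_train_manifest.json", "w") as f:
    for entry in entries:
        f.write(json.dumps(entry) + "\n")
```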

Then prepare the Hydra configuration file as detailed here: https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/asr/speech_classification/configs.html

Then use this script to train a VAD model.
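Analogous to the speaker model, a hedged sketch of fine-tuning a pretrained MarbleNet VAD checkpoint on the manifest above (model name, paths, and config values are placeholders; training from scratch would instead go through the example script with the Hydra config):

```python
import pytorch_lightning as pl
from omegaconf import OmegaConf
from nemo.collections.asr.models import EncDecClassificationModel

# Start from a pretrained MarbleNet VAD model (example model name from NGC).
vad_model = EncDecClassificationModel.from_pretrained(model_name="vad_marblenet")

# Dataset keys mirror the speech classification config; values are placeholders.
train_cfg = OmegaConf.create({
    "manifest_filepath": "vad_train_manifest.json",
    "sample_rate": 16000,
    "labels": ["background", "speech"],
    "batch_size": 64,
    "shuffle": True,
})
val_cfg = OmegaConf.create({
    "manifest_filepath": "vad_dev_manifest.json",
    "sample_rate": 16000,
    "labels": ["background", "speech"],
    "batch_size": 64,
    "shuffle": False,
})
vad_model.setup_training_data(train_cfg)
vad_model.setup_validation_data(val_cfg)

trainer = pl.Trainer(max_epochs=10)
vad_model.set_trainer(trainer)
trainer.fit(vad_model)
```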

Inference for speaker diarization:

For speaker diarization inference using the trained models, refer to these steps.
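Roughly, once you have both .nemo checkpoints, inference plugs them into the diarizer config and runs clustering-based diarization; the config filename and paths below are placeholders for whatever your setup uses:

```python
from omegaconf import OmegaConf
from nemo.collections.asr.models import ClusteringDiarizer

# Placeholder path; a diarization inference config ships with the NeMo examples.
cfg = OmegaConf.load("offline_diarization.yaml")

cfg.diarizer.manifest_filepath = "diar_manifest.json"  # one entry per audio file to diarize
cfg.diarizer.out_dir = "diar_results"
cfg.diarizer.vad.model_path = "my_vad_model.nemo"                     # VAD model trained above
cfg.diarizer.speaker_embeddings.model_path = "my_speaker_model.nemo"  # speaker model trained above

diarizer = ClusteringDiarizer(cfg=cfg)
diarizer.diarize()  # writes RTTM files with the predicted speaker segments to out_dir
```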