NVIDIA / NeMo

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)
https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html
Apache License 2.0

Train Speaker Diarization model #4533

Closed AMITKESARI2000 closed 1 year ago

AMITKESARI2000 commented 2 years ago

Hello, can you guide me on how to train and fine-tune the speaker diarization model covered in https://github.com/NVIDIA/NeMo/blob/main/tutorials/speaker_tasks/Speaker_Diarization_Inference.ipynb? I was unable to find any documentation on this. Thanks!

ValeryNikiforov commented 2 years ago

Hello, if you want to train a speaker embedding model (not VAD), this guide will be helpful: https://github.com/NVIDIA/NeMo/blob/main/tutorials/speaker_tasks/Speaker_Identification_Verification.ipynb

Pay particular attention to the "Speaker Verification" part; it's important to change the loss there.
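
For reference, a rough sketch of what that training path could look like (not taken verbatim from the notebook): the config path, the `decoder.angular` and `decoder.num_classes` keys, and the manifest paths below are assumptions and may differ between NeMo versions, so check the recognition configs shipped with your install.

```python
# Sketch: fine-tune a speaker embedding model for verification/diarization use.
# Config path, decoder keys, and manifest paths are assumptions -- verify them
# against the recognition configs in your NeMo version.
import pytorch_lightning as pl
from omegaconf import OmegaConf
from nemo.collections.asr.models import EncDecSpeakerLabelModel

# Hypothetical path to a recognition config shipped with the NeMo examples.
config = OmegaConf.load("examples/speaker_tasks/recognition/conf/titanet-large.yaml")

# Point the data loaders at your own manifests (hypothetical paths).
config.model.train_ds.manifest_filepath = "train_manifest.json"
config.model.validation_ds.manifest_filepath = "dev_manifest.json"

# Match the classifier head to the number of speakers in your training set
# (key name assumed; see your config).
config.model.decoder.num_classes = 250

# For verification/diarization embeddings, train with the angular-margin loss;
# this is the "change the loss" note above (key name assumed).
config.model.decoder.angular = True

trainer = pl.Trainer(devices=1, accelerator="gpu", max_epochs=50)
model = EncDecSpeakerLabelModel(cfg=config.model, trainer=trainer)
trainer.fit(model)

# Save the fine-tuned embedding extractor for later use in diarization.
model.save_to("speaker_embedding.nemo")
```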

AMITKESARI2000 commented 1 year ago

Hi, thanks for the reply! I checked out the notebook. I want to train speaker diarization on my custom dataset, but I'm not able to reproduce the steps as they are a bit confusing. Could you please sketch out a rough path for how to train speaker diarization (with non-oracle VAD), or better still, provide an example in Colab? I think it would help others as well if the proper steps were laid out.

nithinraok commented 1 year ago

Offline speaker diarization in NeMo is currently clustering-based, so to train it you need to train a VAD model and a speaker embedding extractor separately, as explained here: https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/asr/speaker_diarization/datasets.html
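
As a rough illustration of how the two trained checkpoints feed the clustering-based diarizer at inference time, here is a minimal sketch; the config file name, the `.nemo` paths, and the exact `diarizer.*` keys are assumptions based on the offline diarization example configs and may differ across NeMo versions.

```python
# Sketch: run clustering-based diarization with your own fine-tuned VAD and
# speaker embedding checkpoints (non-oracle VAD). Paths and config keys are
# assumptions -- check the diarization example configs in your NeMo version.
from omegaconf import OmegaConf
from nemo.collections.asr.models import ClusteringDiarizer

# Hypothetical local copy of the offline diarization inference config.
config = OmegaConf.load("offline_diarization.yaml")

# Manifest listing the audio files to diarize, plus an output directory.
config.diarizer.manifest_filepath = "diarization_manifest.json"
config.diarizer.out_dir = "./diar_outputs"

# Plug in your fine-tuned checkpoints instead of the pretrained NGC models.
config.diarizer.vad.model_path = "vad_finetuned.nemo"                  # non-oracle VAD
config.diarizer.speaker_embeddings.model_path = "speaker_embedding.nemo"

diarizer = ClusteringDiarizer(cfg=config)
diarizer.diarize()  # writes RTTM predictions to out_dir
```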

For training MSDD models, documentation will be added soon.