NVIDIA / NeMo

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)
https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html

Speaker recognition embeddings #1710

Closed: harrypotter90 closed this issue 3 years ago

harrypotter90 commented 3 years ago

I have been using https://github.com/NVIDIA/NeMo/tree/main/tutorials/speaker_recognition.

There is a way to get embeddings for speaker recognition (https://github.com/NVIDIA/NeMo/blob/main/examples/speaker_recognition/spkr_get_emb.py).

But I didn't find any information on how to use those embeddings for speaker recognition.

Do I need to build something else on top of it, or do I just compute cosine similarity?

Please advise.

OmarHory commented 3 years ago

Same question.

nithinraok commented 3 years ago

There are mainly 3 speaker-related tasks:

Speaker Recognition:

Speaker recognition is a task where you train a speaker recognition network with known speaker labels and test on those same known speaker labels. You can either train from scratch or use a pretrained model and fine-tune it for your set of speaker labels. For this purpose, we provide the speaker_reco.py example.

Speaker Verification:

Verification is a task where your speaker labels need not be part of your train set; you just want to know whether two given audio files come from the same speaker or not. For this purpose, we compare the embedding extracted from one audio file (assumed to contain a single speaker) with the embedding extracted from another audio file (which may or may not be the same speaker, but is also assumed to contain a single speaker). This is exactly the purpose of spkr_get_emb.py. Embeddings are extracted based on the provided manifest file and saved in embeddings_dir. One can then compute the cosine similarity between these embeddings to decide whether they come from the same speaker (see the sketch after this list). You may refer to the voxceleb_eval example, where we provide a script to evaluate on the VoxCeleb trial files.

Speaker Diarization:

Speaker Diarization is the task of segmenting audio recordings by speaker label, i.e. determining who spoke when. Diarization combines speaker verification and speaker recognition steps to work out when speakers take turns in an audio file. This is demonstrated in the speaker_diarize.py example (a usage sketch follows at the end of this comment).
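
To make the verification step concrete, here is a minimal sketch of the cosine-similarity comparison described above, assuming the embeddings produced by spkr_get_emb.py have already been loaded as NumPy vectors (the random placeholder vectors, the 512 dimension, and the 0.7 threshold are illustrative assumptions, not NeMo defaults):

```python
import numpy as np

def cosine_similarity(emb1: np.ndarray, emb2: np.ndarray) -> float:
    """Cosine similarity between two speaker embeddings."""
    return float(np.dot(emb1, emb2) / (np.linalg.norm(emb1) * np.linalg.norm(emb2)))

# Placeholders: in practice these would be loaded from the embeddings
# saved in embeddings_dir by spkr_get_emb.py.
emb_a = np.random.randn(512)
emb_b = np.random.randn(512)

score = cosine_similarity(emb_a, emb_b)
threshold = 0.7  # illustrative; tune on held-out trial pairs
print("same speaker" if score >= threshold else "different speakers")
```

In practice, the decision threshold is chosen on a set of known same-speaker/different-speaker trial pairs, such as the VoxCeleb trial files mentioned above.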

Hope this clears up the doubts regarding the usage of the speaker recognition scripts.

For a more detailed explanation, refer to the tutorials provided in https://github.com/NVIDIA/NeMo/tree/main/tutorials/speaker_recognition. A diarization tutorial will be added soon.
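
And for diarization specifically, a minimal usage sketch, assuming a recent NeMo release with the ClusteringDiarizer class (the config path is a placeholder; in practice, start from the YAML shipped alongside the speaker_diarize.py example and point it at your manifest):

```python
from omegaconf import OmegaConf
from nemo.collections.asr.models import ClusteringDiarizer

# Placeholder path: copy the config from the speaker_diarize.py example
# and fill in your own manifest and output directory.
cfg = OmegaConf.load("diarization_config.yaml")

diarizer = ClusteringDiarizer(cfg=cfg)
diarizer.diarize()  # writes RTTM files with the "who spoke when" segments
```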

JINHXu commented 3 years ago

Sorry if this is a dumb question, but I got a little confused going through the tutorials: was the spkr.nemo restored at the embedding-extraction stage for speaker verification supposed to be a separate verification model trained on another dataset, rather than the speaker recognition model spkr.nemo trained in the earlier steps of the tutorial?

JINHXu commented 3 years ago

By the way, I think get_hi-mia_data.py can no longer be found in the repo.

JINHXu commented 3 years ago

As extracting embeddings and calculating cosine similarity only allows verifying whether two audio files are from the same speaker, I am wondering if there is a way to predict labels instead, since this is not explicitly explained in the tutorial or anywhere else in the example scripts.

OmarHory commented 3 years ago

Same question.

nithinraok commented 3 years ago

For extracting embeddings you could use any spkr.nemo model, but a model trained with angular softmax loss performs better for verification purposes than one trained with cross-entropy loss. Hence the suggestion to train with angular softmax loss for verification, using data very similar to your test audio samples (a toy sketch of this loss follows below).
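
For intuition, here is a toy PyTorch sketch of an additive angular margin softmax loss (ArcFace-style) of the kind referred to above; this is an illustrative stand-alone implementation, not NeMo's actual angular softmax loss class, and the scale/margin values are common choices rather than NeMo's defaults:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AngularMarginLoss(nn.Module):
    """Toy additive angular margin softmax (ArcFace-style).

    Logits are cosines between L2-normalized embeddings and L2-normalized
    class weights; a margin m is added to the target-class angle and the
    result is scaled by s before the usual cross-entropy.
    """

    def __init__(self, emb_dim: int, num_classes: int,
                 scale: float = 30.0, margin: float = 0.2):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_classes, emb_dim))
        self.scale = scale
        self.margin = margin

    def forward(self, embeddings: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # Cosine similarity between each embedding and each class weight.
        cosine = F.linear(F.normalize(embeddings), F.normalize(self.weight))
        theta = torch.acos(cosine.clamp(-1 + 1e-7, 1 - 1e-7))
        # Add the angular margin only to the target class.
        target = F.one_hot(labels, cosine.size(1)).bool()
        logits = torch.where(target, torch.cos(theta + self.margin), cosine)
        return F.cross_entropy(self.scale * logits, labels)
```

The margin pushes embeddings of the same speaker to cluster more tightly on the unit hypersphere, which is why cosine-similarity verification works better with this loss than with plain cross-entropy.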

You may use the script mentioned in this discussion to infer your labels: https://github.com/NVIDIA/NeMo/issues/1393

hi_mia_data_script -> https://github.com/NVIDIA/NeMo/blob/rir_noise_aug_2/scripts/get_hi-mia_data.py (I will add it to the main branch)

JINHXu commented 3 years ago

Thanks, Nithin!

OmarHory commented 3 years ago

I have a question, if you do not mind: how can I run inference on a single audio file to extract its speaker embeddings? I have looked around and couldn't find an example where only one audio file is inferred.

JINHXu commented 3 years ago

@OmarHory I think embedding extraction and label inference are two separate tasks:

Extracting embeddings is, in general, part of a speaker verification workflow, where one should train a speaker verification model. (To my understanding, the embeddings can only be used to calculate cosine similarity to verify whether a pair of recordings come from the same speaker; I did an experiment using the extracted embeddings here.)

Label inference should be done after a speaker recognition model has been trained and saved.

If the goal is to infer the speaker label, one does not have to extract speaker embeddings; Nithin created a script for inference here. A closed issue about speaker inference is here, for reference.

With the script, the inferences will be written to a JSON file like this:

{"audio_filepath": "/Users/xujinghua/test_jxu/jxutest2.wav", "duration": 7.237369614512471, "label": "jxu", "infer": "mdmc"}
nithinraok commented 3 years ago

Renamed get_spkr_emb.py to extract_speaker_embeddings.py to make things clearer. Added the speaker_reco_infer.py script on request. @OmarHory, please refer to the relevant docstrings in the scripts. #1793

OmarHory commented 3 years ago

> Renamed get_spkr_emb.py to extract_speaker_embeddings.py to make things clearer. Added the speaker_reco_infer.py script on request. @OmarHory, please refer to the relevant docstrings in the scripts. #1793

Exactly what I needed, much thanks! @nithinraok