Same question.
There are mainly 3 speaker-related tasks:
Speaker Recognition is a task where you train a speaker recognition network with known speaker labels and test on those same known speaker labels. You can either train from scratch or use a pretrained model and finetune it for your set of speaker labels. For this purpose, we provide the speaker_reco.py example.
Verification is a task where your speaker labels need not be part of your train set; you just want to know whether two given audio files are from the same speaker or not. For this purpose, we compare the embedding extracted from one audio file (assumed to contain a single speaker) with the embedding extracted from another audio file (which may or may not be the same speaker, but is also assumed to contain a single speaker). This is exactly the purpose of spkr_get_emb.py: embeddings are extracted based on the manifest file provided and are saved in embeddings_dir. One can then use these embeddings to calculate cosine similarity and decide whether they are from the same speaker or not, as in the sketch below. You may refer to the voxceleb_eval example, where we provide a script to evaluate on the VoxCeleb trial files.
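As a concrete illustration of that last step, here is a minimal sketch of the cosine-similarity comparison (the .npy file paths and the 0.7 decision threshold are hypothetical; how the embeddings are actually serialized in embeddings_dir depends on your NeMo version):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two 1-D embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical paths: load the two saved embeddings back as 1-D vectors,
# in whatever format your NeMo version wrote them to embeddings_dir.
emb_a = np.load("embeddings_dir/utt_a.npy")
emb_b = np.load("embeddings_dir/utt_b.npy")

score = cosine_similarity(emb_a, emb_b)
# The decision threshold is not fixed; tune it on a held-out trial list.
print("same speaker" if score >= 0.7 else "different speakers")
```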
Speaker Diarization is the task of segmenting audio recordings by speaker label, to know who spoke when. Diarization combines speaker verification and speaker recognition steps to determine when speakers take turns in an audio file; this is achieved and explained in the example speaker_diarize.py.
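To make the "who spoke when" idea concrete, the clustering step at the heart of diarization can be sketched as below. This is just the concept on dummy data, not what speaker_diarize.py actually implements, and the distance threshold is illustrative:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Assume one embedding per short window of a single recording,
# shaped (n_windows, emb_dim); random data stands in for real embeddings.
rng = np.random.default_rng(0)
window_embeddings = rng.normal(size=(20, 192))

# With an unknown number of speakers, cluster by a distance threshold
# rather than a fixed cluster count (the 15.0 value is illustrative).
clustering = AgglomerativeClustering(n_clusters=None, distance_threshold=15.0)
labels = clustering.fit_predict(window_embeddings)
print(labels)  # one speaker index per window -> "who spoke when"
```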
Hoping this clears up doubts regarding the usage of the speaker recognition scripts.
For a more detailed explanation, refer to the tutorials provided at https://github.com/NVIDIA/NeMo/tree/main/tutorials/speaker_recognition. A diarization tutorial will be added soon.
Sorry if this is a dumb question, but I got a little confused going through the tutorials: is the spkr.nemo restored in the stage of extracting embeddings for speaker verification supposed to be a separate verification model trained on another dataset, rather than the speaker recognition model spkr.nemo trained in the earlier steps of the tutorial?
By the way, I think get_hi-mia-data.py can no longer be found in the repo.
Since extracting embeddings and calculating cosine similarity only allows verifying whether two audio files are from the same speaker, I am wondering if there is a way to predict labels instead? This was not explicitly explained in the tutorial or anywhere else in the example scripts.
Same question.
For extracting embeddings you could use any spkr.nemo model, but a model trained with angular softmax loss performs better than one trained with cross-entropy loss for verification purposes. Hence, for verification, it was suggested to train with angular softmax loss, using data very similar to your test audio samples.
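For reference, angular softmax losses of this kind add a margin to the target class's angle before the softmax. Below is a minimal PyTorch sketch of an additive angular margin (ArcFace-style) loss; the scale and margin values are illustrative, and NeMo's own implementation may differ in details:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AngularMarginLoss(nn.Module):
    """Additive angular margin softmax over speaker classes (sketch)."""

    def __init__(self, emb_dim: int, num_classes: int,
                 scale: float = 30.0, margin: float = 0.2):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_classes, emb_dim))
        self.scale = scale
        self.margin = margin

    def forward(self, embeddings: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # Cosine between L2-normalized embeddings and class weight vectors.
        cos = F.linear(F.normalize(embeddings), F.normalize(self.weight))
        theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
        # Add the margin only to the target class's angle, then rescale.
        target = F.one_hot(labels, num_classes=cos.size(1)).bool()
        logits = torch.where(target, torch.cos(theta + self.margin), cos)
        return F.cross_entropy(self.scale * logits, labels)

# Usage sketch:
# criterion = AngularMarginLoss(emb_dim=192, num_classes=n_speakers)
# loss = criterion(batch_embeddings, batch_labels)
```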
You may use the script mentioned in this discussion to infer your labels: https://github.com/NVIDIA/NeMo/issues/1393. hi_mia_data_script -> https://github.com/NVIDIA/NeMo/blob/rir_noise_aug_2/scripts/get_hi-mia_data.py (I will add it to the main branch).
Thanks, Nithin!
I have a question, if you do not mind: how can I run inference on a single audio file to extract its speaker embeddings? I have looked around and couldn't find an example that infers only one audio file.
@OmarHory I think embedding extraction and label inference are two separate tasks:
extracting embeddings is, in general, part of a speaker verification workflow where one should train a speaker verification model (to my understanding, the embeddings can only be used to calculate cosine similarity to verify whether a pair of recordings come from the same speaker; I did an experiment using the extracted embeddings here),
and label inference should be done after a speaker recognition model has been trained and saved.
If the goal is to infer the speaker label, one does not have to extract speaker embeddings; Nithin created a script for inference here. A closed issue about speaker inference is here, for reference.
With the script, the inferences will be written to a JSON file like this:
{"audio_filepath": "/Users/xujinghua/test_jxu/jxutest2.wav", "duration": 7.237369614512471, "label": "jxu", "infer": "mdmc"}
Renamed spkr_get_emb.py to extract_speaker_embeddings.py to make things clear. Added the speaker_reco_infer.py script on request. @OmarHory please refer to the relevant docstrings in the scripts. #1793
Exactly what I needed, much thanks! @nithinraok
I have been using https://github.com/NVIDIA/NeMo/tree/main/tutorials/speaker_recognition.
There is a way to get embeddings for speaker recognition (https://github.com/NVIDIA/NeMo/blob/main/examples/speaker_recognition/spkr_get_emb.py), but I didn't find any information on how to use those embeddings for speaker recognition.
Do I need to build something else on top of them? Do I compute cosine similarity?
Please advise.