NVIDIA / NeMo

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)
https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html
Apache License 2.0

Speaker Biometrics #2368

CaioRochaGoes closed this issue 3 years ago

CaioRochaGoes commented 3 years ago

Hi, how are you? Could you help me with a question about "Speaker Recognition Verification"? After the whole training process you give us the option to extract the speakers' embeddings, but we would like to know whether we can also extract a speaker ID. We also looked at your other scripts, "Speaker Diarization Inference" and "ASR with Speaker Diarization", but they don't reach the point we want. Do you have an example of how I can get a speaker's biometrics, so that I can calculate the probability that the speaker in an audio recording really is who they claim to be?

nithinraok commented 3 years ago

Hi, please look at https://github.com/NVIDIA/NeMo/blob/main/examples/speaker_recognition/speaker_reco_infer.py. We also explained what the other scripts do at https://github.com/NVIDIA/NeMo/blob/main/examples/speaker_recognition/README.md
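
For illustration, here is a minimal sketch of the embedding-comparison idea behind that script. It assumes a NeMo version where EncDecSpeakerLabelModel exposes get_embedding; the checkpoint and audio paths are placeholders, not files from this thread.

```python
# Minimal sketch: extract speaker embeddings with a trained model and
# compare two utterances with cosine similarity.
import torch
from nemo.collections.asr.models import EncDecSpeakerLabelModel

# Load a trained .nemo checkpoint (path is a placeholder).
model = EncDecSpeakerLabelModel.restore_from("SpeakerNet.nemo")

# get_embedding returns a fixed-size vector, the speaker's "voiceprint".
emb_a = model.get_embedding("enroll_utterance.wav").squeeze()
emb_b = model.get_embedding("test_utterance.wav").squeeze()

# Cosine similarity in [-1, 1]; higher means more likely the same speaker.
score = torch.nn.functional.cosine_similarity(emb_a, emb_b, dim=-1)
print(f"similarity: {score.item():.3f}")
```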

CaioRochaGoes commented 3 years ago

Hi, thanks for your answer. I used the script you recommended, https://github.com/NVIDIA/NeMo/blob/main/examples/speaker_recognition/speaker_reco_infer.py, but I ended up with a dubious result: in the embeddings_manifest_infer.json file it generated at the end, the "infer" field marked all the audios, which belong to different speakers, as coming from a single speaker. I expected the "infer" field to match each audio's label, or, where it was wrong, to at least alternate between right and wrong predictions, which would give me more confidence that it was really attempting the biometrics.

Is this embeddings_manifest_infer.json result caused by a wrong parameter on my part, or by some prior configuration I still have to do?

These are the parameters I passed:

!python /content/speaker_reco_infer.py --spkr_model="/content/nemo_experiments/SpeakerNet/SpeakerNet.nemo" --train_manifest="/content/data/an4/wav/an4_clstk/train.json" --test_manifest="/content/data/an4/wav/an4_clstk/dev.json"

I used the an4 dataset for this test; I also used my own dataset and got similar results.

I got the same results when I switched --test_manifest="/content/data/an4/wav/an4_clstk/dev.json" to --test_manifest="/content/embeddings_manifest.json".

For both datasets I used the "Speaker_Recognition_Verification" script for training and for the fine-tuning process.

I tested with training and fine-tuning runs of 5 and 70 epochs.

nithinraok commented 3 years ago

an4 is a toy dataset that I used to demonstrate the training process, so I would not recommend using it unless you carefully shrink the architecture to avoid overfitting.

Regarding the script, this is the process to follow:

  1. If you want to use an4, decrease the architecture size. For other datasets, use a pretrained model and finetune it.
  2. Once you have trained the model, run this script and pass the trained model, the manifest file used for training (this is used to map the speaker labels), and the test manifest file (the audio you want to test the model on).
  3. The test labels will be saved to a JSON file with the inferred labels appended (see the sketch below).
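
As a rough sketch of checking that output file: this assumes, as this thread suggests, that the manifest has one JSON object per line with the ground truth under a "label" key and the prediction under an "infer" key; adjust the key names if your version writes something different.

```python
# Sketch: compare ground-truth labels with inferred labels in the output
# manifest (NeMo manifests are JSON-lines: one JSON object per line).
# The "label"/"infer" key names are assumptions based on this thread.
import json

correct = total = 0
with open("embeddings_manifest_infer.json") as f:
    for line in f:
        entry = json.loads(line)
        total += 1
        if entry["label"] == entry["infer"]:
            correct += 1

print(f"accuracy: {correct}/{total} = {correct / total:.2%}")
```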
CaioRochaGoes commented 3 years ago

Good afternoon.

Understood. Let me describe my need, to see if you can help. I need to identify, in a telephone recording, whether the speaker is one of the people we trained the model on.

I ran the script above, but I wasn't able to include a sample of my phone recording so it could be matched against the speakers we enrolled during training (we trained a model with our own speakers).
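
For that use case, a common pattern is to enroll each known speaker by averaging the embeddings of a few of their recordings, then score new phone audio against the enrolled profile with a threshold. Below is a minimal sketch under the same assumptions as above (EncDecSpeakerLabelModel with get_embedding); the file names and the 0.7 threshold are placeholders you would calibrate on held-out data.

```python
# Sketch: enroll a speaker from known utterances, then decide whether a
# new phone recording matches. Paths and threshold are placeholders.
import torch
from nemo.collections.asr.models import EncDecSpeakerLabelModel

model = EncDecSpeakerLabelModel.restore_from("SpeakerNet.nemo")

def embed(path):
    # L2-normalize so cosine similarity reduces to a dot product.
    e = model.get_embedding(path).squeeze()
    return e / e.norm()

# Enrollment: average several clips of the known speaker into one profile.
enroll_clips = ["speaker_1.wav", "speaker_2.wav", "speaker_3.wav"]  # hypothetical
profile = torch.stack([embed(p) for p in enroll_clips]).mean(dim=0)
profile = profile / profile.norm()

# Verification: score the phone recording against the enrolled profile.
score = torch.dot(profile, embed("phone_call.wav")).item()
THRESHOLD = 0.7  # assumption; tune on your own validation set
print(f"score={score:.3f} -> {'accept' if score >= THRESHOLD else 'reject'}")
```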