NVIDIA / NeMo

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)
https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html
Apache License 2.0

Speaker Verification #4706

Closed iddqd2d closed 2 years ago

iddqd2d commented 2 years ago

Hi!

I trained the model using: https://github.com/NVIDIA/NeMo/blob/main/tutorials/speaker_tasks/Speaker_Identification_Verification.ipynb

import nemo.collections.asr as nemo_asr
from omegaconf import OmegaConf
import torch
import pytorch_lightning as pl
import os

speaker_model = nemo_asr.models.EncDecSpeakerLabelModel.load_from_checkpoint('/home/denis/ttt/result/TitaNet/2022-08-05_08-47-00/checkpoints/TitaNet--val_loss=2.3548-epoch=9-last.ckpt')
#decision = speaker_model.verify_speakers('/home/denis/ttt/data/an4/wav/an4_clstk/mjes/cen7-mjes-b.wav','/home/denis/ttt/data/an4/wav/an4_clstk/mjes/cen2-mjes-b.wav')
decision = speaker_model.verify_speakers('/home/denis/ttt/data/an4/wav/an4_clstk/mjes/cen7-mjes-b.wav','data/an4/wav/an4test_clstk/fcaw/an406-fcaw-b.wav')

print(decision)

It works!

I can compare two files. Is there a method to compare all my speakers (many files) with another single file? Do I need to use a loop or another method?

nithinraok commented 2 years ago

You can use this function to get batch embeddings: https://github.com/NVIDIA/NeMo/blob/4cd9b3449cbfedc671348fbabbe8e3a55fbd659d/nemo/collections/asr/models/label_models.py#L420

Once you have the embeddings, you can compare them using a cosine similarity score. For example, see this script for how it's done: https://github.com/NVIDIA/NeMo/blob/4cd9b3449cbfedc671348fbabbe8e3a55fbd659d/examples/speaker_tasks/recognition/voxceleb_eval.py#L73
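As a rough illustration of the comparison step described above, here is a minimal sketch that scores one query embedding against several enrolled embeddings with cosine similarity. It assumes the embeddings have already been extracted and loaded into plain Python lists; the names `enrolled`, `query`, and `score_against_all` and the 3-dimensional vectors are hypothetical and not part of NeMo.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def score_against_all(query, enrolled):
    """Return {name: cosine score} for one query embedding vs. every enrolled one."""
    return {name: cosine_similarity(query, emb) for name, emb in enrolled.items()}

# Hypothetical toy embeddings; real speaker embeddings are much longer vectors.
enrolled = {
    "speaker_a": [0.9, 0.1, 0.0],
    "speaker_b": [0.0, 1.0, 0.2],
}
query = [0.8, 0.2, 0.1]

scores = score_against_all(query, enrolled)
best = max(scores, key=scores.get)  # enrolled speaker with the highest score
print(scores, best)
```

A plain loop (here a dict comprehension) over the enrolled embeddings is enough; whether the best score counts as a match still depends on a threshold chosen for your data.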

iddqd2d commented 2 years ago

Thanks. It works! I have two audio files from the same speaker. I extracted the embedding of each audio file into a single embeddings file (embeddings.pkl). Should I merge the embeddings if they belong to the same speaker? Sorry for the stupid question.

How long (in seconds) should the audio file be?

nithinraok commented 2 years ago

It depends on your use case. You could try averaging the cosine scores, or averaging the embeddings of each utterance per speaker (if you have many samples per speaker).

There is no constraint on the duration of the file; it can fall in the range of (1 sec, 20 sec] or be longer than that.
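The two options mentioned above (averaging cosine scores versus averaging embeddings) could be sketched as follows. This is only an illustration under assumptions: the embeddings are plain Python lists, and the helper names and the 2-dimensional toy vectors are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def mean_embedding(embeddings):
    """Element-wise mean of several equal-length embeddings."""
    n = len(embeddings)
    return [sum(values) / n for values in zip(*embeddings)]

def score_by_mean_embedding(query, speaker_embeddings):
    """Option 1: average the speaker's embeddings first, then score once."""
    return cosine_similarity(query, mean_embedding(speaker_embeddings))

def score_by_mean_score(query, speaker_embeddings):
    """Option 2: score against every utterance, then average the scores."""
    scores = [cosine_similarity(query, emb) for emb in speaker_embeddings]
    return sum(scores) / len(scores)

# Hypothetical toy data: two utterances of one speaker and one query.
speaker_utterances = [[1.0, 0.0], [0.0, 1.0]]
query = [1.0, 1.0]

print(score_by_mean_embedding(query, speaker_utterances))
print(score_by_mean_score(query, speaker_utterances))
```

The two strategies generally give different numbers, so a decision threshold tuned for one does not automatically carry over to the other. With NumPy arrays, the same mean could be written as `np.mean(np.stack(embeddings), axis=0)`.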

iddqd2d commented 2 years ago

Another stupid question. Here are the embeddings of the same speaker. How do I average or combine them? Please give me an example.

an4_clstk@mjes@an158-mjes-b.wav
[-7.7393e-02  7.1899e-02 -1.5686e-02  2.3895e-02  2.4094e-02  2.8503e-02
  1.6495e-02  6.8893e-03  4.8065e-02 -1.1658e-02  7.3853e-02  7.6904e-02
 -3.3722e-02 -2.4994e-02  8.6426e-02  8.9493e-03 -3.0289e-02 -1.7603e-01
 -2.8412e-02 -1.1163e-01 -1.4580e-02  5.7373e-02 -6.9519e-02 -1.2688e-02
  6.5857e-02 -5.6091e-02  2.1057e-02 -8.9600e-02 -2.0309e-02 -1.7685e-02
 -1.5759e-01  5.5298e-02  5.1880e-02  1.0577e-01 -4.3427e-02 -1.8661e-02
  3.4790e-02 -2.6215e-02  5.2917e-02 -8.8562e-02 -7.4341e-02 -4.7485e-02
 -3.2043e-02 -3.3203e-02 -8.2153e-02  5.3162e-02 -9.3628e-02  4.1733e-03
 -4.2725e-02 -5.4565e-02  6.8420e-02 -5.7190e-02  1.5507e-03 -1.0358e-01
  1.9092e-01  1.3824e-02 -1.2527e-02  2.7069e-02  7.2693e-02  5.3375e-02
 -4.8767e-02  1.0223e-01 -8.2626e-03  9.0759e-02  2.4155e-02  1.8036e-02
  1.8860e-02  4.6936e-02  1.0376e-02  5.4077e-02 -9.9060e-02 -7.7534e-04
  1.1310e-01  1.7290e-03  5.2185e-03 -5.7159e-02  4.2603e-02 -3.7689e-02
  7.6538e-02  2.5711e-02 -8.7524e-02  1.5388e-02  5.1758e-02  8.6365e-02
 -1.3733e-01 -1.2161e-02  6.2622e-02 -1.2561e-01  1.0175e-01 -1.5732e-02
  4.3030e-03  7.7637e-02  8.4991e-03  1.4913e-04  6.3721e-02 -3.8788e-02
 -5.3062e-03 -6.8237e-02  1.9775e-01  5.1941e-02 -8.9111e-02  6.6284e-02
 -2.0782e-02 -3.0121e-02 -1.4313e-02  5.2185e-03  1.0602e-01 -1.1987e-01
  4.7302e-02  4.9011e-02  1.7944e-01 -6.1951e-02 -2.2217e-02  5.2734e-02
  9.0637e-02 -4.0100e-02 -1.0185e-02 -1.3420e-02 -1.5211e-03 -4.0344e-02
  4.9255e-02 -1.2733e-02 -1.5100e-01 -1.9763e-01  6.9763e-02 -9.1309e-02
 -7.5722e-03 -1.3904e-01  1.6602e-02 -9.9976e-02 -6.8726e-02 -6.7749e-03
  1.9882e-02  4.7241e-02  2.9587e-02 -1.3049e-01 -6.9702e-02 -2.0386e-01
 -6.1188e-02 -1.0712e-02 -1.6006e-02 -8.2397e-02 -9.3384e-02 -1.1299e-02
 -1.5540e-01 -2.9129e-02  1.4252e-02  6.0425e-02 -6.0791e-02 -3.9062e-02
  4.9561e-02  9.6436e-03  9.6130e-03  4.5654e-02  4.0558e-02 -5.9937e-02
 -5.4291e-02  4.0894e-02  1.3390e-02  4.1580e-03 -8.9172e-02 -5.7465e-02
 -1.1377e-01 -1.2283e-02  3.7518e-03  9.0088e-02 -4.4189e-02  1.0181e-01
  1.4465e-01  7.9407e-02  1.6272e-01 -4.6051e-02 -4.8065e-02  5.6702e-02
 -2.6337e-02 -4.7485e-02  1.4514e-01  1.3359e-02 -5.5008e-03  3.1921e-02
 -1.6406e-01 -3.9597e-03 -3.4424e-02  6.3049e-02 -5.2002e-02  8.5083e-02
 -8.6212e-03 -1.0583e-01 -4.4136e-03  7.3730e-02 -1.3281e-01  7.2327e-03]
an4_clstk@mjes@an156-mjes.wav
[-0.09344    0.08215    0.003399  -0.0253     0.06885    0.03345
 -0.01657   -0.01843   -0.007008  -0.0709     0.0504     0.127
 -0.01033   -0.04016    0.04947    0.02902   -0.01639   -0.11926
 -0.01955   -0.0529    -0.04865    0.06335   -0.03406   -0.09686
  0.1472    -0.03247   -0.01927    0.0164    -0.009026   0.011894
 -0.1614    -0.0192     0.03717    0.11725   -0.06158   -0.04156
  0.13      -0.01598    0.03552   -0.07825   -0.0834    -0.06055
 -0.0801    -0.000677  -0.04745    0.0804    -0.0946    -0.009125
 -0.066     -0.05225    0.01304   -0.06027    0.0992    -0.1227
  0.1426    -0.02565   -0.0541    -0.001242   0.0856    -0.0356
 -0.03918    0.06076   -0.05447    0.03375   -0.00906    0.02576
 -0.02682    0.1121     0.04538    0.1519    -0.08435   -0.1095
  0.1168     0.00888    0.02394    0.04117    0.012436   0.01723
  0.1125    -0.01991   -0.0914    -0.01188    0.03168    0.03732
 -0.1384    -0.044      0.0551    -0.093      0.05374   -0.02217
 -0.003479   0.001745   0.02647    0.03424    0.08636   -0.02934
  0.03766   -0.11365    0.1236     0.0417    -0.0258     0.06604
 -0.0696    -0.0324     0.01909    0.001274   0.1032    -0.1181
 -0.05035    0.09766    0.1595    -0.1442    -0.0521     0.004784
  0.11255    0.011505  -0.05356    0.0358     0.00988   -0.002363
 -0.06055    0.02724   -0.1447    -0.2079     0.1046    -0.1378
 -0.0439    -0.0968     0.063     -0.0155    -0.1099    -0.00885
  0.0004249  0.0672     0.0638    -0.1141    -0.0401    -0.10675
 -0.002323   0.01955   -0.0448    -0.0671    -0.0749     0.03134
 -0.0753    -0.07947    0.05814    0.02565   -0.004845  -0.01746
  0.04117   -0.05093   -0.0349    -0.01585    0.03647   -0.067
 -0.03096    0.05692    0.011734  -0.0432    -0.06354   -0.0192
 -0.05814   -0.05106    0.07306    0.08093    0.001845   0.04974
  0.1781     0.08527    0.1061    -0.09827   -0.01003    0.1543
  0.04852   -0.05978    0.089      0.0758     0.01471   -0.0127
 -0.1364    -0.0579     0.00239    0.02454   -0.07983    0.05618
 -0.08746   -0.1178    -0.0962     0.0387    -0.122      0.02303  ]

For speaker verification, is it better to use embeddings from short or long audio?

iddqd2d commented 2 years ago

Hi! Should I take the average of the first element of the first array and the first element of the second array?

For speaker verification, is it better to use embeddings from short or long audio?

nithinraok commented 2 years ago

In the above example, both are 192-dimensional vectors, so you can average them element-wise. The result is again a 192-dimensional embedding.

There is no constraint on the duration of the file; it can fall in the range of (1 sec, 20 sec] or be longer than that. On average, you can take about 5 sec.

iddqd2d commented 2 years ago

First element from an4_clstk@mjes@an158-mjes-b.wav: -7.7393e-02
First element from an4_clstk@mjes@an156-mjes.wav: -0.09344
Should I compute (-7.7393e-02 + -0.09344) / 2 and put the result into another array?

iddqd2d commented 2 years ago

Please help with the averaging.

nithinraok commented 2 years ago

Yes, add both arrays element-wise (and divide by 2 to get the mean); the result will be a 192-dimensional embedding.
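To make the arithmetic concrete, here is a small worked sketch using only the first three of the 192 values from each embedding posted earlier in this thread. The variable names and the `probe` vector are hypothetical. It also checks that the plain sum and the mean give the same cosine score, since cosine similarity ignores vector length.

```python
import math

# First three values of each embedding quoted in the thread (the full vectors have 192).
emb_an158 = [-7.7393e-02, 7.1899e-02, -1.5686e-02]
emb_an156 = [-0.09344, 0.08215, 0.003399]

# Element-wise sum, then divide by the number of embeddings to get the mean.
summed = [a + b for a, b in zip(emb_an158, emb_an156)]
averaged = [s / 2 for s in summed]

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Hypothetical probe vector, used only to show that the scale does not matter.
probe = [0.1, -0.2, 0.3]
score_sum = cosine_similarity(probe, summed)
score_mean = cosine_similarity(probe, averaged)

print(averaged)
print(score_sum, score_mean)
```

So the first element of the averaged embedding is (-7.7393e-02 + -0.09344) / 2, and the same is done for each of the remaining positions.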