Closed: iddqd2d closed this 2 years ago
You can use this function to get batch embeddings https://github.com/NVIDIA/NeMo/blob/4cd9b3449cbfedc671348fbabbe8e3a55fbd659d/nemo/collections/asr/models/label_models.py#L420
Once you have the embeddings, you can compare them using a cosine similarity score. For an example of how it's done, see this script: https://github.com/NVIDIA/NeMo/blob/4cd9b3449cbfedc671348fbabbe8e3a55fbd659d/examples/speaker_tasks/recognition/voxceleb_eval.py#L73
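For readers who want to see the comparison step in isolation, here is a minimal NumPy sketch of a cosine similarity score between two embedding vectors. It is not the code from the linked script; the helper name `cosine_similarity` and the toy 3-dimensional vectors are assumptions made for illustration, and real embeddings would be loaded from your own files.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine similarity between two 1-D embedding vectors (hypothetical helper)."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    # Dot product divided by the product of the vector lengths.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Toy stand-ins for two speaker embeddings (real ones are much longer).
emb_a = np.array([1.0, 0.0, 1.0])
emb_b = np.array([1.0, 0.0, 0.0])

score = cosine_similarity(emb_a, emb_b)
print(score)
```

A score close to 1 means the two vectors point in nearly the same direction. The threshold for deciding "same speaker" is something you would need to tune on your own data.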
Thanks, it works! I have two audio files from the same speaker. I get the embeddings for both audio files in one embedding file (embeddings.pkl). Should I merge the embeddings if they belong to the same speaker? Sorry for the stupid question.
How long (in seconds) should the audio file be?
It depends on your use case. You could either average the cosine scores or average the embeddings of each utterance per speaker (if you have many samples per speaker).
There is no constraint on the duration of the file. It can fall in the range (1 sec, 20 sec] or be longer than that.
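The two options mentioned above can be sketched roughly as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the 2-dimensional vectors are toy data, and L2-normalizing each embedding before averaging is a design choice of this sketch rather than something prescribed in this thread.

```python
import numpy as np


def l2_normalize(x):
    """Scale vectors to unit length along the last axis."""
    x = np.asarray(x, dtype=np.float64)
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def score_by_mean_embedding(enroll_embs, test_emb):
    """Option A: average the speaker's embeddings first, then score once."""
    centroid = l2_normalize(np.mean(l2_normalize(enroll_embs), axis=0))
    return float(np.dot(centroid, l2_normalize(test_emb)))


def score_by_mean_cosine(enroll_embs, test_emb):
    """Option B: score every utterance separately, then average the scores."""
    scores = l2_normalize(enroll_embs) @ l2_normalize(test_emb)
    return float(np.mean(scores))


# Toy data: two enrollment embeddings for one speaker and one test embedding.
enroll = np.array([[1.0, 0.0], [0.0, 1.0]])
test = np.array([1.0, 0.0])

mean_emb_score = score_by_mean_embedding(enroll, test)
mean_cos_score = score_by_mean_cosine(enroll, test)
print(mean_emb_score, mean_cos_score)
```

The two strategies generally give different numbers, so a decision threshold tuned for one should not be reused for the other without checking.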
Another stupid question: here are the embeddings of the same speaker. How do I average or combine them? Please give me an example.
an4_clstk@mjes@an158-mjes-b.wav
[-7.7393e-02 7.1899e-02 -1.5686e-02 2.3895e-02 2.4094e-02 2.8503e-02
1.6495e-02 6.8893e-03 4.8065e-02 -1.1658e-02 7.3853e-02 7.6904e-02
-3.3722e-02 -2.4994e-02 8.6426e-02 8.9493e-03 -3.0289e-02 -1.7603e-01
-2.8412e-02 -1.1163e-01 -1.4580e-02 5.7373e-02 -6.9519e-02 -1.2688e-02
6.5857e-02 -5.6091e-02 2.1057e-02 -8.9600e-02 -2.0309e-02 -1.7685e-02
-1.5759e-01 5.5298e-02 5.1880e-02 1.0577e-01 -4.3427e-02 -1.8661e-02
3.4790e-02 -2.6215e-02 5.2917e-02 -8.8562e-02 -7.4341e-02 -4.7485e-02
-3.2043e-02 -3.3203e-02 -8.2153e-02 5.3162e-02 -9.3628e-02 4.1733e-03
-4.2725e-02 -5.4565e-02 6.8420e-02 -5.7190e-02 1.5507e-03 -1.0358e-01
1.9092e-01 1.3824e-02 -1.2527e-02 2.7069e-02 7.2693e-02 5.3375e-02
-4.8767e-02 1.0223e-01 -8.2626e-03 9.0759e-02 2.4155e-02 1.8036e-02
1.8860e-02 4.6936e-02 1.0376e-02 5.4077e-02 -9.9060e-02 -7.7534e-04
1.1310e-01 1.7290e-03 5.2185e-03 -5.7159e-02 4.2603e-02 -3.7689e-02
7.6538e-02 2.5711e-02 -8.7524e-02 1.5388e-02 5.1758e-02 8.6365e-02
-1.3733e-01 -1.2161e-02 6.2622e-02 -1.2561e-01 1.0175e-01 -1.5732e-02
4.3030e-03 7.7637e-02 8.4991e-03 1.4913e-04 6.3721e-02 -3.8788e-02
-5.3062e-03 -6.8237e-02 1.9775e-01 5.1941e-02 -8.9111e-02 6.6284e-02
-2.0782e-02 -3.0121e-02 -1.4313e-02 5.2185e-03 1.0602e-01 -1.1987e-01
4.7302e-02 4.9011e-02 1.7944e-01 -6.1951e-02 -2.2217e-02 5.2734e-02
9.0637e-02 -4.0100e-02 -1.0185e-02 -1.3420e-02 -1.5211e-03 -4.0344e-02
4.9255e-02 -1.2733e-02 -1.5100e-01 -1.9763e-01 6.9763e-02 -9.1309e-02
-7.5722e-03 -1.3904e-01 1.6602e-02 -9.9976e-02 -6.8726e-02 -6.7749e-03
1.9882e-02 4.7241e-02 2.9587e-02 -1.3049e-01 -6.9702e-02 -2.0386e-01
-6.1188e-02 -1.0712e-02 -1.6006e-02 -8.2397e-02 -9.3384e-02 -1.1299e-02
-1.5540e-01 -2.9129e-02 1.4252e-02 6.0425e-02 -6.0791e-02 -3.9062e-02
4.9561e-02 9.6436e-03 9.6130e-03 4.5654e-02 4.0558e-02 -5.9937e-02
-5.4291e-02 4.0894e-02 1.3390e-02 4.1580e-03 -8.9172e-02 -5.7465e-02
-1.1377e-01 -1.2283e-02 3.7518e-03 9.0088e-02 -4.4189e-02 1.0181e-01
1.4465e-01 7.9407e-02 1.6272e-01 -4.6051e-02 -4.8065e-02 5.6702e-02
-2.6337e-02 -4.7485e-02 1.4514e-01 1.3359e-02 -5.5008e-03 3.1921e-02
-1.6406e-01 -3.9597e-03 -3.4424e-02 6.3049e-02 -5.2002e-02 8.5083e-02
-8.6212e-03 -1.0583e-01 -4.4136e-03 7.3730e-02 -1.3281e-01 7.2327e-03]
an4_clstk@mjes@an156-mjes.wav
[-0.09344 0.08215 0.003399 -0.0253 0.06885 0.03345
-0.01657 -0.01843 -0.007008 -0.0709 0.0504 0.127
-0.01033 -0.04016 0.04947 0.02902 -0.01639 -0.11926
-0.01955 -0.0529 -0.04865 0.06335 -0.03406 -0.09686
0.1472 -0.03247 -0.01927 0.0164 -0.009026 0.011894
-0.1614 -0.0192 0.03717 0.11725 -0.06158 -0.04156
0.13 -0.01598 0.03552 -0.07825 -0.0834 -0.06055
-0.0801 -0.000677 -0.04745 0.0804 -0.0946 -0.009125
-0.066 -0.05225 0.01304 -0.06027 0.0992 -0.1227
0.1426 -0.02565 -0.0541 -0.001242 0.0856 -0.0356
-0.03918 0.06076 -0.05447 0.03375 -0.00906 0.02576
-0.02682 0.1121 0.04538 0.1519 -0.08435 -0.1095
0.1168 0.00888 0.02394 0.04117 0.012436 0.01723
0.1125 -0.01991 -0.0914 -0.01188 0.03168 0.03732
-0.1384 -0.044 0.0551 -0.093 0.05374 -0.02217
-0.003479 0.001745 0.02647 0.03424 0.08636 -0.02934
0.03766 -0.11365 0.1236 0.0417 -0.0258 0.06604
-0.0696 -0.0324 0.01909 0.001274 0.1032 -0.1181
-0.05035 0.09766 0.1595 -0.1442 -0.0521 0.004784
0.11255 0.011505 -0.05356 0.0358 0.00988 -0.002363
-0.06055 0.02724 -0.1447 -0.2079 0.1046 -0.1378
-0.0439 -0.0968 0.063 -0.0155 -0.1099 -0.00885
0.0004249 0.0672 0.0638 -0.1141 -0.0401 -0.10675
-0.002323 0.01955 -0.0448 -0.0671 -0.0749 0.03134
-0.0753 -0.07947 0.05814 0.02565 -0.004845 -0.01746
0.04117 -0.05093 -0.0349 -0.01585 0.03647 -0.067
-0.03096 0.05692 0.011734 -0.0432 -0.06354 -0.0192
-0.05814 -0.05106 0.07306 0.08093 0.001845 0.04974
0.1781 0.08527 0.1061 -0.09827 -0.01003 0.1543
0.04852 -0.05978 0.089 0.0758 0.01471 -0.0127
-0.1364 -0.0579 0.00239 0.02454 -0.07983 0.05618
-0.08746 -0.1178 -0.0962 0.0387 -0.122 0.02303 ]
For speaker verification, is it better to use embeddings from short or long audio files?
Hi! Should I take the average of the first element of the first array and the first element of the second array?
In the above example, both are 192-dimensional vectors, so you can average them element-wise. You would get a single 192-dimensional embedding.
There is no constraint on the duration of the file. It can fall in the range (1 sec, 20 sec] or be longer than that. On average, you can use about 5 sec.
First element from an4_clstk@mjes@an158-mjes-b.wav: -7.7393e-02
First element from an4_clstk@mjes@an156-mjes.wav: -0.09344
Should I compute (-7.7393e-02 + -0.09344) / 2 and put the result into another array?
Please help with the averaging.
Yes. Add both arrays element-wise (and divide by 2 to get the mean); the result will be a 192-dimensional embedding.
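As a concrete illustration of the element-wise average, here is a small NumPy sketch. To keep it short it uses only the first three values of each embedding posted above; with the full arrays the same call would return a 192-dimensional vector. The variable names are assumptions for this example.

```python
import numpy as np

# First three values of each embedding from the post above (truncated for brevity).
emb_1 = np.array([-7.7393e-02, 7.1899e-02, -1.5686e-02])  # an158-mjes-b.wav
emb_2 = np.array([-0.09344, 0.08215, 0.003399])           # an156-mjes.wav

# Stack into shape (2, dim) and average over the utterance axis.
speaker_emb = np.mean(np.stack([emb_1, emb_2]), axis=0)

print(speaker_emb.shape)
print(speaker_emb[0])  # same value as (-7.7393e-02 + -0.09344) / 2
```

Because cosine similarity ignores vector length, the plain sum and the mean of the two arrays give the same cosine score against any other embedding.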
Hi!
I trained the model using: https://github.com/NVIDIA/NeMo/blob/main/tutorials/speaker_tasks/Speaker_Identification_Verification.ipynb
It works!
I can compare two files. Is there a way to compare all of my speakers (many files) against a single other file? Do I need to use a loop, or is there another method?
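One possible way to avoid an explicit loop over pairs is to stack all speaker embeddings into a matrix and score them against the single test embedding with one matrix-vector product. The sketch below is plain NumPy, not a NeMo API; the function name `rank_speakers`, the speaker names, and the toy 2-dimensional vectors are all assumptions for illustration.

```python
import numpy as np


def rank_speakers(speaker_embs, test_emb):
    """Return (name, cosine score) pairs sorted from best to worst match.

    speaker_embs: dict mapping a speaker name to one embedding
                  (for example, the average over that speaker's files).
    test_emb:     embedding of the single file to compare against.
    """
    names = list(speaker_embs)
    matrix = np.stack([np.asarray(speaker_embs[n], dtype=np.float64) for n in names])
    matrix /= np.linalg.norm(matrix, axis=1, keepdims=True)  # unit-length rows
    query = np.asarray(test_emb, dtype=np.float64)
    query = query / np.linalg.norm(query)
    scores = matrix @ query          # one cosine score per speaker
    order = np.argsort(-scores)      # highest score first
    return [(names[i], float(scores[i])) for i in order]


# Toy data standing in for real speaker embeddings.
speaker_embs = {"spk_a": [1.0, 0.0], "spk_b": [0.0, 1.0], "spk_c": [1.0, 1.0]}
ranking = rank_speakers(speaker_embs, [0.9, 0.1])
for name, score in ranking:
    print(name, round(score, 4))
```

Whether the top-ranked speaker is accepted as a match would still depend on a threshold chosen for your data.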