angular=True changes the loss from softmax cross-entropy to softmax angular margin loss, which is the essential difference between training identification and verification models. See the paper for more info.
https://github.com/NVIDIA/NeMo/blob/v1.0.2/examples/speaker_recognition/speaker_reco.py is an example script for understanding the parameters and how to run training. Set num_classes to the number of classes in your dataset.
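For example, in the tutorial notebook config this roughly amounts to the overrides below (a sketch; the yaml path, manifest paths, and class count are placeholders you should adapt to your dataset, and the same keys can be passed as Hydra overrides to speaker_reco.py):

from omegaconf import OmegaConf

# Sketch: load the example config and set the decoder for angular margin training.
config = OmegaConf.load("conf/SpeakerNet_recognition_3x2x512.yaml")  # placeholder path to the example config
config.model.decoder.angular = True                                  # angular margin loss instead of plain cross-entropy
config.model.decoder.num_classes = 10                                # placeholder: number of distinct speakers in your train manifest
config.model.train_ds.manifest_filepath = "path/to/train_manifest.json"
config.model.validation_ds.manifest_filepath = "path/to/dev_manifest.json"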
@nithinraok Thanks for your quick reply. To follow up, my other concern, if you can give any clarification please: SpeakerNet is able to generate an embedding even for a new user not seen during training:
model_name = "path to/SpeakerNet.nemo" # a model a saved using this name
embs = restored_model_tuned.get_embedding('path to .wav file/ali.wav')
embs
How can I interpret this embedding? More precisely, is there any way to see, for example, the confidence of this vector for each of the classes in the training dataset? I expect to get (say we have 5 classes) a vector of dimension 5, where each value is the "probability" of belonging to the corresponding class.
Can SpeakerNet be used to do text-dependent speaker verification, or is it only text-independent?
The idea of speaker verification is to get speaker-characteristic embeddings, whether from seen or unseen speakers.
If you have trained on known speaker labels and would like to know the likelihood of an unknown speaker belonging to each of the known speaker labels, then you may refer to this inference script.
We trained SpeakerNet on text-independent audio samples; we haven't trained a model with text-dependent samples. That is, we haven't fine-tuned it to differentiate samples of the same speaker speaking different sentences (you could try this as post-processing along with ASR).
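For verification itself, the usual recipe with these embeddings is to compare an enrollment utterance and a test utterance with cosine similarity and threshold the score; a rough sketch (the .nemo path, wav paths, and threshold below are placeholders):

import torch
from nemo.collections.asr.models import EncDecSpeakerLabelModel

# Rough sketch of embedding-based verification; paths and threshold are placeholders.
model = EncDecSpeakerLabelModel.restore_from("path/to/SpeakerNet.nemo")
emb_enroll = model.get_embedding("path/to/enrollment.wav").squeeze()
emb_test = model.get_embedding("path/to/test.wav").squeeze()

score = torch.nn.functional.cosine_similarity(emb_enroll, emb_test, dim=0)
threshold = 0.7  # placeholder; tune on a held-out trial list such as VoxCeleb-O
print("same speaker" if score.item() >= threshold else "different speaker")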
@nithinraok If it is unseen, the inference script does not work. I get:
362
363 if not self.is_regression_task:
--> 364 t = torch.tensor(self.label2id[sample.label]).long()
365 else:
366 t = torch.tensor(sample.label).float()
KeyError: 'ali'.
Did you train the model with one of the labels as ali?
@nithinraok I trained a model on 10 speakers, which do not include the label "ali".
Are you making sure you are running this script (https://github.com/NVIDIA/NeMo/blob/main/examples/speaker_recognition/speaker_reco_infer.py)? Please read the script carefully.
@nithinraok Yes, I trained the model using this script on an4, then I fine-tuned this model on an4test_clstk, where I kept one of the speaker folders out of the an4test_clstk dataset. So an4test_clstk now includes only 9 users, and the fine-tuned model has 9 classes.
For the speaker I left out (i.e., a directory including several .wav files), I take one .wav file and get its embedding as:
model_name = "path to the tuned model/SpeakerNet.nemo"
embs = restored_model_tuned.get_embedding('/data/an4/wav/fcaw/an406-fcaw-b.wav')
embs
So I get:
tensor([[ 0.0628, 0.8158, 1.0182, -1.5353, -0.3223, -1.0309, -0.4724, -0.3431, -1.5693, 0.1261, 0.0043, 0.8729, -0.3124, -1.2042, 0.5499, -0.7432, 1.3909, 0.9304, -0.3645, -0.7780, -0.4095, 1.2481, -1.0341, -0.1919, -0.1197, 0.1906, 1.0383, -0.3214, -0.9363, 0.7387, 1.0696, 1.5287, -1.3526, -0.8538, 0.3797, 0.8821, -0.7222, -0.3843, 0.7459, 0.9129, -1.3171, 2.0432, -0.4796, 0.2556, -0.9999, -0.5811, -1.0879, 0.1496, 0.4270, 0.5138, 0.2044, -0.0226, -1.3580, 0.5890, -1.8492, -0.1540, -0.1218, -1.1495, -0.9455, -1.2832, 0.2307, 1.4360, -0.6681, -0.5309, 1.0901, -0.5038, -0.9066, 0.3398, 1.2139, 0.0351, -0.3148, 1.4173, -0.0249, 0.0476, 0.5739, 0.7722, -0.2927, 1.8214, 1.3135, -0.2221, -0.8521, 0.9746, -0.8278, -1.2059, -2.3147, 0.7215, 0.4020, 0.3031, -0.2162, -0.3377, -1.3632, -1.6063, 0.2243, 0.3508, -0.7327, -0.3402, -0.0525, 1.1609, 1.9040, 0.0563, 1.5826, 0.8405, 0.7513, 0.8623, 1.4027, 1.1775, -1.0972, 0.9596, 1.9852, 0.1637, -0.8007, -0.6635, 1.8842, 0.6261, -0.1851, -0.2628, 0.4175, 0.6944, -1.4411, 0.3121, -0.5813, 0.4987, -0.7594, -0.6640, -0.1867, 0.3791, 0.5758, -0.2246, 0.5192, -0.4775, 0.3416, -0.7758, 0.7634, -0.0862, 0.4723, -0.5569, -0.0256, 0.0503, -1.0169, -0.1003, 0.5141, 2.1522, -0.3198, 0.4960, 0.1381, 0.6249, -0.3770, 0.0555, 0.4514, 0.7407, 0.5322, 0.6873, -0.1153, 0.9056, -0.6777, 0.3232, 1.3341, -0.1190, -0.9058, 0.2725, -0.2453, 0.4950, -0.2520, -0.1055, 0.9721, -0.0037, 0.8440, 0.5853, -1.0333, -0.2512, -0.4601, 0.5693, -2.5064, -0.0537, -0.8892, 1.8675, -2.1216, 0.3092, 0.6009, -2.4113, -0.5679, 1.8963, 0.1143, -0.9205, 0.3332, -0.9309, -0.4478, 0.2629, -1.2584, 0.0832, 0.0703, -0.9151, -0.3628, -0.3248, 1.0698, 1.5338, -2.2140, 0.5097, -1.4500, -0.5145, -0.2576, 1.3373, 0.2425, 0.6314, -0.2355, -0.5822, -0.3950, -0.4419, -0.6131, 0.2085, -0.7185, 0.4203, 0.3511, 0.2807, 1.3838, -0.0275, -0.6528, 1.3631, 0.0490, -0.7483, -0.1411, 1.1782, 0.8723, 0.0349, -1.2281, -1.1216, -1.2506, 0.0251, 0.2119, -0.8995, 1.8134, -1.9559, -0.1287, -1.5117, -0.2602, 0.7671, 0.5615, -0.6297, -1.3117, -0.2834, 0.3295, -1.1493, -0.7166, -0.5089, 0.1731, -0.6314, 1.2046, -0.4664, -2.4469, 0.8355, 1.3662, -1.9601, -0.0029, 2.4909, -1.0977, 1.6115, -1.2090, 0.1100, 0.5226, -0.9079, -0.7301, -0.8700, -0.4344, 0.2953, 0.5744, 0.0756, 0.5443, -1.1131, 0.2120, 0.1031, -0.0898, -1.4411, -1.0233, -1.1302, 0.2796, 0.1526, -0.5348, 1.0946, 0.0495, 1.4064, 0.2829, -0.4838, 1.5434, -0.1496, 0.2943, 0.8958, 0.1034, 1.0696, 2.0203, -0.2404, -0.1914, -0.1386, 0.2784, 0.2057, -0.5085, 0.0162, 0.4139, -0.9091, -1.0556, -0.2362, -1.5062, 1.0594, -1.6849, -0.1765, -1.3083, -1.0235, -0.0536, 0.9683, 0.2545, 0.6179, -0.4821, 0.4345, -0.4641, 0.3064, 0.5140, 0.8807, -0.6108, 0.5328, 1.2156, -1.2854, -0.4222, 1.8813, 0.8873, -0.3094, 0.4829, -0.0651, -1.8508, 1.6000, -0.0474, 1.5053, 0.0736, 0.4032, -0.2092, 0.3491, -0.1942, 0.3481, 1.5789, 1.4165, -1.7132, 1.1152, 1.6150, 2.1679, 0.8146, 0.0437, -0.9857, 0.7709, 0.1608, -1.1367, -0.5146, 1.4041, -2.2360, 0.5758, -0.7333, -0.1009, 0.9430, 0.6600, -0.4132, -0.2258, 0.9846, 0.3689, -0.8209, -0.2413, -0.4222, 0.5732, -0.0594, 0.3323, -0.3429, -1.1169, 1.7542, 0.5960, 0.1414, -0.3205, -0.9614, 1.0233, 0.2286, -0.9375, -0.8822, 1.5812, -0.0318, -0.2842, 1.4292, 0.2807, 0.3899, 0.6885, -0.3253, 0.1750, 0.9151, -0.7703, -0.6582, 0.3794, 1.1303, -1.5559, -0.4148, 0.1063, -0.9048, -0.1196, -0.0712, 1.5589, 1.8074, 0.3628, 1.0805, -1.2641, -0.5647, -2.2094, -0.0438, -1.2688, -0.2659, 1.0689, -0.1788, 0.5957, -0.0506, 1.0473, 0.0565, -0.2183, -0.7412, -0.1839, -0.9485, 
-0.7508, -1.6509, 2.3409, 0.8754, 0.6159, 1.5166, 1.0600, 1.0329, 0.8794, 1.4565, 0.2864, 1.1625, -0.9837, -1.7411, 1.6233, -1.9482, -0.3844, -0.2919, -0.3689, 0.6861, -0.4511, -0.3608, 0.6986, 1.6351, -0.8251, 0.1964, 0.3781, -0.1032, -0.5941, 0.9788, 0.5949, 0.6924, -0.2773, -1.4484, -0.7374, -0.5549, -1.0321, 0.3914, -0.2236, -0.4248, 1.4976, 0.4575, 0.5289, 1.0058, -0.2888, -0.8227, 0.8932, 0.8596, 0.1890, -1.5829, 0.9727, 0.2821, 1.3765, 1.7369, -0.2524, -2.1842, -0.4125, 0.1442, 1.6489, 0.3097, 1.2120, 0.4387, 0.4237, -0.1217, -0.7276, 0.9319, -1.2512, 0.2688, -0.4336, -0.8974, -1.5635, 0.3878, -0.7559, 0.9945, 2.7519, -1.0840, 1.0679, -0.0068, 1.7468, -0.9091, 0.2995, 0.4645, -1.1787, -1.8900, 0.0734, 0.0973, 0.3678, 0.6867, 0.1480, 0.6282, 0.4516, -0.3070, -0.9274, -0.0738, -1.9139]], device='cuda:0')
.
Then, to infer the label of this speaker, I prepare an unk.json file:
{"audio_filepath": "/data/an4/wav/fcaw/an406-fcaw-b.wav", "duration": 2.2, "label": "fcaw"}
and then in the script I did:
test_manifest = "/data/an4/wav/fcaw/unk.json"
speaker_model.setup_test_data(
    test_data_layer_params={
        'sample_rate': 16000,
        'manifest_filepath': test_manifest,
        'labels': labels_map,
        'batch_size': 32,
        'trim_silence': False,
        'shuffle': False,
    }
)
Then
# assumes can_gpu, labels_map, speaker_model, autocast, timer, etc. are defined earlier, as in the inference script
all_logits, all_labels = [], []

stime = time.time()
for test_batch in tqdm(speaker_model.test_dataloader()):
    if can_gpu:
        test_batch = [x.cuda() for x in test_batch]
    with autocast():
        audio_signal, audio_signal_len, labels, _ = test_batch
        logits, _ = speaker_model.forward(input_signal=audio_signal, input_signal_length=audio_signal_len)
        logging.info(logits.cpu().detach().numpy())  # debug: print the raw logits
        logging.info(type(all_logits))               # debug
    if can_gpu:
        all_logits.extend(logits.cpu().detach().numpy())
        all_labels.extend(labels.cpu().numpy())
    else:
        all_logits.extend(logits.detach().numpy())
        all_labels.extend(labels.detach().numpy())
etime = time.time()
timer(stime, etime)
And the following error is raised:
362
363 if not self.is_regression_task:
--> 364 t = torch.tensor(self.label2id[sample.label]).long()
365 else:
366 t = torch.tensor(sample.label).float()
KeyError: 'fcaw'
Can you show me where these lines:

stime = time.time()
for test_batch in tqdm(speaker_model.test_dataloader()):
    if can_gpu:
        test_batch = [x.cuda() for x in test_batch]
    with autocast():
        audio_signal, audio_signal_len, labels, _ = test_batch
        logits, _ = speaker_model.forward(input_signal=audio_signal, input_signal_length=audio_signal_len)
        logging.info(logits.cpu().detach().numpy())
        logging.info(type(all_logits))
    if can_gpu:
        all_logits.extend(logits.cpu().detach().numpy())
        all_labels.extend(labels.cpu().numpy())
    else:
        all_logits.extend(logits.detach().numpy())
        all_labels.extend(labels.detach().numpy())
etime = time.time()
timer(stime, etime)

and also:

if not self.is_regression_task:
    t = torch.tensor(self.label2id[sample.label]).long()
else:
    t = torch.tensor(sample.label).float()

are part of https://github.com/NVIDIA/NeMo/blob/main/examples/speaker_recognition/speaker_reco_infer.py?
@nithinraok These lines I took from your inference script and added to my notebook. If I test a user whose label was seen in training, the inference works; but if the label was not seen, it generates the error above, even though the model is still able to generate the embedding vector. Some lines were added just for debugging, but the main lines are from your script.
Indeed, in https://github.com/NVIDIA/NeMo/blob/main/examples/speaker_recognition/speaker_reco_infer.py I think you need to correct line 21: --test_manifest=/path/to/train/manifest/file' should be --test_manifest=/path/to/test/manifest/file'.
Isn't that obvious? How do you expect to get a label for a speaker which is not part of the training labels but only part of the inference labels? Also, I suggest you go through the script very carefully to see how training and inference labels are mapped. Yes, it's the path to the test manifest file (the argument name).
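Concretely, the inference script builds its label-to-index map from the training manifest, so every label appearing in the test manifest must be one of the training labels; a simplified sketch of that logic (the file names here are illustrative, not the script's exact variables):

import json

def read_labels(manifest_path):
    # one JSON object per line, each with a "label" key
    with open(manifest_path) as f:
        return [json.loads(line)["label"] for line in f]

train_labels = sorted(set(read_labels("train_manifest.json")))
label2id = {label: idx for idx, label in enumerate(train_labels)}

for label in read_labels("test_manifest.json"):
    # a speaker that never appeared in training (e.g. 'fcaw' here) has no id,
    # which is exactly the KeyError you are seeing
    target = label2id[label]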
You may have to understand speaker-related tasks in more depth, i.e., the difference between speaker identification, verification, and diarization. Hopefully this helps: https://github.com/NVIDIA/NeMo/issues/1710#issuecomment-776261922 . You may also need to understand from the literature how these characteristic speaker embeddings are useful in differentiating one unseen speaker from another.
@nithinraok Yes, it is obvious, but I was confused because the model is able to generate an embedding for a completely new .wav file from a user not in the training set. I assumed that if a user's label is not in the training set, the model should not be able to generate an embedding.
@nithinraok The link spkr_get_emb.py in #1710 (comment) is not working. Can you please verify? I appreciate your help.
Oh, it's the speaker embedding extraction script.
@nithinraok Using the predefined speaker recognition model:
model = PretrainedModelInfo(
    pretrained_model_name="SpeakerNet_recognition",
    location="https://api.ngc.nvidia.com/v2/models/nvidia/nemospeechmodels/versions/1.0.0a5/files/SpeakerNet_recognition.nemo",
    description="SpeakerNet_recognition model trained end-to-end for speaker recognition purposes with cross_entropy loss. It was trained on voxceleb 1, voxceleb 2 dev datasets and augmented with musan music and noise. Speaker Recognition model achieves 2.65% EER on voxceleb-O cleaned trial file",
)
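(For reference, this checkpoint can be pulled by its pretrained name rather than the URL; a sketch, assuming network access to NGC:)

from nemo.collections.asr.models import EncDecSpeakerLabelModel

# Sketch: download the checkpoint listed above by its pretrained name.
speaker_model = EncDecSpeakerLabelModel.from_pretrained(model_name="SpeakerNet_recognition")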
In the description they mention voxceleb-O. Please, where can I find this file? It corresponds to which file under "Dataset split for identification" in https://www.robots.ox.ac.uk/~vgg/data/voxceleb/vox1.html ?
Indeed, in speaker_reco_infer.py the following files are also required: train_manifest and a test file. So please correct me: is the test file the equivalent of voxceleb-O in JSON format? Where can I find all these files, please?
VoxCeleb-O is a trial file that is used for verification purposes. Link. It is not a manifest file.
Manifest files are in JSON format and contain one row per train, validation, or test sample, with the keys audio_filepath, offset, duration, and label in each row. Look at the speaker recognition training tutorial for how we create a sample training manifest file.
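For example, a manifest is just one JSON object per line; a minimal sketch of writing one in Python (the filepath, duration, and label below are copied from your unk.json example and are placeholders):

import json

# Minimal sketch: one JSON object per line with audio_filepath, offset, duration, label.
rows = [
    {"audio_filepath": "/data/an4/wav/fcaw/an406-fcaw-b.wav", "offset": 0.0, "duration": 2.2, "label": "fcaw"},
]
with open("train_manifest.json", "w") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")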
Note: Speaker Recognition, Verification, and Diarization are three different tasks; you may have to understand them keenly before you refer to any of the scripts, else it will only add to your confusion.
@nithinraok Thanks for your hints. I have a question regarding the values generated by the script speaker_reco_infer.py. I get, for example on an4, the following results (I will show the first speaker's results):
all_logits[0] = [-1.995 1.678 -2.535 1.739 -1.728 -1.268 -0.727 -3.385 -2.348
-3.021 0.5293 -0.4573 0.5137 -3.047 -4.75 -1.847 2.922 -0.989
-1.507 -0.9224 -2.545 6.957 0.9985 -2.035 -3.234 -2.848 -1.971
-3.246 2.057 -1.991 -6.27 9.22 0.4045 -2.703 -1.577 4.066
7.215 -4.07 12.98 -3.02 1.456 9.44 6.49 0.272 2.07
1.625 -3.531 -2.846 -4.914 -0.536 -3.496 -1.095 -2.719 -0.5825
5.535 -0.1753 3.658 4.234 4.543 -0.8384 -2.705 -2.012 -6.56
10.5 -2.021 -2.48 1.725 5.69 3.672 -6.855 -3.887 1.761
6.926 -4.848 ]
where each value represents "how much" the given speaker belongs to the corresponding label (i.e., one of the 74 existing speakers/labels). In this script you select the top-1 as the label to infer. If instead I apply min-max scaling followed by L1 normalization to the logits:
from sklearn.preprocessing import MinMaxScaler, normalize

scaler = MinMaxScaler()
all_logits = scaler.fit_transform(all_logits)
all_logits = normalize(all_logits, norm='l1', axis=1, copy=True)
the results change completely. We get a higher accuracy (for example on an4).
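(For reference, a plain softmax over a row of the raw logits already gives probability-like per-speaker scores, and top-1 is the argmax; a small sketch, assuming all_logits is the list collected in the inference loop above:)

import numpy as np
import torch

# Sketch: interpret one row of raw logits as per-speaker probabilities.
logits = torch.tensor(np.asarray(all_logits[0]), dtype=torch.float32)
probs = torch.softmax(logits, dim=-1)   # sums to 1 over the training speakers
top1 = int(torch.argmax(probs))
print(top1, probs[top1].item())         # map the index back to a speaker name via the script's label mapping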
It is not a bug, but rather a clarification I need in order to understand the SpeakerNet model. I need to do speaker verification using SpeakerNet, so I follow the tutorial https://github.com/NVIDIA/NeMo/blob/main/tutorials/speaker_recognition/Speaker_Recognition_Verification.ipynb .
In this tutorial you suggest changing only
config.model.decoder.angular = True
when we want to do speaker verification. So I need to confirm whether any other parameter needs to be changed. My second concern: why in https://github.com/NVIDIA/NeMo/blob/v1.0.2/examples/speaker_recognition/speaker_reco.py do you set the number of classes to 2? Is it just an example, or must it always be like that?