NVIDIA / NeMo

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)
https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html
Apache License 2.0

config.model.decoder.angular = True? #2582

Closed ali2iptoki closed 3 years ago

ali2iptoki commented 3 years ago

This is not so much a bug as a clarification I need in order to understand the SpeakerNet model. I need to do speaker verification using SpeakerNet, so I am following the tutorial https://github.com/NVIDIA/NeMo/blob/main/tutorials/speaker_recognition/Speaker_Recognition_Verification.ipynb.

In this tutorial you suggest changing only config.model.decoder.angular = True when we want to do speaker verification, so I need to confirm that no other parameters need to be changed.

My second concern is why, in https://github.com/NVIDIA/NeMo/blob/v1.0.2/examples/speaker_recognition/speaker_reco.py, you set the number of classes to 2. Is it just an example, or must it always be like that?

nithinraok commented 3 years ago

angular=True changes the loss from softmax cross-entropy to angular margin softmax loss, which is the essential difference between training identification and verification models. See the paper for more info.

https://github.com/NVIDIA/NeMo/blob/v1.0.2/examples/speaker_recognition/speaker_reco.py is an example script for understanding the parameters and how to run training. Set the number of classes based on the number of classes in your dataset.
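
For illustration, a minimal sketch of these two settings, following the tutorial's config object (the field path follows the tutorial; the value 10 is only a placeholder for your dataset's speaker count):

# Sketch only: adjust the tutorial's config before training a verification model.
# The number of classes is a placeholder for the speakers in your training manifest.
config.model.decoder.num_classes = 10   # size of the classification head
config.model.decoder.angular = True     # angular margin softmax instead of plain cross-entropy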

ali2iptoki commented 3 years ago

@nithinraok Thanks for your quick reply. To summarize:

  1. The cross-entropy loss function is used when doing speaker identification.
  2. The angular softmax loss function is used when doing speaker verification.

My other concern, if you can give any clarification please: SpeakerNet is able to generate an embedding even for a new user not seen during training:

# Restore the fine-tuned model that was saved under this name (class as in the tutorial)
model_name = "path to/SpeakerNet.nemo"
restored_model_tuned = nemo.collections.asr.models.EncDecSpeakerLabelModel.restore_from(model_name)
embs = restored_model_tuned.get_embedding('path to .wav file/ali.wav')
embs

How can I interpret this embedding? More precisely, is there any way to see, for example, the confidence of this vector for each of the classes in the training dataset? I expect to get (say we have 5 classes) a vector of dimension 5, where each value is the "probability" of belonging to the corresponding class.

Can SpeakerNet be used to do text-dependent speaker verification, or is it only text-independent?

nithinraok commented 3 years ago

The idea of speaker verification is to get speaker-characteristic embeddings; they may be from seen or unseen speakers.

If you have trained on known speaker labels and you would like to know the likelihood of an unknown speaker belonging to the known speaker labels, then you may refer to this inference script.

We trained SpeakerNet on text-independent audio samples; we haven't trained a model with text-dependent samples. That is, we haven't fine-tuned it to differentiate samples from the same speaker speaking different sentences (you could try this as post-processing along with ASR).

ali2iptoki commented 3 years ago

@nithinraok If it is unseen, the inference script will not work. I get

   362 
    363         if not self.is_regression_task:
--> 364             t = torch.tensor(self.label2id[sample.label]).long()
    365         else:
    366             t = torch.tensor(sample.label).float()

KeyError: 'ali'.

nithinraok commented 3 years ago

Did you train the model with one of the labels as ali?

ali2iptoki commented 3 years ago

@nithinraok I trained a model on 10 speakers which do not include the label "ali".

nithinraok commented 3 years ago

Are you making sure you are running this script (https://github.com/NVIDIA/NeMo/blob/main/examples/speaker_recognition/speaker_reco_infer.py)? Please read the script carefully.

ali2iptoki commented 3 years ago

@nithinraok Yes, I trained the model using this script on an4, then I fine-tuned this model on an4test_clstk, where I kept one of the speaker folders out of the an4test_clstk dataset. So an4test_clstk now includes only 9 speakers, and the fine-tuned model has 9 classes.

For the speaker I left out (i.e. a directory including several .wav files), I take one .wav file and get its embedding as:

# Restore the fine-tuned model and extract an embedding for the held-out speaker
model_name = "path to the tuned model/SpeakerNet.nemo"
restored_model_tuned = nemo.collections.asr.models.EncDecSpeakerLabelModel.restore_from(model_name)
embs = restored_model_tuned.get_embedding('/data/an4/wav/fcaw/an406-fcaw-b.wav')
embs

So I get : tensor([[ 0.0628, 0.8158, 1.0182, -1.5353, -0.3223, -1.0309, -0.4724, -0.3431, -1.5693, 0.1261, 0.0043, 0.8729, -0.3124, -1.2042, 0.5499, -0.7432, 1.3909, 0.9304, -0.3645, -0.7780, -0.4095, 1.2481, -1.0341, -0.1919, -0.1197, 0.1906, 1.0383, -0.3214, -0.9363, 0.7387, 1.0696, 1.5287, -1.3526, -0.8538, 0.3797, 0.8821, -0.7222, -0.3843, 0.7459, 0.9129, -1.3171, 2.0432, -0.4796, 0.2556, -0.9999, -0.5811, -1.0879, 0.1496, 0.4270, 0.5138, 0.2044, -0.0226, -1.3580, 0.5890, -1.8492, -0.1540, -0.1218, -1.1495, -0.9455, -1.2832, 0.2307, 1.4360, -0.6681, -0.5309, 1.0901, -0.5038, -0.9066, 0.3398, 1.2139, 0.0351, -0.3148, 1.4173, -0.0249, 0.0476, 0.5739, 0.7722, -0.2927, 1.8214, 1.3135, -0.2221, -0.8521, 0.9746, -0.8278, -1.2059, -2.3147, 0.7215, 0.4020, 0.3031, -0.2162, -0.3377, -1.3632, -1.6063, 0.2243, 0.3508, -0.7327, -0.3402, -0.0525, 1.1609, 1.9040, 0.0563, 1.5826, 0.8405, 0.7513, 0.8623, 1.4027, 1.1775, -1.0972, 0.9596, 1.9852, 0.1637, -0.8007, -0.6635, 1.8842, 0.6261, -0.1851, -0.2628, 0.4175, 0.6944, -1.4411, 0.3121, -0.5813, 0.4987, -0.7594, -0.6640, -0.1867, 0.3791, 0.5758, -0.2246, 0.5192, -0.4775, 0.3416, -0.7758, 0.7634, -0.0862, 0.4723, -0.5569, -0.0256, 0.0503, -1.0169, -0.1003, 0.5141, 2.1522, -0.3198, 0.4960, 0.1381, 0.6249, -0.3770, 0.0555, 0.4514, 0.7407, 0.5322, 0.6873, -0.1153, 0.9056, -0.6777, 0.3232, 1.3341, -0.1190, -0.9058, 0.2725, -0.2453, 0.4950, -0.2520, -0.1055, 0.9721, -0.0037, 0.8440, 0.5853, -1.0333, -0.2512, -0.4601, 0.5693, -2.5064, -0.0537, -0.8892, 1.8675, -2.1216, 0.3092, 0.6009, -2.4113, -0.5679, 1.8963, 0.1143, -0.9205, 0.3332, -0.9309, -0.4478, 0.2629, -1.2584, 0.0832, 0.0703, -0.9151, -0.3628, -0.3248, 1.0698, 1.5338, -2.2140, 0.5097, -1.4500, -0.5145, -0.2576, 1.3373, 0.2425, 0.6314, -0.2355, -0.5822, -0.3950, -0.4419, -0.6131, 0.2085, -0.7185, 0.4203, 0.3511, 0.2807, 1.3838, -0.0275, -0.6528, 1.3631, 0.0490, -0.7483, -0.1411, 1.1782, 0.8723, 0.0349, -1.2281, -1.1216, -1.2506, 0.0251, 0.2119, -0.8995, 1.8134, -1.9559, -0.1287, -1.5117, -0.2602, 0.7671, 0.5615, -0.6297, -1.3117, -0.2834, 0.3295, -1.1493, -0.7166, -0.5089, 0.1731, -0.6314, 1.2046, -0.4664, -2.4469, 0.8355, 1.3662, -1.9601, -0.0029, 2.4909, -1.0977, 1.6115, -1.2090, 0.1100, 0.5226, -0.9079, -0.7301, -0.8700, -0.4344, 0.2953, 0.5744, 0.0756, 0.5443, -1.1131, 0.2120, 0.1031, -0.0898, -1.4411, -1.0233, -1.1302, 0.2796, 0.1526, -0.5348, 1.0946, 0.0495, 1.4064, 0.2829, -0.4838, 1.5434, -0.1496, 0.2943, 0.8958, 0.1034, 1.0696, 2.0203, -0.2404, -0.1914, -0.1386, 0.2784, 0.2057, -0.5085, 0.0162, 0.4139, -0.9091, -1.0556, -0.2362, -1.5062, 1.0594, -1.6849, -0.1765, -1.3083, -1.0235, -0.0536, 0.9683, 0.2545, 0.6179, -0.4821, 0.4345, -0.4641, 0.3064, 0.5140, 0.8807, -0.6108, 0.5328, 1.2156, -1.2854, -0.4222, 1.8813, 0.8873, -0.3094, 0.4829, -0.0651, -1.8508, 1.6000, -0.0474, 1.5053, 0.0736, 0.4032, -0.2092, 0.3491, -0.1942, 0.3481, 1.5789, 1.4165, -1.7132, 1.1152, 1.6150, 2.1679, 0.8146, 0.0437, -0.9857, 0.7709, 0.1608, -1.1367, -0.5146, 1.4041, -2.2360, 0.5758, -0.7333, -0.1009, 0.9430, 0.6600, -0.4132, -0.2258, 0.9846, 0.3689, -0.8209, -0.2413, -0.4222, 0.5732, -0.0594, 0.3323, -0.3429, -1.1169, 1.7542, 0.5960, 0.1414, -0.3205, -0.9614, 1.0233, 0.2286, -0.9375, -0.8822, 1.5812, -0.0318, -0.2842, 1.4292, 0.2807, 0.3899, 0.6885, -0.3253, 0.1750, 0.9151, -0.7703, -0.6582, 0.3794, 1.1303, -1.5559, -0.4148, 0.1063, -0.9048, -0.1196, -0.0712, 1.5589, 1.8074, 0.3628, 1.0805, -1.2641, -0.5647, -2.2094, -0.0438, -1.2688, -0.2659, 1.0689, -0.1788, 0.5957, -0.0506, 1.0473, 0.0565, -0.2183, -0.7412, 
-0.1839, -0.9485, -0.7508, -1.6509, 2.3409, 0.8754, 0.6159, 1.5166, 1.0600, 1.0329, 0.8794, 1.4565, 0.2864, 1.1625, -0.9837, -1.7411, 1.6233, -1.9482, -0.3844, -0.2919, -0.3689, 0.6861, -0.4511, -0.3608, 0.6986, 1.6351, -0.8251, 0.1964, 0.3781, -0.1032, -0.5941, 0.9788, 0.5949, 0.6924, -0.2773, -1.4484, -0.7374, -0.5549, -1.0321, 0.3914, -0.2236, -0.4248, 1.4976, 0.4575, 0.5289, 1.0058, -0.2888, -0.8227, 0.8932, 0.8596, 0.1890, -1.5829, 0.9727, 0.2821, 1.3765, 1.7369, -0.2524, -2.1842, -0.4125, 0.1442, 1.6489, 0.3097, 1.2120, 0.4387, 0.4237, -0.1217, -0.7276, 0.9319, -1.2512, 0.2688, -0.4336, -0.8974, -1.5635, 0.3878, -0.7559, 0.9945, 2.7519, -1.0840, 1.0679, -0.0068, 1.7468, -0.9091, 0.2995, 0.4645, -1.1787, -1.8900, 0.0734, 0.0973, 0.3678, 0.6867, 0.1480, 0.6282, 0.4516, -0.3070, -0.9274, -0.0738, -1.9139]], device='cuda:0').

Then, to infer the label of this speaker, I prepare an unk.json file: {"audio_filepath": "/data/an4/wav/fcaw/an406-fcaw-b.wav", "duration": 2.2, "label": "fcaw"}

and then in the script I did:

test_manifest = "/data/an4/wav/fcaw/unk.json"
speaker_model.setup_test_data(
        test_data_layer_params={
            'sample_rate': 16000,
            'manifest_filepath': test_manifest,
            'labels': labels_map,
            'batch_size': 32,
            'trim_silence': False,
            'shuffle': False,
        }
    )

Then

# all_logits / all_labels must be initialized before the loop; logging, can_gpu,
# autocast and timer come from the surrounding script/notebook
all_logits, all_labels = [], []

stime = time.time()
for test_batch in tqdm(speaker_model.test_dataloader()):

  if can_gpu:
    test_batch = [x.cuda() for x in test_batch]

  with autocast():
    audio_signal, audio_signal_len, labels, _ = test_batch
    logits, _ = speaker_model.forward(input_signal=audio_signal, input_signal_length=audio_signal_len)
    logging.info(logits.cpu().detach().numpy())
    logging.info(type(all_logits))
    if can_gpu:
      all_logits.extend(logits.cpu().detach().numpy())
      all_labels.extend(labels.cpu().numpy())
    else:
      all_logits.extend(logits.detach().numpy())
      all_labels.extend(labels.detach().numpy())
etime = time.time()
timer(stime, etime)

And the error is raised:

 362 
    363         if not self.is_regression_task:
--> 364             t = torch.tensor(self.label2id[sample.label]).long()
    365         else:
    366             t = torch.tensor(sample.label).float()

KeyError: 'fcaw'

nithinraok commented 3 years ago

Can you show me where these lines

stime = time.time()
for test_batch in tqdm(speaker_model.test_dataloader()):

  if can_gpu:
    test_batch = [x.cuda() for x in test_batch]

  with autocast():
    audio_signal, audio_signal_len, labels, _ = test_batch
    logits, _ = speaker_model.forward(input_signal=audio_signal, input_signal_length=audio_signal_len)
    logging.info(logits.cpu().detach().numpy())
    logging.info(type(all_logits))
    if can_gpu:
      all_logits.extend(logits.cpu().detach().numpy())
      all_labels.extend(labels.cpu().numpy())
    else:
      all_logits.extend(logits.detach().numpy())
      all_labels.extend(labels.detach().numpy())

etime = time.time()
timer(stime, etime)

and also:

    363         if not self.is_regression_task:
--> 364             t = torch.tensor(self.label2id[sample.label]).long()
    365         else:
    366             t = torch.tensor(sample.label).float()

are part of https://github.com/NVIDIA/NeMo/blob/main/examples/speaker_recognition/speaker_reco_infer.py

ali2iptoki commented 3 years ago

@nithinraok These lines I took from your inference script and added them to my notebook. If I test a user whose label was seen in training, the inference works; but if the label was not seen, it generates the error above, while still being able to generate the embedding vector. Some lines were added just for debugging, but the main lines are from your script.

Indeed, in https://github.com/NVIDIA/NeMo/blob/main/examples/speaker_recognition/speaker_reco_infer.py I think you need to correct line 21: --test_manifest=/path/to/train/manifest/file' should be --test_manifest=/path/to/test/manifest/file'

nithinraok commented 3 years ago

Isn't that obvious? How do you expect to get a label for a speaker which is not part of the training labels but only part of the inference labels? Also, I suggest you go through the script very carefully to see how training and inference labels are mapped. Yes, it's the path to the test manifest file (argument name).

You may have to understand speaker-related tasks in more depth, in particular the difference between speaker identification, verification, and diarization. Hopefully this helps: https://github.com/NVIDIA/NeMo/issues/1710#issuecomment-776261922 . You may also need to understand from the literature how these characteristic speaker embeddings are useful in differentiating one unseen speaker from another.
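
For intuition, here is a minimal sketch of how such embeddings are typically compared for verification of unseen speakers (the model path, file names, and decision threshold below are placeholders, not part of the inference script):

# Sketch only: verification by cosine similarity of speaker embeddings.
# The class/import path follows the tutorial; the threshold is a placeholder.
import torch
from nemo.collections.asr.models import EncDecSpeakerLabelModel

speaker_model = EncDecSpeakerLabelModel.restore_from("path to/SpeakerNet.nemo")

emb_a = speaker_model.get_embedding("enrollment_utterance.wav").squeeze()
emb_b = speaker_model.get_embedding("test_utterance.wav").squeeze()

# Cosine similarity close to 1 -> likely the same speaker, even if unseen in training
score = torch.nn.functional.cosine_similarity(emb_a, emb_b, dim=0).item()
same_speaker = score > 0.7   # placeholder threshold; tune on a trial set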

ali2iptoki commented 3 years ago

@nithinraok yes it is obvious, but i was confused that the model are able to generate embedding for a very new .wav file for a user not the training set. So, usually, if the label of a user not in the training set the model shoudl not generate an embedding.

ali2iptoki commented 3 years ago

@nithinraok The link spkr_get_emb.py in #1710 (comment) is not working. Can you please verify? I appreciate your help.

nithinraok commented 3 years ago

Oh, it's the script for extracting speaker embeddings.

ali2iptoki commented 3 years ago

@nithinraok Using the predefined speaker recognition model:

model = PretrainedModelInfo(
            pretrained_model_name="SpeakerNet_recognition",
            location="https://api.ngc.nvidia.com/v2/models/nvidia/nemospeechmodels/versions/1.0.0a5/files/SpeakerNet_recognition.nemo",
            description="SpeakerNet_recognition model trained end-to-end for speaker recognition purposes with cross_entropy loss. It was trained on voxceleb 1, voxceleb 2 dev datasets and augmented with musan music and noise. Speaker Recognition model achieves 2.65% EER on voxceleb-O cleaned trial file",
        )

In the description they mention voxceleb-O. Where can I find this file, please? Which file is it equivalent to in https://www.robots.ox.ac.uk/~vgg/data/voxceleb/vox1.html under "Dataset split for identification"?

Also, in speaker_reco_infer.py the following files are required: a train manifest and a test file. So please correct me: would the test file be the equivalent of voxceleb-O in JSON format? Where can I find all these files, please?

nithinraok commented 3 years ago

VoxCeleb-O is a trial file that is used for verification purposes. Link. It is not a manifest file.

Manifest files are in JSON format, with one row per train, validation, or test sample and the keys audio_filepath, offset, duration, and label in each row. Look at the speaker recognition training tutorial to see how we create a sample training manifest file.
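
For illustration, a manifest might look like this (the file paths, durations, and labels below are made-up placeholders):

{"audio_filepath": "/data/an4/wav/speaker1/utt1.wav", "offset": 0, "duration": 2.5, "label": "speaker1"}
{"audio_filepath": "/data/an4/wav/speaker2/utt7.wav", "offset": 0, "duration": 3.1, "label": "speaker2"}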

Note: Speaker recognition, verification, and diarization are three different tasks; you may have to understand them keenly before you refer to any of the scripts, else it will only add to your confusion.

ali2iptoki commented 3 years ago

@nithinraok Thanks for your hints. I have a question regarding the values generated by the script speaker_reco_infer.py. I get, for example for an4, the following results (I will show the first speaker's results):

 all_logits[0] = [-1.995   1.678  -2.535   1.739  -1.728  -1.268  -0.727  -3.385  -2.348
     -3.021   0.5293 -0.4573  0.5137 -3.047  -4.75   -1.847   2.922  -0.989
     -1.507  -0.9224 -2.545   6.957   0.9985 -2.035  -3.234  -2.848  -1.971
     -3.246   2.057  -1.991  -6.27    9.22    0.4045 -2.703  -1.577   4.066
      7.215  -4.07   12.98   -3.02    1.456   9.44    6.49    0.272   2.07
      1.625  -3.531  -2.846  -4.914  -0.536  -3.496  -1.095  -2.719  -0.5825
      5.535  -0.1753  3.658   4.234   4.543  -0.8384 -2.705  -2.012  -6.56
     10.5    -2.021  -2.48    1.725   5.69    3.672  -6.855  -3.887   1.761
      6.926  -4.848 ]

where each value represents "how much" the given speaker belongs to the corresponding label (i.e. a speaker from the 74 existing speakers or labels). In this script you select the top-1 as the label to infer.

  1. What does each value represent?
  2. How can I transform these values to be between 0 and 1, where each value represents the probability of being in the corresponding class/label? (See the softmax sketch below.)
  3. If we add the following code in speaker_reco_infer.py:

     from sklearn.preprocessing import MinMaxScaler, normalize
     scaler = MinMaxScaler()
     all_logits = scaler.fit_transform(all_logits)
     all_logits = normalize(all_logits, norm='l1', axis=1, copy=True)

The results completely change; we get a higher accuracy (for example, on an4).
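
As a sketch for question 2 above (not part of speaker_reco_infer.py): the raw values are unnormalized class logits, and a softmax maps each row to per-class probabilities that sum to 1, after which the top-1 index can be mapped back to a speaker label. Here all_logits is assumed to come from the inference loop above, and labels_map is assumed to map a class index to its speaker label:

# Sketch only: convert collected logits to per-class probabilities and predicted labels.
import numpy as np

logits = np.asarray(all_logits)                           # shape: (num_samples, num_classes)
exp = np.exp(logits - logits.max(axis=1, keepdims=True))  # numerically stable softmax
probs = exp / exp.sum(axis=1, keepdims=True)              # each row sums to 1

pred_ids = probs.argmax(axis=1)                           # top-1 class index per sample
pred_labels = [labels_map[i] for i in pred_ids]           # map indices back to speaker labels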