Hi!
- In embedding_manifest.json you add two .wav paths. Your model spkr.nemo is trained on the an4 dataset only, so the paths in embedding_manifest.json are not part of the training set (train.json); correct me if I am wrong.
Yes, the two items in embedding_manifest.json (test manifest) are not in the training data.
- As a result, you get 0.96 between your own .wav files in embedding_manifest.json; how can we interpret that result? I mean, we can say that both belong to the same user (i.e. speaker), but how can we see which speaker?
The number 0.96 stands for the cosine similarity between the two speaker embeddings.
In order to solve the problem of identifying a speaker (a multi-class classification problem: who is talking), please refer to [speaker identification](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/speaker_recognition/intro.html); the examples illustrated in this repository serve only the purpose of verification (binary: verified or not).
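For intuition, here is a minimal sketch of how a score like 0.96 arises from two speaker embeddings; the vectors below are made-up placeholders, not real NeMo embeddings:

```python
import numpy as np

# hypothetical placeholders for the two embeddings extracted from the .wav files
emb_a = np.array([0.1, -0.4, 0.8, 0.2])
emb_b = np.array([0.1, -0.5, 0.7, 0.3])

# cosine similarity: dot product of the L2-normalised vectors, always within [-1, 1]
score = np.dot(emb_a, emb_b) / (np.linalg.norm(emb_a) * np.linalg.norm(emb_b))
print(f"verification score: {score:.2f}")
```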
But how can we see which speaker?
For speaker identification:
- configuration
- fine-tune a [SpeakerNet](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/speaker_recognition/models.html#speakernet) model on one's own dataset (data of the speakers intended to be identified in the task)
- inference on test data (manifest), as sketched below
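A rough sketch of the inference step, assuming a fine-tuned checkpoint named spkr.nemo and input tensors prepared as in speaker_reco_infer.py (not a complete script; details vary by NeMo version):

```python
import torch
from nemo.collections.asr.models import EncDecSpeakerLabelModel

# load the fine-tuned identification model (the checkpoint path is an example)
speaker_model = EncDecSpeakerLabelModel.restore_from(restore_path="spkr.nemo")
speaker_model.eval()

# `audio_signal` / `audio_signal_len` are assumed to be tensors built from one test .wav,
# as in speaker_reco_infer.py; forward returns per-label logits and the embedding
with torch.no_grad():
    logits, _ = speaker_model.forward(input_signal=audio_signal,
                                      input_signal_length=audio_signal_len)

predicted_index = int(logits.argmax(dim=-1))  # index into the model's training label list
```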
I will later create a comprehensive tutorial on speaker identification with NeMo here in this repo.
@JINHXu Thanks for your reply. I think even if the two items in embedding_manifest.json (test manifest) are not in the training data, the labels of these speakers are in the training set.
My other question, please: when we do speaker identification we get some values > 1 or even negative ones. In the script speaker_reco_infer.py you take the argmax. So, how can we limit the values (logits) to be between 0 and 1?
The logits you get in the inference step for speaker identification are not cosine similarities; they stand for the confidence on each label, which is why the argmax is taken.
For example, with labels [l1, l2, l3] and logits [1, 2, 3] standing for the confidence on each corresponding label for one test item, the inferred label would be l3 since it is the argmax.
Cosine similarity is always within [-1, 1], but the logits in speaker identification inference are NOT cosine similarity values.
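As a toy illustration of that argmax step (the labels and logits are just the example values above, not real model output):

```python
import numpy as np

labels = ["l1", "l2", "l3"]
logits = np.array([1.0, 2.0, 3.0])  # per-label confidences, not cosine similarities
print(labels[int(logits.argmax())])  # -> "l3"
```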
@JINHXu Yeah, sure, it is clear that in [1, 2, 3] the max is 3 and the inferred label will be speaker 3, but instead of confidences I am looking to get a probability like [0.2, 0.1, 0.7], meaning the speaker we want to identify has a 20% chance of being speaker 1, 10% of being speaker 2 and 70% of being speaker 3.
you can normalize the list of logits
@JINHXu I don't think so, since some values are negative.
all_logits[0] = [-1.995 1.678 -2.535 1.739 -1.728 -1.268 -0.727 -3.385 -2.348
-3.021 0.5293 -0.4573 0.5137 -3.047 -4.75 -1.847 2.922 -0.989
-1.507 -0.9224 -2.545 6.957 0.9985 -2.035 -3.234 -2.848 -1.971
-3.246 2.057 -1.991 -6.27 9.22 0.4045 -2.703 -1.577 4.066
7.215 -4.07 12.98 -3.02 1.456 9.44 6.49 0.272 2.07
1.625 -3.531 -2.846 -4.914 -0.536 -3.496 -1.095 -2.719 -0.5825
5.535 -0.1753 3.658 4.234 4.543 -0.8384 -2.705 -2.012 -6.56
10.5 -2.021 -2.48 1.725 5.69 3.672 -6.855 -3.887 1.761
6.926 -4.848 ]
It is possible to normalize a list of positive and negative numbers.
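One standard way to handle negative logits (my suggestion, not something speaker_reco_infer.py does) is a softmax, which maps any mix of positive and negative values to probabilities in (0, 1) that sum to 1 and leaves the argmax unchanged:

```python
import numpy as np

def softmax(logits):
    # subtract the max for numerical stability, then exponentiate and normalise
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

row = np.array([-1.995, 1.678, 6.957, 12.98])  # a few values from the dump above
probs = softmax(row)
print(probs, probs.sum())  # values in (0, 1) summing to 1; the argmax index is unchanged
```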
@JINHXu Yes, sure, we can. I tried to normalise each row like this:
logging.info(f" vector all_logits[1] = {self[1]}")
logging.info("\n*****")
logging.info(f"the best value is at index = {self[1].argmax(axis=0)}")
logging.info("\n*****")
logging.info(f"This value is = {self[1][self[1].argmax(axis=0)]}")
print(f"=========================")
min_val = min(self[1])
max_val = max(self[1])
normalized_arr = (self[1] - min_val)/ (max_val - min_val)
normalized_arr = normalized_arr/np.sum(normalized_arr)
logging.info(f"the best value is at index = {normalized_arr.argmax(axis=0)}")
Otherwise, if we don't apply the normalisation above but instead add the following:
from sklearn.preprocessing import MinMaxScaler, normalize

# note: MinMaxScaler scales each column (label) independently, not each row
scaler = MinMaxScaler()
all_logits = scaler.fit_transform(all_logits)
all_logits = normalize(all_logits, norm='l1', axis=1, copy=True)
in the speaker_reco_infer.py script, we get a higher result, for instance:
for test_batch in tqdm(speaker_model.test_dataloader()):
    if can_gpu:
        test_batch = [x.cuda() for x in test_batch]
    with autocast():
        audio_signal, audio_signal_len, labels, _ = test_batch
        logits, _ = speaker_model.forward(input_signal=audio_signal, input_signal_length=audio_signal_len)
        # logits = 1/(1+np.exp(logits.cpu().detach().numpy()))
        all_logits.extend(logits.cpu().detach().numpy())
        all_labels.extend(labels.cpu().numpy())

all_logits, true_labels = np.asarray(all_logits), np.asarray(all_labels)
infer_labels = all_logits.argmax(axis=1)
self = all_logits
out_manifest = os.path.basename(test_manifest).split('.')[0] + '_infer_nemo.json'
out_manifest = os.path.join(os.path.dirname(test_manifest), out_manifest)

# added to normalise the data
from sklearn.preprocessing import MinMaxScaler, normalize
scaler = MinMaxScaler()
all_logits = scaler.fit_transform(all_logits)
all_logits = normalize(all_logits, norm='l1', axis=1, copy=True)
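If the goal is a per-utterance probability distribution over speakers, one alternative to the MinMaxScaler + L1 step (an assumption on my part, not part of speaker_reco_infer.py) is a row-wise softmax; note that MinMaxScaler scales each column independently, whereas softmax operates on each row:

```python
import numpy as np
from scipy.special import softmax

# assumes the all_logits array collected in the loop above: shape (num_test_items, num_speakers)
all_probs = softmax(all_logits, axis=1)   # each row sums to 1, values in (0, 1)
infer_labels = all_probs.argmax(axis=1)   # predictions are unchanged by the softmax
```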
@JINHXu Thanks for your reply. I think even if the two items in embedding_manifest.json (test manifest) are not in the training data, the labels of these speakers are in the training set? Correct me please.
Also, on hi-mia, do you have any estimate of the best configuration to train using SpeakerNet?
I think even if the two items in embedding_manifest.json (test manifest) are not in the training data, the labels of these speakers are in the training set? Correct me please.
Hmm... actually I think the labels in the training data are the same as the labels in the training set. If by training set you mean the training manifest, then since manifests are created from the data, the labels are the same set.
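For reference, each line of a NeMo speaker manifest (train.json, embedding_manifest.json, ...) carries the audio path, duration and speaker label, so the label set comes directly from the data the manifest was built on; the path, duration and label below are made up:

```python
import json

# hypothetical manifest entry; one JSON object per line in the manifest file
entry = {"audio_filepath": "/data/an4/wav/an4_clstk/fash/cen1-fash-b.wav",
         "duration": 1.0,
         "label": "fash"}
print(json.dumps(entry))
```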
Also, on hi-mia, do you have any estimate of the best configuration to train using SpeakerNet?
I have not yet tried; I have been doing experiments only with the default configuration. May I ask if you are training a speaker identification model only on short wake-up words such as "hi, mia"?
I have a question, if you could explain it please: