Hi!
- In embedding_manifest.json you add two .wav paths. Your model spkr.nemo is trained on the an4 dataset only, so the paths in embedding_manifest.json are not part of the training set (train.json); correct me if I am wrong.
Yes, the two items in embedding_manifest.json (test manifest) are not in the training data.
- As a result, you get 0.96 between your own .wav files in embedding_manifest.json; how can we interpret that result? I mean, we can say that both belong to the same user (i.e. speaker), but how can we see which speaker?
The number 0.96 stands for the cosine similarity between the two speaker embeddings.
In order to solve the problem of identifying a speaker (a multi-class classification problem: who is talking), please refer to [speaker identification](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/speaker_recognition/intro.html); the examples illustrated in this repository serve only the purpose of verification (binary: verified or not).
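For intuition, here is a minimal sketch of how a score like 0.96 arises from two speaker embeddings; the vectors below are made-up placeholders, not real NeMo embeddings:

```python
import numpy as np

# hypothetical placeholders for the two embeddings extracted from the .wav files
emb_a = np.array([0.1, -0.4, 0.8, 0.2])
emb_b = np.array([0.1, -0.5, 0.7, 0.3])

# cosine similarity: dot product of the L2-normalised vectors, always within [-1, 1]
score = np.dot(emb_a, emb_b) / (np.linalg.norm(emb_a) * np.linalg.norm(emb_b))
print(f"verification score: {score:.2f}")
```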
But how can we see which speaker?
For speaker identification:
- configuration
- fine-tune a [SpeakerNet](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/speaker_recognition/models.html#speakernet) model on one's own dataset (data of the speakers intended to be identified in the task)
- inference on test data (manifest), as sketched below
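A rough sketch of the inference step, assuming a fine-tuned checkpoint named spkr.nemo and input tensors prepared as in speaker_reco_infer.py (not a complete script; details vary by NeMo version):

```python
import torch
from nemo.collections.asr.models import EncDecSpeakerLabelModel

# load the fine-tuned identification model (the checkpoint path is an example)
speaker_model = EncDecSpeakerLabelModel.restore_from(restore_path="spkr.nemo")
speaker_model.eval()

# `audio_signal` / `audio_signal_len` are assumed to be tensors built from one test .wav,
# as in speaker_reco_infer.py; forward returns per-label logits and the embedding
with torch.no_grad():
    logits, _ = speaker_model.forward(input_signal=audio_signal,
                                      input_signal_length=audio_signal_len)

predicted_index = int(logits.argmax(dim=-1))  # index into the model's training label list
```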
I will later create a comprehensive tutorial on speaker identification with NeMo here in this repo.
@JINHXu Thanks for your reply. I think even if the two items in embedding_manifest.json (test manifest) are not in the training data, the labels of these speakers are in the training set.
My other question, please: when we do speaker identification we get some values > 1 or even negative ones. In the script speaker_reco_infer.py you take the argmax. So, how can we limit the values (logits) to be between 0 and 1?
The logits you get in the inference step for speaker identification are not cosine similarities; they stand for the confidence on each label, which is why the argmax is taken.
For example, with labels [l1, l2, l3] and logits [1, 2, 3] standing for the confidence on each corresponding label for one test item, the inferred label would be l3 since it is the argmax.
Cosine similarity is always within [-1, 1], but the logits in speaker identification inference are NOT cosine similarity values.
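As a toy illustration of that argmax step (the labels and logits are just the example values above, not real model output):

```python
import numpy as np

labels = ["l1", "l2", "l3"]
logits = np.array([1.0, 2.0, 3.0])  # per-label confidences, not cosine similarities
print(labels[int(logits.argmax())])  # -> "l3"
```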
@JINHXu Yeah, sure, it is clear that in [1, 2, 3] the max is 3 and the inferred label will be speaker 3, but instead of confidences I am looking to get a probability like [0.2, 0.1, 0.7], meaning the speaker we want to identify has a 20% chance of being speaker 1, 10% of being speaker 2 and 70% of being speaker 3.
you can normalize the list of logits
@JINHXu I don't think so, since some values are negative.
all_logits[0] = [-1.995 1.678 -2.535 1.739 -1.728 -1.268 -0.727 -3.385 -2.348
-3.021 0.5293 -0.4573 0.5137 -3.047 -4.75 -1.847 2.922 -0.989
-1.507 -0.9224 -2.545 6.957 0.9985 -2.035 -3.234 -2.848 -1.971
-3.246 2.057 -1.991 -6.27 9.22 0.4045 -2.703 -1.577 4.066
7.215 -4.07 12.98 -3.02 1.456 9.44 6.49 0.272 2.07
1.625 -3.531 -2.846 -4.914 -0.536 -3.496 -1.095 -2.719 -0.5825
5.535 -0.1753 3.658 4.234 4.543 -0.8384 -2.705 -2.012 -6.56
10.5 -2.021 -2.48 1.725 5.69 3.672 -6.855 -3.887 1.761
6.926 -4.848 ]
It is possible to normalize a list of positive and negative numbers.
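One standard way to handle negative logits (my suggestion, not something speaker_reco_infer.py does) is a softmax, which maps any mix of positive and negative values to probabilities in (0, 1) that sum to 1 and leaves the argmax unchanged:

```python
import numpy as np

def softmax(logits):
    # subtract the max for numerical stability, then exponentiate and normalise
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

row = np.array([-1.995, 1.678, 6.957, 12.98])  # a few values from the dump above
probs = softmax(row)
print(probs, probs.sum())  # values in (0, 1) summing to 1; the argmax index is unchanged
```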
@JINHXu Yes, sure, we can. I tried to normalise each row like this:
logging.info(f" vector all_logits[1] = {self[1]}")
logging.info("\n*****")
logging.info(f"the best value is at index = {self[1].argmax(axis=0)}")
logging.info("\n*****")
logging.info(f"This value is = {self[1][self[1].argmax(axis=0)]}")
print(f"=========================")
min_val = min(self[1])
max_val = max(self[1])
normalized_arr = (self[1] - min_val)/ (max_val - min_val)
normalized_arr = normalized_arr/np.sum(normalized_arr)
logging.info(f"the best value is at index = {normalized_arr.argmax(axis=0)}")
Otherwise, if we don't apply the normalisation above but instead add the following:
from sklearn.preprocessing import MinMaxScaler, normalize

# note: MinMaxScaler scales each column (label) independently, not each row
scaler = MinMaxScaler()
all_logits = scaler.fit_transform(all_logits)
all_logits = normalize(all_logits, norm='l1', axis=1, copy=True)
in the speaker_reco_infer.py script, we get a higher result, for instance:
for test_batch in tqdm(speaker_model.test_dataloader()):
    if can_gpu:
        test_batch = [x.cuda() for x in test_batch]
    with autocast():
        audio_signal, audio_signal_len, labels, _ = test_batch
        logits, _ = speaker_model.forward(input_signal=audio_signal, input_signal_length=audio_signal_len)
        # logits = 1/(1+np.exp(logits.cpu().detach().numpy()))
        all_logits.extend(logits.cpu().detach().numpy())
        all_labels.extend(labels.cpu().numpy())

all_logits, true_labels = np.asarray(all_logits), np.asarray(all_labels)
infer_labels = all_logits.argmax(axis=1)
self = all_logits
out_manifest = os.path.basename(test_manifest).split('.')[0] + '_infer_nemo.json'
out_manifest = os.path.join(os.path.dirname(test_manifest), out_manifest)

# added to normalise the data
from sklearn.preprocessing import MinMaxScaler, normalize
scaler = MinMaxScaler()
all_logits = scaler.fit_transform(all_logits)
all_logits = normalize(all_logits, norm='l1', axis=1, copy=True)
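If the goal is a per-utterance probability distribution over speakers, one alternative to the MinMaxScaler + L1 step (an assumption on my part, not part of speaker_reco_infer.py) is a row-wise softmax; note that MinMaxScaler scales each column independently, whereas softmax operates on each row:

```python
import numpy as np
from scipy.special import softmax

# assumes the all_logits array collected in the loop above: shape (num_test_items, num_speakers)
all_probs = softmax(all_logits, axis=1)   # each row sums to 1, values in (0, 1)
infer_labels = all_probs.argmax(axis=1)   # predictions are unchanged by the softmax
```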
@JINHXu Thanks for your reply. I think even if the two items in embedding_manifest.json (test manifest) are not in the training data, the labels of these speakers are in the training set? Correct me please.
Also, on hi-mia, do you have any estimate of the best configuration to train using SpeakerNet?
I think even if the two items in embedding_manifest.json (test manifest) are not in the training data, the labels of these speakers are in the training set? Correct me please.
Hmm... actually I think the labels in the training data are the same as the labels in the training set. If by training set you mean the training manifest, then since manifests are created from the data, the labels are the same set.
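For reference, each line of a NeMo speaker manifest (train.json, embedding_manifest.json, ...) carries the audio path, duration and speaker label, so the label set comes directly from the data the manifest was built on; the path, duration and label below are made up:

```python
import json

# hypothetical manifest entry; one JSON object per line in the manifest file
entry = {"audio_filepath": "/data/an4/wav/an4_clstk/fash/cen1-fash-b.wav",
         "duration": 1.0,
         "label": "fash"}
print(json.dumps(entry))
```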
Also, on hi-mia, do you have any estimate of the best configuration to train using SpeakerNet?
I have not yet tried; I have been doing experiments only with the default configuration. May I ask if you are training a speaker identification model only on short wake-up words such as "hi, mia"?
I have a question, if you could explain it please: