Walleclipse / Deep_Speaker-speaker_recognition_system

Keras implementation of "Deep Speaker: an End-to-End Neural Speaker Embedding System" (speaker recognition)

How to do inference with the pre-trained model #30

Closed. tuanad121 closed this issue 5 years ago

tuanad121 commented 5 years ago

Hi there, I'm looking for how to embed one speaker's data with the pre-trained model. In the training process, the input data has anchor, positive, and negative speakers. Each speaker has 32 sentences with 160 frames per sentence. I wonder how we use the model to embed one speaker. Do we need to prepare anchor, positive, and negative speakers as in training? Thanks for spending your time on my question. ^^

Walleclipse commented 5 years ago

In inference, you do not need to prepare anchor, positive, and negative samples. The general inference process is as follows:
1) Prepare data: get the audio and extract MFCC features, just as in the training preprocessing. Please see the function extract_features in pre_process.py for details.
2) Predict with the model: embedding = model.predict(input). You will get a 512-dimensional embedding vector.
3) Use this embedding vector for verification or classification tasks.

Suppose there are two utterances that need to be verified as coming from the same person. We first get the embedding vector for each utterance: emb1 = model.predict(utt1); emb2 = model.predict(utt2). If the similarity between the two embeddings is greater than a certain threshold (e.g. 0.5), we decide that the two utterances came from the same person, i.e. utt1 and utt2 have the same speaker if sim(emb1, emb2) > threshold.
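
For example, here is a minimal sketch of this verification step, assuming model is the loaded Keras model and feat1 / feat2 are the preprocessed feature arrays from step 1 (with a leading batch dimension); cosine similarity and the 0.5 threshold are just the example values from above:

import numpy as np

def cosine_similarity(emb1, emb2):
    # Normalize both embeddings so the dot product is a cosine similarity,
    # whether or not the model output is already L2-normalized.
    emb1 = emb1 / np.linalg.norm(emb1)
    emb2 = emb2 / np.linalg.norm(emb2)
    return float(np.dot(emb1, emb2))

emb1 = model.predict(feat1)[0]  # model.predict returns a batch; take row 0
emb2 = model.predict(feat2)[0]

threshold = 0.5  # example threshold
if cosine_similarity(emb1, emb2) > threshold:
    print('same speaker')
else:
    print('different speakers')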

tuanad121 commented 5 years ago

Thanks for your quick response. Much appreciated. I will try as you suggested. ^^

tuanad121 commented 5 years ago

Cool, I got it working. Here's the script:

import constants as c
from pre_process import extract_features
from models import convolutional_model
from utils import get_last_checkpoint_if_any

from scipy.io.wavfile import read
import numpy as np

def clipped_audio(x, num_frames=c.NUM_FRAMES):
    # Randomly clip a window of num_frames feature frames, as in training;
    # inputs shorter than num_frames are returned unchanged.
    if x.shape[0] > num_frames + 20:
        bias = np.random.randint(20, x.shape[0] - num_frames)
        clipped_x = x[bias: num_frames + bias]
    elif x.shape[0] > num_frames:
        bias = np.random.randint(0, x.shape[0] - num_frames)
        clipped_x = x[bias: num_frames + bias]
    else:
        clipped_x = x

    return clipped_x

if __name__ == '__main__':
    # Build the model and load the weights of the latest checkpoint.
    model = convolutional_model()
    last_checkpoint = get_last_checkpoint_if_any(c.CHECKPOINT_FOLDER)
    print(last_checkpoint)

    model.load_weights(last_checkpoint)
    # Read the wav, scale int16 samples to [-1, 1], extract features,
    # clip to a fixed number of frames, and add a batch dimension.
    _, utt1 = read('demo/87-121553-0002.wav')
    utt1 = utt1 / (2**15 - 1)
    feat1 = extract_features(utt1)
    feat1 = clipped_audio(feat1)
    feat1 = feat1[np.newaxis, ...]

    # Same preprocessing for the second utterance.
    _, utt2 = read('demo/103-1240-0002.wav')
    utt2 = utt2 / (2**15 - 1)
    feat2 = extract_features(utt2)
    feat2 = clipped_audio(feat2)
    feat2 = feat2[np.newaxis, ...]
    print(feat1.shape, feat2.shape)
    # Embed both utterances with the pre-trained model.
    emb1 = model.predict(feat1)
    emb2 = model.predict(feat2)

    # similarity: sum of the element-wise product, i.e. the dot product
    # of the two embedding vectors
    mul = np.multiply(emb1, emb2)
    s = np.sum(mul, axis=1)
    print(s)

cconst04 commented 4 years ago

I've tried the above code, but each time I get a different similarity value. How is this possible?

Walleclipse commented 3 years ago

> I've tried the above code, but each time I get a different similarity value. How is this possible?

Hi, I think the main reason is the random clipping in clipped_audio in the code above. If the audio is too long, it is randomly clipped before being fed into the model.
You can modify clipped_audio to use a deterministic clip (take a fixed window from the middle of a long audio, or clip several different parts of the long audio and average the results); see the sketch below.
Or fix the random seed when running the code, as described in reproducible-results-neural-networks-keras.
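
As a concrete example, here is a minimal sketch of the deterministic option: a drop-in replacement for clipped_audio that always takes the middle window (the function name and the behavior for short inputs are kept from the script above):

import constants as c

def clipped_audio(x, num_frames=c.NUM_FRAMES):
    # Deterministically clip the middle num_frames frames of the features,
    # so repeated runs on the same audio give the same embedding.
    if x.shape[0] > num_frames:
        bias = (x.shape[0] - num_frames) // 2
        return x[bias: bias + num_frames]
    return x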

LBShinChan commented 2 years ago

That's cool! Thank you very much.