astorfi / 3D-convolutional-speaker-recognition

:speaker: Deep Learning & 3D Convolutional Neural Networks for Speaker Verification
Apache License 2.0

speaker identification input WAV file #2

Closed GawainLee closed 6 years ago

GawainLee commented 6 years ago

Hi, I am using your code for speaker identification.

I am new to machine learning. Could you tell me whether I can use this code to do speaker recognition?

I am now able to run your code, and it outputs some graphics, but I do not know how to use those graphics to recognize a speaker.

I am also trying to use my own WAV file for training. However, I get this error:

...3D-convolutional-speaker-recognition-master\code\0-input\create_hdf5\pair_generation.py", line 42, in feed_to_hdf5
    :, 0]
IndexError: too many indices for array
Closing remaining open files:development.hdf5...done

The default feature_mfec.npy file looks like this:

array([[[  1.41430335e+01,   1.38114970e+00,   8.35106419e-02],
        [  1.41430335e+01,   1.36457288e+00,   8.67885390e-03],
        [  1.39772653e+01,   1.11641647e+00,  -2.56141085e-02],

        ...,
        [  9.24432067e+00,   9.21401209e-01,   1.09648406e-01],
        [  9.19465798e+00,   9.45404618e-01,   1.09012088e-01],
        [  9.37358081e+00,   9.68176377e-01,   1.03772331e-01]]])

I converted my WAV file to npy, and the content looks like this:

array([[ 143.,  143.],
       [ 136.,  136.],
       [ 121.,  121.],
       ...,
       [  72.,   72.],
       [  81.,   81.],
       [  90.,   90.]])

My transform function:

import numpy as np
import scipy.io.wavfile as wav

# wav.read returns a (sample_rate, samples) tuple
sample_rate, samples = wav.read('...\\19-198-0000.wav')

print(sample_rate, samples)

# convert the raw samples to float and save them as .npy
result = np.array(samples, dtype=float)
np.save('test_wav_r_values.npy', result)

Do I need a different type of WAV file, or should I use a different function to convert from WAV to npy?

I also found import scipy.io.wavfile as wav in create_development.py. Can I just feed the WAV files directly for training?

Thank you for providing this code.

astorfi commented 6 years ago

Thank you for your interest @GawainLee. The overall process is described in the associated paper. If you are new to the topic, please read the paper first.

Bests

ovninosa commented 6 years ago

Astorfi,

I just read the paper and the repo, thanks for your time and effort.

I see the hdf5 file; it holds the extracted features for all of the speakers, right?

I think the main point of this question is how to do the training and verification with a WAV dataset.

Is there a simple way to do that? Or do I need to extract all the features with third-party software like SIDEKIT?

Thanks, Jave

astorfi commented 6 years ago

@javenosa Thanks for your kind words.

No ... The hdf5 file is just a sample to showcase how to present the data to the network and to let the demo run promptly ... You should generate your own custom data; the details are available in the associated paper.

HDF5 generation is not necessary for the general architecture design and model training. However, the code needs to be modified a little to incorporate the features in case you want to use some other file format! Storing files in formats such as HDF5 or TFRecords is suggested, though, since it makes the whole process much simpler and faster, although the data file generation itself adds a little complexity.
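For reference, here is a minimal h5py sketch of storing extracted features; the dataset names and shapes are illustrative assumptions, not the repo's exact schema:

import h5py
import numpy as np

# Hypothetical example: 10 utterances, each 80 frames x 40 MFEC
# coefficients x 3 channels (static + first/second derivatives).
features = np.random.rand(10, 80, 40, 3).astype(np.float32)
labels = np.arange(10)  # one speaker label per utterance

# write everything into a single HDF5 container
with h5py.File('development.hdf5', 'w') as f:
    f.create_dataset('utterance_train', data=features)
    f.create_dataset('label_train', data=labels)

# reading it back is just as simple
with h5py.File('development.hdf5', 'r') as f:
    print(f['utterance_train'].shape)  # (10, 80, 40, 3)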

The features must be extracted anyway. I used SpeechPy, which is my own package for speech feature extraction, but you can use any package that you are more comfortable with.
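As a rough sketch, feature extraction with SpeechPy could look like the following; the parameter values here are assumptions, so check the SpeechPy documentation for your version:

import numpy as np
import scipy.io.wavfile as wav
import speechpy

# read the waveform; wav.read returns (sample_rate, samples)
fs, signal = wav.read('19-198-0000.wav')

# if the file is stereo, keep a single channel
if signal.ndim > 1:
    signal = signal[:, 0]

# log Mel filterbank energies (MFEC-style features):
# one row per frame, one column per filter
mfec = speechpy.feature.lmfe(signal, fs, frame_length=0.020,
                             frame_stride=0.01, num_filters=40)

# stack the static features with their first and second
# derivatives, giving shape (frames, num_filters, 3)
mfec_cube = speechpy.feature.extract_derivative_feature(mfec)

np.save('feature_mfec.npy', mfec_cube)
print(mfec_cube.shape)

Note how the result is three-dimensional, like the feature_mfec.npy dump above, whereas raw WAV samples are only two-dimensional; that shape mismatch is most likely what triggers the "too many indices for array" error.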