astorfi / 3D-convolutional-speaker-recognition

:speaker: Deep Learning & 3D Convolutional Neural Networks for Speaker Verification
Apache License 2.0
781 stars 275 forks

Can you provide input pipeline? #1

Closed wuqiangch closed 6 years ago

wuqiangch commented 7 years ago

Can you provide input pipeline? Thanks!

astorfi commented 7 years ago

Thank you for your consideration ... Input pipeline procedure is provided in the paper ... The dataset itself is restricted and unfortunately, cannot be shared.

wuqiangch commented 7 years ago

@astorfi Thanks. Maybe you can show an example: given one person's 20 utterances, how do you get the MFEC features and save them in HDF5? Thanks!
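A minimal sketch of what such a saving step could look like (this is an illustration, not the repository's actual pipeline: the file name `speaker_0.h5`, the dataset names, and the random stand-in features are all hypothetical; real features would come from a tool such as SpeechPy):

```python
import numpy as np
import h5py

# Hypothetical example: 20 utterances from one speaker, each already
# converted to an MFEC feature matrix of 80 frames x 40 filterbank energies.
# Random data stands in for real features here.
rng = np.random.default_rng(0)
utterances = [rng.standard_normal((80, 40)) for _ in range(20)]

# Stack into one cube of shape (20, 80, 40): utterances x frames x coefficients.
cube = np.stack(utterances)

# Store the cube and a speaker label in an HDF5 file.
with h5py.File("speaker_0.h5", "w") as f:
    f.create_dataset("utterance", data=cube.astype(np.float32))
    f.create_dataset("label", data=np.array([0]))  # hypothetical speaker id

with h5py.File("speaker_0.h5", "r") as f:
    print(f["utterance"].shape)  # (20, 80, 40)
```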

astorfi commented 7 years ago

@wuqiangch I will try to provide the input pipeline process very soon ... However, this is very database-specific and is not necessarily customizable.

astorfi commented 7 years ago

@wuqiangch The input pipeline has been added.

wuqiangch commented 7 years ago

@astorfi Thanks! But I have some questions:

  1. How do you get the MFEC feature (feature_mfec.npy)? Using the speechpy.lmfe() function?
  2. In your code "feed_to_hdf5", you also use the features of one sound file to generate one 3D sample, not 20 different utterances?
  3. Does it work for one-channel data?

astorfi commented 7 years ago
  1. Yes, I used SpeechPy for feature extraction. A running example is available in the package documentation.
  2. "feed_to_hdf5" is supposed to generate one 3D sample, but I should also make it embed 20 (or an arbitrary number of) different utterances belonging to the same speaker.
  3. In what sense do you mean one-channel data? If you mean 1 speaker utterance instead of 20, for example, I should say yes.
wuqiangch commented 7 years ago

@astorfi It means that the sound file has only one channel. Does it work?

astorfi commented 7 years ago

Yes ... But certainly, the input pipeline must be customized!
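One common customization for multi-channel files is to average the channels down to mono before feature extraction. A minimal sketch (the signal here is random stand-in data, not from the repository):

```python
import numpy as np

# Hypothetical stereo signal: 16000 samples x 2 channels.
stereo = np.random.default_rng(1).standard_normal((16000, 2))

# Average the channels to get a single-channel signal; a file that is
# already mono passes through unchanged.
mono = stereo.mean(axis=1) if stereo.ndim == 2 else stereo
print(mono.shape)  # (16000,)
```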

wuqiangch commented 7 years ago

@astorfi

  1. Did you use the feature cube vector which contains the static, first, and second derivative features?
  2. Is the feature you saved in "feature_mfec.npy" only one utterance's features or many utterances' features?
  3. How do you deal with the different lengths of different utterances? If 20 utterances have different numbers of frames, how do you generate a 3D sample? Randomly choose 80 frames from each utterance to generate a 3D sample of 80x40x20?
astorfi commented 7 years ago

@wuqiangch The default for my experiments is 20 utterances per speaker ... feature_mfec.npy is for the whole sound file uttered by the speaker. From that, the features will be extracted frame-wise ... For your question "If 20 utterances have different frames, how to generate a 3D sample": it is related to the input pipeline. The dimensionality of the input should be correct; the rest is how to connect frames with speaker utterances and form a cube.
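The frame-selection idea raised above (pick a fixed number of frames from each variable-length utterance to get a fixed-size cube) can be sketched as follows; the utterance lengths and the contiguous-chunk strategy are illustrative assumptions, not the repository's exact method:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical utterances of varying length: (frames, 40 coefficients).
utterances = [rng.standard_normal((n, 40)) for n in (120, 95, 210)]

def sample_frames(feat, num_frames=80, rng=rng):
    """Pick a random contiguous chunk of num_frames frames from one utterance."""
    start = rng.integers(0, feat.shape[0] - num_frames + 1)
    return feat[start:start + num_frames]

# Stacking the fixed-length chunks gives a cube: (utterances, 80, 40).
cube = np.stack([sample_frames(u) for u in utterances])
print(cube.shape)  # (3, 80, 40)
```

With 20 utterances per speaker this would yield the 20x80x40 cube discussed in the thread.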

wuqiangch commented 7 years ago

@astorfi Thanks! Did you normalize the data by subtracting the mean and dividing by the std of the whole training data?

astorfi commented 7 years ago

@wuqiangch No ... Normalization did not make major changes in accuracy ... So I left it as it was ... Although there is certainly no harm in doing data standardization!
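For reference, the standardization being discussed is global mean subtraction and division by the standard deviation, with the training-set statistics reused at evaluation time. A minimal sketch on random stand-in data:

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical training set: 1000 cubes of 80 frames x 40 coefficients.
train = rng.normal(loc=5.0, scale=3.0, size=(1000, 80, 40))

# Global statistics over the whole training set; the same mean/std
# must later be applied to evaluation data.
mean, std = train.mean(), train.std()
train_std = (train - mean) / std
```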

wuqiangch commented 7 years ago

@astorfi

  1. Have you tested the MFCC feature? How much worse is it than the MFEC feature?
  2. Can you provide the method of Voice Activity Detection, or where can I find the code for VAD?
  3. Does it extract only 80 frames from one utterance as the feature of the utterance? I think that wastes a large amount of data.
  4. If one speaker has 300 utterances, can it get only 150 3D training samples for this speaker?
astorfi commented 7 years ago

@wuqiangch 1- I do not remember by heart how much better MFEC is, but I am sure it was better due to its locality property. 2- There is a MATLAB package for Voice Activity Detection named VOICEBOX. 3- It is just a sample dataset for running the code. For sure you must create numerous frames from one sound file. 4- You can use overlapping frames to generate more utterances. However, 150 training cubes for a speaker is a lot! It's like 150 faces per subject for image classification.
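The overlapping-frames suggestion can be sketched as a sliding window over one utterance's feature matrix; the 80-frame window and 20-frame hop here are illustrative choices, not values from the repository:

```python
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical long utterance: 300 frames x 40 coefficients.
feat = rng.standard_normal((300, 40))

def overlapping_windows(feat, length=80, stride=20):
    """Slice overlapping fixed-length windows with a fixed hop."""
    return np.stack([feat[s:s + length]
                     for s in range(0, feat.shape[0] - length + 1, stride)])

windows = overlapping_windows(feat)
print(windows.shape)  # (12, 80, 40)
```

Smaller strides produce more (and more correlated) training windows from the same audio.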

wuqiangch commented 7 years ago

@astorfi There is something wrong with my training. Training settings: --num_epochs=1000 --batch_size=128

```
Epoch 1, Minibatch 3 of 3756 , Minibatch Loss= 6.1470, TRAIN ACCURACY= 4.762
Epoch 1, Minibatch 4 of 3756 , Minibatch Loss= 5.6256, TRAIN ACCURACY= 7.143
Epoch 1, Minibatch 5 of 3756 , Minibatch Loss= 5.0327, TRAIN ACCURACY= 21.429
Epoch 1, Minibatch 6 of 3756 , Minibatch Loss= 4.3759, TRAIN ACCURACY= 52.381
Epoch 1, Minibatch 7 of 3756 , Minibatch Loss= 6.6092, TRAIN ACCURACY= 2.381
Epoch 1, Minibatch 8 of 3756 , Minibatch Loss= 6.6603, TRAIN ACCURACY= 0.000
Epoch 1, Minibatch 9 of 3756 , Minibatch Loss= 6.4434, TRAIN ACCURACY= 0.000
Epoch 1, Minibatch 10 of 3756 , Minibatch Loss= 6.0245, TRAIN ACCURACY= 0.000
Epoch 1, Minibatch 11 of 3756 , Minibatch Loss= 5.6892, TRAIN ACCURACY= 0.000
Epoch 1, Minibatch 12 of 3756 , Minibatch Loss= 5.1381, TRAIN ACCURACY= 0.000
Epoch 1, Minibatch 13 of 3756 , Minibatch Loss= 6.2450, TRAIN ACCURACY= 0.000
Epoch 1, Minibatch 14 of 3756 , Minibatch Loss= 7.0499, TRAIN ACCURACY= 0.000
Epoch 1, Minibatch 15 of 3756 , Minibatch Loss= 6.9013, TRAIN ACCURACY= 0.000
Epoch 1, Minibatch 16 of 3756 , Minibatch Loss= 6.5872, TRAIN ACCURACY= 0.000
Epoch 1, Minibatch 17 of 3756 , Minibatch Loss= 6.1402, TRAIN ACCURACY= 0.000
Epoch 1, Minibatch 18 of 3756 , Minibatch Loss= 5.6131, TRAIN ACCURACY= 0.000
Epoch 1, Minibatch 19 of 3756 , Minibatch Loss= 5.5774, TRAIN ACCURACY= 0.000
Epoch 1, Minibatch 20 of 3756 , Minibatch Loss= 7.2767, TRAIN ACCURACY= 0.000
Epoch 1, Minibatch 21 of 3756 , Minibatch Loss= 7.0936, TRAIN ACCURACY= 0.000
Epoch 1, Minibatch 22 of 3756 , Minibatch Loss= 6.7943, TRAIN ACCURACY= 0.000
Epoch 1, Minibatch 23 of 3756 , Minibatch Loss= 6.4498, TRAIN ACCURACY= 0.000
Epoch 1, Minibatch 24 of 3756 , Minibatch Loss= 5.8998, TRAIN ACCURACY= 0.000
Epoch 1, Minibatch 25 of 3756 , Minibatch Loss= 5.2739, TRAIN ACCURACY= 0.000
```

astorfi commented 7 years ago

@wuqiangch I believe it's very unstable ... Let it run for 50 epochs at least ... Then we can investigate it more.

wuqiangch commented 7 years ago

What's wrong?

```
Epoch 79, Minibatch 1028 of 3756 , Minibatch Loss= 8.6112, TRAIN ACCURACY= 0.000
Epoch 79, Minibatch 1029 of 3756 , Minibatch Loss= 7.6331, TRAIN ACCURACY= 0.000
Epoch 79, Minibatch 1030 of 3756 , Minibatch Loss= 6.8144, TRAIN ACCURACY= 0.000
Epoch 79, Minibatch 1031 of 3756 , Minibatch Loss= 5.6934, TRAIN ACCURACY= 0.000
Epoch 79, Minibatch 1032 of 3756 , Minibatch Loss= 12.7493, TRAIN ACCURACY= 0.000
Epoch 79, Minibatch 1033 of 3756 , Minibatch Loss= 16.9590, TRAIN ACCURACY= 0.000
Epoch 79, Minibatch 1034 of 3756 , Minibatch Loss= 16.2862, TRAIN ACCURACY= 0.000
Epoch 79, Minibatch 1035 of 3756 , Minibatch Loss= 15.8414, TRAIN ACCURACY= 0.000
Epoch 79, Minibatch 1036 of 3756 , Minibatch Loss= 15.1954, TRAIN ACCURACY= 0.000
Epoch 79, Minibatch 1037 of 3756 , Minibatch Loss= 14.0057, TRAIN ACCURACY= 0.000
```


This is my process: for one person, I extract features from all of its sound files and stack all the frames together. I only chose 1000 3D training samples (20, 80, 40) for each person.

astorfi commented 7 years ago

You are in epoch 80!! For any training data, it should have gone at least slightly toward convergence by now! I don't think it's related to the code, although you are using my code. I would say check the implementation in detail, e.g., whether you are missing something or modified anything by mistake. Also check the learning rate. It looks weird to my eyes. Please stay in touch; I will do my best to help.

wuqiangch commented 7 years ago

--num_epochs=1000 --batch_size=128. I used the LibriSpeech dataset with about 2500 persons. For each person, I chose 1000 training samples. I have trained it for three days using a GPU, but the accuracy is always zero. Must I change some training parameters?

wuqiangch commented 7 years ago

@astorfi Can you share your pretrained model?

astorfi commented 7 years ago

@wuqiangch Unfortunately not ... Because it's been trained on a non-publicly available dataset.

astorfi commented 7 years ago

@wuqiangch Do you use batch normalization? What about data standardization? What is the initial learning rate?

wuqiangch commented 7 years ago

@astorfi I didn't change anything in the model you provided. I don't use data standardization, only the original MFEC features. I didn't change the initial learning rate (you set it to 10).

astorfi commented 7 years ago

@wuqiangch Actually I lost the thread ... The point is, regardless of my code, and even of the architecture you and I are using, with such huge training data you should be able to get to the point of convergence, at least for training, even if you get 0 percent accuracy for evaluation!

I do not know if you are creating your data in a correct way. Even so, I believe training accuracy should be increasing in any case.

wuqiangch commented 7 years ago

@astorfi I used the features (two persons) you provided. --num_epochs=1000 --batch_size=32

```
Epoch 1000, Minibatch 1 of 25 , Minibatch Loss= 0.0000, TRAIN ACCURACY= 0.000
Epoch 1000, Minibatch 2 of 25 , Minibatch Loss= 0.0000, TRAIN ACCURACY= 0.000
Epoch 1000, Minibatch 3 of 25 , Minibatch Loss= 0.0000, TRAIN ACCURACY= 0.000
Epoch 1000, Minibatch 4 of 25 , Minibatch Loss= 0.0000, TRAIN ACCURACY= 0.000
Epoch 1000, Minibatch 5 of 25 , Minibatch Loss= 0.0000, TRAIN ACCURACY= 0.000
Epoch 1000, Minibatch 6 of 25 , Minibatch Loss= 0.1261, TRAIN ACCURACY= 60.000
Epoch 1000, Minibatch 7 of 25 , Minibatch Loss= 0.2102, TRAIN ACCURACY= 100.000
Epoch 1000, Minibatch 8 of 25 , Minibatch Loss= 0.2102, TRAIN ACCURACY= 100.000
Epoch 1000, Minibatch 9 of 25 , Minibatch Loss= 0.2102, TRAIN ACCURACY= 100.000
Epoch 1000, Minibatch 10 of 25 , Minibatch Loss= 0.2102, TRAIN ACCURACY= 100.000
Epoch 1000, Minibatch 11 of 25 , Minibatch Loss= 0.2102, TRAIN ACCURACY= 100.000
Epoch 1000, Minibatch 12 of 25 , Minibatch Loss= 0.2102, TRAIN ACCURACY= 100.000
Epoch 1000, Minibatch 13 of 25 , Minibatch Loss= 0.2102, TRAIN ACCURACY= 100.000
Epoch 1000, Minibatch 14 of 25 , Minibatch Loss= 0.2102, TRAIN ACCURACY= 100.000
Epoch 1000, Minibatch 15 of 25 , Minibatch Loss= 0.2102, TRAIN ACCURACY= 100.000
Epoch 1000, Minibatch 16 of 25 , Minibatch Loss= 0.2102, TRAIN ACCURACY= 100.000
Epoch 1000, Minibatch 17 of 25 , Minibatch Loss= 0.2102, TRAIN ACCURACY= 100.000
Epoch 1000, Minibatch 18 of 25 , Minibatch Loss= 0.2102, TRAIN ACCURACY= 100.000
Epoch 1000, Minibatch 19 of 25 , Minibatch Loss= 0.2102, TRAIN ACCURACY= 100.000
Epoch 1000, Minibatch 20 of 25 , Minibatch Loss= 0.2102, TRAIN ACCURACY= 100.000
Epoch 1000, Minibatch 21 of 25 , Minibatch Loss= 0.0210, TRAIN ACCURACY= 10.000
Epoch 1000, Minibatch 22 of 25 , Minibatch Loss= 0.0000, TRAIN ACCURACY= 0.000
Epoch 1000, Minibatch 23 of 25 , Minibatch Loss= 0.0000, TRAIN ACCURACY= 0.000
Epoch 1000, Minibatch 24 of 25 , Minibatch Loss= 0.0000, TRAIN ACCURACY= 0.000
Epoch 1000, Minibatch 25 of 25 , Minibatch Loss= 0.0000, TRAIN ACCURACY= 0.000
TESTING after finishing the training on: epoch 1000
Test Accuracy 1000, Mean= 75.0000, std= 43.301
```

astorfi commented 7 years ago

Your std is large ... You may have to do mean subtraction and standardization of the data ... It's not done in the code by default.

wuqiangch commented 7 years ago

@astorfi I did mean subtraction and standardization of the data, but it didn't work either. Could you use the features (two persons) you provided to train the model and show me the result? Thanks!

astorfi commented 7 years ago

@wuqiangch Please run the recently updated run.sh file to see the results. Moreover, regardless of my architecture, you should be able to modify the hyperparameters to get almost perfect results, at least on training. It's just a softmax!

unplugg3d commented 6 years ago

Hello! I could not find the pipeline preparation example. Do you mind telling me where it is?

thank you in advance