Closed wuqiangch closed 6 years ago
Thank you for your consideration ... Input pipeline procedure is provided in the paper ... The dataset itself is restricted and unfortunately, cannot be shared.
@astorfi Thanks. Maybe you can show an example. Given one person's 20 utterances, how do we get the MFEC features and save them in HDF5? Thanks!
@wuqiangch I will try to provide the input pipeline process very soon ... However, this is very database-related and is not necessarily customizable.
@wuqiangch The input pipeline has been added.
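As a rough illustration of what a "feed_to_hdf5"-style step might look like (the dataset names "feature"/"label", the (2000, 40) shape, and the random data are all assumptions, not the repo's actual layout), here is a minimal sketch of saving frame-wise features into an HDF5 file with h5py and reading them back:

```python
import os
import tempfile

import numpy as np
import h5py

# Hypothetical frame-wise MFEC features for one speaker:
# (num_frames, num_coefficients). Real features would come from speechpy.
features = np.random.randn(2000, 40).astype(np.float32)
labels = np.zeros(len(features), dtype=np.int64)  # assumed speaker id 0

# Write both datasets into one HDF5 file.
path = os.path.join(tempfile.mkdtemp(), "speaker_features.hdf5")
with h5py.File(path, "w") as f:
    f.create_dataset("feature", data=features)
    f.create_dataset("label", data=labels)

# Re-open and verify the round trip.
with h5py.File(path, "r") as f:
    restored = f["feature"][:]

print(restored.shape)  # (2000, 40)
```

The training script can then open the file once and slice minibatches out of the "feature" dataset without loading everything into memory.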
@astorfi Thanks! But I have some questions: 1. How do I get the MFEC features (feature_mfec.npy), using the speechpy.lmfe() function? 2. In your "feed_to_hdf5" code, you also use the features of one sound file to generate one 3D sample, not 20 different utterances? 3. Does it work for single-channel data?
@astorfi It means that the sound file has only one channel. Does it work?
Yes ... But certainly, the input pipeline must be customized!
@astorfi 1.Did you use the feature cube vector which contains the static, first and second derivative features?
@wuqiangch The default for my experiments is 20 utterances per speaker ... feature_mfec.npy is for the whole sound file uttered by the speaker. From that, the features will be extracted frame-wise ... For your question "If 20 utterances have different frames, how to generate a 3D sample": it is related to the input pipeline. The dimensionality of the input should be correct. The rest is how to connect frames with speaker utterances and form a cube.
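The "connect frames with speaker utterances and form a cube" step above can be sketched as follows. This is only an illustrative assumption of the layout: 20 windows of 80 frames with 40 MFEC coefficients each, cut from one speaker's frame-wise feature matrix (here random data stands in for real speechpy output):

```python
import numpy as np

num_utterances = 20   # windows per speaker (default mentioned above)
num_frames = 80       # frames per window
num_coeffs = 40       # MFEC coefficients per frame

# Hypothetical frame-wise features for one speaker's whole recording.
all_frames = np.random.randn(5000, num_coeffs).astype(np.float32)

# Slice 20 non-overlapping 80-frame windows and stack them into one cube.
cube = np.stack(
    [all_frames[i * num_frames:(i + 1) * num_frames]
     for i in range(num_utterances)]
)
print(cube.shape)  # (20, 80, 40)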
@astorfi Thanks! Did you normalize the data by subtracting the mean and dividing by the std of the whole training data?
@wuqiangch No ... Normalization did not make major changes in accuracy ... So I left it as it was ... Although there is certainly no harm in doing data standardization!
@astorfi
@wuqiangch 1- I do not remember by heart how much better MFEC is, but I am sure it was better due to its locality property. 2- There is a MATLAB package for Voice Activity Detection named VOICEBOX. 3- It is just a sample dataset for running the code. For sure you must create numerous frames from one sound file. 4- You can use overlapping frames to generate more utterances. However, 150 training cubes for a speaker is a lot! It's like 150 faces per subject for image classification.
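The overlapping-frames idea in point 4 can be sketched with a sliding window. The window length of 80 frames and the 50% overlap here are illustrative assumptions, and random data stands in for real frame-wise features:

```python
import numpy as np

frames = np.random.randn(1000, 40)  # hypothetical frame-wise features
win, step = 80, 40                  # 50% overlap: step is half the window

# Every start offset that still leaves a full window yields one training window.
windows = [frames[s:s + win] for s in range(0, len(frames) - win + 1, step)]

print(len(windows))  # 24 windows, vs. 12 with non-overlapping windows (step=80)
```

Halving the stride roughly doubles the number of training cubes obtainable from the same recording, at the cost of correlated samples.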
@astorfi There is something wrong with my training. Training settings: --num_epochs=1000 --batch_size=128
Epoch 1, Minibatch 3 of 3756, Minibatch Loss= 6.1470, TRAIN ACCURACY= 4.762
Epoch 1, Minibatch 4 of 3756, Minibatch Loss= 5.6256, TRAIN ACCURACY= 7.143
Epoch 1, Minibatch 5 of 3756, Minibatch Loss= 5.0327, TRAIN ACCURACY= 21.429
Epoch 1, Minibatch 6 of 3756, Minibatch Loss= 4.3759, TRAIN ACCURACY= 52.381
Epoch 1, Minibatch 7 of 3756, Minibatch Loss= 6.6092, TRAIN ACCURACY= 2.381
Epoch 1, Minibatch 8 of 3756, Minibatch Loss= 6.6603, TRAIN ACCURACY= 0.000
Epoch 1, Minibatch 9 of 3756, Minibatch Loss= 6.4434, TRAIN ACCURACY= 0.000
Epoch 1, Minibatch 10 of 3756, Minibatch Loss= 6.0245, TRAIN ACCURACY= 0.000
Epoch 1, Minibatch 11 of 3756, Minibatch Loss= 5.6892, TRAIN ACCURACY= 0.000
Epoch 1, Minibatch 12 of 3756, Minibatch Loss= 5.1381, TRAIN ACCURACY= 0.000
Epoch 1, Minibatch 13 of 3756, Minibatch Loss= 6.2450, TRAIN ACCURACY= 0.000
Epoch 1, Minibatch 14 of 3756, Minibatch Loss= 7.0499, TRAIN ACCURACY= 0.000
Epoch 1, Minibatch 15 of 3756, Minibatch Loss= 6.9013, TRAIN ACCURACY= 0.000
Epoch 1, Minibatch 16 of 3756, Minibatch Loss= 6.5872, TRAIN ACCURACY= 0.000
Epoch 1, Minibatch 17 of 3756, Minibatch Loss= 6.1402, TRAIN ACCURACY= 0.000
Epoch 1, Minibatch 18 of 3756, Minibatch Loss= 5.6131, TRAIN ACCURACY= 0.000
Epoch 1, Minibatch 19 of 3756, Minibatch Loss= 5.5774, TRAIN ACCURACY= 0.000
Epoch 1, Minibatch 20 of 3756, Minibatch Loss= 7.2767, TRAIN ACCURACY= 0.000
Epoch 1, Minibatch 21 of 3756, Minibatch Loss= 7.0936, TRAIN ACCURACY= 0.000
Epoch 1, Minibatch 22 of 3756, Minibatch Loss= 6.7943, TRAIN ACCURACY= 0.000
Epoch 1, Minibatch 23 of 3756, Minibatch Loss= 6.4498, TRAIN ACCURACY= 0.000
Epoch 1, Minibatch 24 of 3756, Minibatch Loss= 5.8998, TRAIN ACCURACY= 0.000
Epoch 1, Minibatch 25 of 3756, Minibatch Loss= 5.2739, TRAIN ACCURACY= 0.000
@wuqiangch I believe it's very unstable ... Let it run for 50 epochs at least ... Then we can investigate it more.
What's wrong?
Epoch 79, Minibatch 1028 of 3756, Minibatch Loss= 8.6112, TRAIN ACCURACY= 0.000
Epoch 79, Minibatch 1029 of 3756, Minibatch Loss= 7.6331, TRAIN ACCURACY= 0.000
Epoch 79, Minibatch 1030 of 3756, Minibatch Loss= 6.8144, TRAIN ACCURACY= 0.000
Epoch 79, Minibatch 1031 of 3756, Minibatch Loss= 5.6934, TRAIN ACCURACY= 0.000
Epoch 79, Minibatch 1032 of 3756, Minibatch Loss= 12.7493, TRAIN ACCURACY= 0.000
Epoch 79, Minibatch 1033 of 3756, Minibatch Loss= 16.9590, TRAIN ACCURACY= 0.000
Epoch 79, Minibatch 1034 of 3756, Minibatch Loss= 16.2862, TRAIN ACCURACY= 0.000
Epoch 79, Minibatch 1035 of 3756, Minibatch Loss= 15.8414, TRAIN ACCURACY= 0.000
Epoch 79, Minibatch 1036 of 3756, Minibatch Loss= 15.1954, TRAIN ACCURACY= 0.000
Epoch 79, Minibatch 1037 of 3756, Minibatch Loss= 14.0057, TRAIN ACCURACY= 0.000
This is my process: for one person, I extract features from all of their sound files and stack all the frames. I only chose 1000 3D training samples (20, 80, 40) per person.
You are in epoch 80!! With any training data, it should have gone at least slightly toward convergence by now! I don't think it's related to the code, although you are using my code. I would say check the implementation in detail, e.g., whether you are missing something or modified anything by mistake. Check the learning rate too. It looks weird to my eyes. Please stay in touch. I will do my best to help.
--num_epochs=1000 --batch_size=128 I used the LibriSpeech dataset with about 2500 persons. For each person, I chose 1000 training samples. I have trained it for three days on a GPU, but the accuracy is always zero. Must I change some training parameters?
@astorfi Can you share your pretrained model?
@wuqiangch Unfortunately not ... Because it's been trained on a non-publicly available dataset.
@wuqiangch Do you use batch normalization? What about data standardization? What is the initial learning rate?
@astorfi I didn't change anything in the model as you provided it. I don't use data standardization, only the original MFEC features. I didn't change the initial learning rate (you set it to 10).
@wuqiangch Actually I lost the thread ... The point is, regardless of my code, and even the architecture you and I are using, with such huge training data you should be able to reach convergence at least on training, even if you get 0 percent accuracy on evaluation!
I do not know if you are creating your data in a correct way. Even so, I believe training accuracy should be increasing in any case.
@astorfi I used the features (two persons) you provided. --num_epochs=1000 --batch_size=32
Epoch 1000, Minibatch 1 of 25, Minibatch Loss= 0.0000, TRAIN ACCURACY= 0.000
Epoch 1000, Minibatch 2 of 25, Minibatch Loss= 0.0000, TRAIN ACCURACY= 0.000
Epoch 1000, Minibatch 3 of 25, Minibatch Loss= 0.0000, TRAIN ACCURACY= 0.000
Epoch 1000, Minibatch 4 of 25, Minibatch Loss= 0.0000, TRAIN ACCURACY= 0.000
Epoch 1000, Minibatch 5 of 25, Minibatch Loss= 0.0000, TRAIN ACCURACY= 0.000
Epoch 1000, Minibatch 6 of 25, Minibatch Loss= 0.1261, TRAIN ACCURACY= 60.000
Epoch 1000, Minibatch 7 of 25, Minibatch Loss= 0.2102, TRAIN ACCURACY= 100.000
Epoch 1000, Minibatch 8 of 25, Minibatch Loss= 0.2102, TRAIN ACCURACY= 100.000
Epoch 1000, Minibatch 9 of 25, Minibatch Loss= 0.2102, TRAIN ACCURACY= 100.000
Epoch 1000, Minibatch 10 of 25, Minibatch Loss= 0.2102, TRAIN ACCURACY= 100.000
Epoch 1000, Minibatch 11 of 25, Minibatch Loss= 0.2102, TRAIN ACCURACY= 100.000
Epoch 1000, Minibatch 12 of 25, Minibatch Loss= 0.2102, TRAIN ACCURACY= 100.000
Epoch 1000, Minibatch 13 of 25, Minibatch Loss= 0.2102, TRAIN ACCURACY= 100.000
Epoch 1000, Minibatch 14 of 25, Minibatch Loss= 0.2102, TRAIN ACCURACY= 100.000
Epoch 1000, Minibatch 15 of 25, Minibatch Loss= 0.2102, TRAIN ACCURACY= 100.000
Epoch 1000, Minibatch 16 of 25, Minibatch Loss= 0.2102, TRAIN ACCURACY= 100.000
Epoch 1000, Minibatch 17 of 25, Minibatch Loss= 0.2102, TRAIN ACCURACY= 100.000
Epoch 1000, Minibatch 18 of 25, Minibatch Loss= 0.2102, TRAIN ACCURACY= 100.000
Epoch 1000, Minibatch 19 of 25, Minibatch Loss= 0.2102, TRAIN ACCURACY= 100.000
Epoch 1000, Minibatch 20 of 25, Minibatch Loss= 0.2102, TRAIN ACCURACY= 100.000
Epoch 1000, Minibatch 21 of 25, Minibatch Loss= 0.0210, TRAIN ACCURACY= 10.000
Epoch 1000, Minibatch 22 of 25, Minibatch Loss= 0.0000, TRAIN ACCURACY= 0.000
Epoch 1000, Minibatch 23 of 25, Minibatch Loss= 0.0000, TRAIN ACCURACY= 0.000
Epoch 1000, Minibatch 24 of 25, Minibatch Loss= 0.0000, TRAIN ACCURACY= 0.000
Epoch 1000, Minibatch 25 of 25, Minibatch Loss= 0.0000, TRAIN ACCURACY= 0.000
TESTING after finishing the training on: epoch 1000 Test Accuracy 1000, Mean= 75.0000, std= 43.301
Your std is very high ... You may have to do mean subtraction and standardization of the data ... It's not done in the code by default.
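The mean subtraction and standardization mentioned above can be sketched like this (the feature shape and the random data are placeholders; the key point is that the statistics come from the training set only and are reused on the test set):

```python
import numpy as np

# Hypothetical training features: (num_samples, num_coefficients).
train = np.random.randn(1000, 40) * 5.0 + 3.0

# Compute per-feature statistics on the TRAINING data only.
mean = train.mean(axis=0)
std = train.std(axis=0)

# Standardize; the small epsilon guards against zero-variance features.
train_norm = (train - mean) / (std + 1e-8)

# The same train statistics must be applied to test data:
# test_norm = (test - mean) / (std + 1e-8)
```

After this, each feature dimension has approximately zero mean and unit variance, which tends to stabilize the early phase of training.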
@astorfi I did mean subtraction and standardization of the data, but it did not work either. Can you use the features (two persons) you provided to train the model and show me the result? Thanks!
@wuqiangch Please run the recently updated run.sh file to see the results. Moreover, regardless of my architecture, you should be able to modify the hyperparameters to get almost perfect results at least on training. It's just a softmax!
Hello! I could not find the pipeline preparation example. Do you mind telling me where it is?
thank you in advance
Can you provide input pipeline? Thanks!