Closed lzj520 closed 5 years ago
How well does the model recognize Chinese speech? Is the result I am getting normal or not?
This is not normal. The gap between your training and validation error is increasing as you train your model. This means your model is overfitting and will not be able to recognize or predict the emotions in audio files it has never encountered before.
I trained on 16000 Hz audio from a Chinese corpus with duration=2.5 and offset=0.5 and obtained results. If the length of the audio varies, how can I measure the model's accuracy reliably? With duration=2.5 and offset=0.5 I sometimes get: ValueError: Error when checking input: expected conv1d_1_input to have shape (32, 1) but got array with shape (24, 1)
You can try increasing the sampling rate to anything greater than 16000. That way you can extract more features from the audio files. Since you clipped your audio files to 2.5 sec, try increasing that to maybe 3 sec, or reducing it below 2.5 sec, as you mentioned some files are shorter. If an audio file is shorter than the clip length, the features for the missing portion will be 0.
The ValueError you are facing is a dimension mismatch in your input data. Check that the test and training data have the same feature dimensions.
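One common way to handle clips of different lengths is to zero-pad (or trim) every feature vector to a single fixed length before feeding it to the network, so a 24-frame clip and a 32-frame clip produce the same input shape. This is a minimal sketch, not the repo's actual code; `TARGET_LEN = 32` and the helper name `pad_or_trim` are hypothetical, chosen to match the shape in the error message above:

```python
import numpy as np

# Hypothetical fixed frame count; replace with the length your model expects
TARGET_LEN = 32

def pad_or_trim(features, target_len=TARGET_LEN):
    """Zero-pad short feature vectors (or trim long ones) to a fixed length
    so every clip yields the same input shape for the Conv1D layer."""
    features = np.asarray(features, dtype=np.float32)
    if features.shape[0] < target_len:
        # Append zeros for the missing frames (short audio file)
        features = np.pad(features, (0, target_len - features.shape[0]))
    # Trim anything beyond the target length (long audio file)
    return features[:target_len]

# A 24-frame clip (the shape from the error message) becomes 32 frames:
padded = pad_or_trim(np.random.rand(24))
print(padded.shape)  # (32,)
```

Applying this to every clip, regardless of its original duration, keeps the training, test, and live-prediction inputs at the same dimension.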
I tried duration=1 and sr=22050*2, recognizing the emotions anger and happiness. The recognition accuracy was poor; how can I improve it?
duration = 1 is not a good option. If you provide only 1 sec of audio, it is difficult to extract valuable features. I would suggest at least 2-3 seconds, so the model has enough features to distinguish between different emotions.
How should I train when some files in the corpus are shorter than 2 seconds? They cause errors.
Are you saying that your corpus consists of audio files less than 2 seconds long?
Yes. I use a Chinese corpus that contains audio files shorter than 2 seconds, and the training result is not ideal.
With duration=2 I get: ValueError: Error when checking input: expected conv1d_1_input to have shape (173, 1) but got array with shape (155, 1). Is there any way to fix this?
I tried modifying the parameters and got both emotions right 73 percent of the time, but I always get the following results when testing on audio.
Without modifying your code, the training results are as follows.
Yeah, the shape of the input differs from what you are passing to the conv_1d layer. Refer to cells 30 and 45: there you can see that I checked the shape of my training set and then supplied that same shape as the input to the conv_1d layer in the 45th cell.
When you test it on live predictions, the audio file you pass as input doesn't yield many features after extraction; it gives only two, one starting with 1 and the other with 0.
Sorry, I don't understand the language. Can you post it in English?