Open rightnknow opened 5 years ago
I propose a transfer learning solution that uses a CNN to extract features and then an RNN to do the classification. Two structures are described below.
The current design architecture is CNN+RNN.
Here we use a CNN as a feature extractor. We can do this with an existing pre-trained CNN targeted at audio features; the detailed spec can be viewed at https://github.com/tensorflow/models/tree/master/research/audioset
The output of this CNN extractor is computed with a window size of 25 ms and a window hop of 10 ms, using the following features:
1. A spectrogram, via the Short-Time Fourier Transform
2. A mel spectrogram with 64 bands
3. A stabilized log mel spectrogram: log(mel-spectrum + 0.01)
Each example is a 0.96-second frame, covering 64 mel bands and 96 frames of 10 ms each.
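The feature pipeline above (25 ms windows, 10 ms hop, 64 mel bands, log(mel + 0.01)) can be sketched in plain numpy. This is a minimal illustration, not the actual AudioSet/VGGish code; the random signal stands in for real audio, and the triangular mel filterbank is a simplified version of what the pre-trained model uses:

```python
import numpy as np

def frame_signal(y, frame_len, hop):
    # Slice a 1-D signal into overlapping frames
    n = 1 + (len(y) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n)[:, None]
    return y[idx]

def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    # Triangular filters spaced evenly on the mel scale
    freqs = np.linspace(0, sr / 2, n_fft // 2 + 1)
    hz_pts = mel_to_hz(np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2))
    fb = np.zeros((n_mels, len(freqs)))
    for i in range(n_mels):
        left, center, right = hz_pts[i], hz_pts[i + 1], hz_pts[i + 2]
        fb[i] = np.clip(np.minimum((freqs - left) / (center - left),
                                   (right - freqs) / (right - center)), 0, None)
    return fb

sr = 16000
y = np.random.randn(sr)                                   # 1 s stand-in for real audio
frame_len, hop = int(0.025 * sr), int(0.010 * sr)         # 25 ms window, 10 ms hop
frames = frame_signal(y, frame_len, hop) * np.hanning(frame_len)
spec = np.abs(np.fft.rfft(frames, n=frame_len)) ** 2      # power spectrogram (STFT)
mel = spec @ mel_filterbank(64, frame_len, sr).T          # 64 mel bands
log_mel = np.log(mel + 0.01)                              # stabilized log-mel
example = log_mel[:96]                                    # one 0.96 s example: 96 frames x 64 bands
```

With a 1-second input at 16 kHz this produces 98 frames of 64 log-mel values, from which the first 96 form one example.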
Here's the model loss plot:
Here's the accuracy plot:
Here's the final test accuracy on the test set:
231/231 [==============================] - 0s 442us/step
test accuracy is 0.5064935072675928
For a quick summary: Input: a WAV-format audio file lasting at least 1 second. Output: a 128-dimensional feature vector for each second, called a frame; a single audio clip can yield multiple frames.
The training set then goes through PCA and a whitening process. Its dimensions are [number of samples, number of frames, number of features]. The data is then passed into 3 LSTM layers of size 64, followed by a fully connected layer with a softmax activation function.
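The PCA + whitening step can be sketched as follows. This is a minimal numpy illustration under the assumption that PCA is fit over the feature dimension, treating every frame of every sample as one observation; random data stands in for the real train set:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10, 128))        # stand-in: [samples, frames, features]

flat = X.reshape(-1, X.shape[-1])          # every frame becomes one observation
centered = flat - flat.mean(axis=0)

# PCA via eigendecomposition of the feature covariance matrix
cov = centered.T @ centered / (len(centered) - 1)
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]          # sort components by explained variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Whitening: project onto components, scale each to unit variance
eps = 1e-8                                 # guards against division by tiny eigenvalues
whitened = (centered @ eigvecs) / np.sqrt(eigvals + eps)
X_white = whitened.reshape(X.shape)        # back to [samples, frames, features]
```

After this transform every feature dimension has (approximately) unit variance and zero correlation, which is what the LSTM layers then consume.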
Then there's the tuning part; we aim to reduce the validation loss by tuning the following parameters:
1. Number of layers
2. Number of cells in one layer
3. Dropout ratio
4. Using/not using a bidirectional network
5. Early stopping
Current configuration:
1. Number of layers: 2
2. Number of cells in one layer: 32
3. Dropout ratio: 0.3
4. Bidirectional network: No
5. Early stopping
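The configuration above maps onto a Keras model roughly like this. It's a sketch, not the exact training script: the number of classes and the 128-dimensional frame features are assumptions carried over from the summary earlier, and the early-stopping patience value is arbitrary:

```python
import numpy as np
import tensorflow as tf

NUM_CLASSES = 10   # assumption: the actual number of classes isn't stated here
FEAT_DIM = 128     # per-frame embedding size from the CNN extractor

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(None, FEAT_DIM)),        # variable number of frames
    tf.keras.layers.LSTM(32, return_sequences=True, dropout=0.3),
    tf.keras.layers.LSTM(32, dropout=0.3),                # 2 layers, 32 cells, not bidirectional
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])

# Early stopping on validation loss (patience chosen arbitrarily here)
early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                              restore_best_weights=True)
# model.fit(X_train, y_train, validation_data=(X_val, y_val), callbacks=[early_stop])
```

The bidirectional variant from the search space would simply wrap each LSTM in `tf.keras.layers.Bidirectional(...)`.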
Here's the resulting loss plot:
Here's the resulting validation accuracy plot:
Test accuracy:
231/231 [==============================] - 3s 14ms/step
test accuracy is 0.5411255418996275
Currently the design ideas are:
1. Use a pre-trained CNN and add another layer on top. Q: determine whether we should create our own 1D CNN or use a pre-trained one; unfortunately, most available CNNs are built for speech recognition. We can add another layer on top to fine-tune it, but the result is not guaranteed. Pre-trained CNN, Mozilla DeepSpeech: https://github.com/mozilla/DeepSpeech/wiki#frequently-asked-questions
2. Use an RNN only: https://arxiv.org/abs/1701.08071
Construct a 1D convolutional neural network to learn and extract features from the raw audio input, then feed the result into an RNN, without going through Connectionist Temporal Classification.
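A 1D-CNN-into-RNN front end over raw audio can be sketched like this. The kernel/stride sizes are chosen to mimic the 25 ms window and 10 ms hop used earlier at 16 kHz; the filter counts and 10-class head are assumptions, not values from the paper:

```python
import numpy as np
import tensorflow as tf

SR = 16000
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(None, 1)),              # raw waveform samples
    # ~25 ms receptive field (400 samples), ~10 ms hop (stride 160)
    tf.keras.layers.Conv1D(32, kernel_size=400, strides=160, activation="relu"),
    tf.keras.layers.Conv1D(64, kernel_size=3, strides=2, activation="relu"),
    tf.keras.layers.LSTM(64),                            # RNN over learned features
    tf.keras.layers.Dense(10, activation="softmax"),     # assumption: 10 classes
])
out = model.predict(np.zeros((1, SR, 1)))                # one second of silence
```

Here the convolutions replace the hand-designed log-mel features, and the per-clip softmax avoids CTC entirely, which only becomes necessary when the output is an unaligned label *sequence*.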
These methods all use Connectionist Temporal Classification (CTC). I don't quite understand it yet but will try to understand it as soon as possible. Here's the paper: https://www.cs.toronto.edu/~graves/icml_2006.pdf (there is also a Chinese-language summary of the paper).
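One core piece of CTC from the Graves et al. paper is easy to show in isolation: the many-to-one map B that turns a per-timestep path (which may contain a special blank symbol) into a label sequence by collapsing consecutive repeats and then removing blanks. A minimal sketch, using "-" as the blank:

```python
BLANK = "-"

def ctc_collapse(path):
    """CTC's map B: collapse consecutive repeats, then drop blank symbols."""
    out, prev = [], None
    for sym in path:
        if sym != prev:            # keep only the first of each run of repeats
            out.append(sym)
        prev = sym
    return [s for s in out if s != BLANK]

# Many alignments map to the same label sequence:
print(ctc_collapse(list("hh-e-ll-ll-oo")))  # -> ['h', 'e', 'l', 'l', 'o']
print(ctc_collapse(list("a--a")))           # -> ['a', 'a']
```

The blank is what lets CTC emit the same label twice in a row ("ll" needs a blank between the two l-runs); the loss itself then sums the probability of every path that collapses to the target sequence.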