Open rightnknow opened 5 years ago
I propose a transfer learning solution that uses a CNN to extract features and then an RNN to do the classification. Two structures are described below.
The current design architecture is CNN+RNN.
Here we use a CNN as a feature extractor. We can do this with an existing pre-trained CNN targeted at audio features; the detailed spec can be viewed at https://github.com/tensorflow/models/tree/master/research/audioset
The output of this CNN extractor is computed with a window size of 25 ms and a window hop of 10 ms, using the following features:
1. A spectrogram, via the Short-Time Fourier Transform
2. A mel spectrogram with 64 bands
3. A stabilized log mel spectrogram: log(mel-spectrum + 0.01)
Each example is a 0.96-second frame, covering 64 mel bands and 96 frames of 10 ms each.
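The feature pipeline above (25 ms windows, 10 ms hop, 64 mel bands, log(mel + 0.01)) can be sketched in plain numpy. This is a minimal illustration, not the actual AudioSet/VGGish code; the random signal stands in for real audio, and the triangular mel filterbank is a simplified version of what the pre-trained model uses:

```python
import numpy as np

def frame_signal(y, frame_len, hop):
    # Slice a 1-D signal into overlapping frames
    n = 1 + (len(y) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n)[:, None]
    return y[idx]

def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    # Triangular filters spaced evenly on the mel scale
    freqs = np.linspace(0, sr / 2, n_fft // 2 + 1)
    hz_pts = mel_to_hz(np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2))
    fb = np.zeros((n_mels, len(freqs)))
    for i in range(n_mels):
        left, center, right = hz_pts[i], hz_pts[i + 1], hz_pts[i + 2]
        fb[i] = np.clip(np.minimum((freqs - left) / (center - left),
                                   (right - freqs) / (right - center)), 0, None)
    return fb

sr = 16000
y = np.random.randn(sr)                                   # 1 s stand-in for real audio
frame_len, hop = int(0.025 * sr), int(0.010 * sr)         # 25 ms window, 10 ms hop
frames = frame_signal(y, frame_len, hop) * np.hanning(frame_len)
spec = np.abs(np.fft.rfft(frames, n=frame_len)) ** 2      # power spectrogram (STFT)
mel = spec @ mel_filterbank(64, frame_len, sr).T          # 64 mel bands
log_mel = np.log(mel + 0.01)                              # stabilized log-mel
example = log_mel[:96]                                    # one 0.96 s example: 96 frames x 64 bands
```

With a 1-second input at 16 kHz this produces 98 frames of 64 log-mel values, from which the first 96 form one example.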
Here's the model loss plot:
Here's the accuracy plot:
Here's the final test accuracy on the test set:
231/231 [==============================] - 0s 442us/step
test accuracy is 0.5064935072675928
For a quick summary: Input: a WAV-format audio file lasting at least 1 second. Output: a 128-dimensional feature vector for each second, called a frame; a single audio clip can yield multiple frames.
The training set then goes through PCA and a whitening process. Its dimensions are [number of samples, number of frames, number of features]. The data is then passed into 3 LSTM layers of size 64, followed by a fully connected layer with a softmax activation function.
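The PCA + whitening step can be sketched as follows. This is a minimal numpy illustration under the assumption that PCA is fit over the feature dimension, treating every frame of every sample as one observation; random data stands in for the real train set:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10, 128))        # stand-in: [samples, frames, features]

flat = X.reshape(-1, X.shape[-1])          # every frame becomes one observation
centered = flat - flat.mean(axis=0)

# PCA via eigendecomposition of the feature covariance matrix
cov = centered.T @ centered / (len(centered) - 1)
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]          # sort components by explained variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Whitening: project onto components, scale each to unit variance
eps = 1e-8                                 # guards against division by tiny eigenvalues
whitened = (centered @ eigvecs) / np.sqrt(eigvals + eps)
X_white = whitened.reshape(X.shape)        # back to [samples, frames, features]
```

After this transform every feature dimension has (approximately) unit variance and zero correlation, which is what the LSTM layers then consume.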
Then there's the tuning part; we aim to reduce the validation loss by tuning the following parameters:
1. Number of layers
2. Number of cells in one layer
3. Dropout ratio
4. Using/not using a bidirectional network
5. Early stopping
Current configuration:
1. Number of layers: 2
2. Number of cells in one layer: 32
3. Dropout ratio: 0.3
4. Bidirectional network: No
5. Early stopping
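The configuration above maps onto a Keras model roughly like this. It's a sketch, not the exact training script: the number of classes and the 128-dimensional frame features are assumptions carried over from the summary earlier, and the early-stopping patience value is arbitrary:

```python
import numpy as np
import tensorflow as tf

NUM_CLASSES = 10   # assumption: the actual number of classes isn't stated here
FEAT_DIM = 128     # per-frame embedding size from the CNN extractor

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(None, FEAT_DIM)),        # variable number of frames
    tf.keras.layers.LSTM(32, return_sequences=True, dropout=0.3),
    tf.keras.layers.LSTM(32, dropout=0.3),                # 2 layers, 32 cells, not bidirectional
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])

# Early stopping on validation loss (patience chosen arbitrarily here)
early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                              restore_best_weights=True)
# model.fit(X_train, y_train, validation_data=(X_val, y_val), callbacks=[early_stop])
```

The bidirectional variant from the search space would simply wrap each LSTM in `tf.keras.layers.Bidirectional(...)`.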
Here's the resulting loss plot:
Here's the resulting validation accuracy plot:
Test accuracy:
231/231 [==============================] - 3s 14ms/step
test accuracy is 0.5411255418996275
Currently the design ideas are:
1. Use a pre-trained CNN and add another layer on top. Q: determine whether we should create our own 1D CNN or use a pre-trained one; unfortunately, most available CNNs are built for speech recognition. We can add another layer on top to fine-tune it, but the result is not guaranteed. Pre-trained CNN, Mozilla DeepSpeech: https://github.com/mozilla/DeepSpeech/wiki#frequently-asked-questions
2. Use an RNN only: https://arxiv.org/abs/1701.08071
Construct a 1D convolutional neural network to learn and extract features from the raw audio input, then feed the result into an RNN, without going through Connectionist Temporal Classification.
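A 1D-CNN-into-RNN front end over raw audio can be sketched like this. The kernel/stride sizes are chosen to mimic the 25 ms window and 10 ms hop used earlier at 16 kHz; the filter counts and 10-class head are assumptions, not values from the paper:

```python
import numpy as np
import tensorflow as tf

SR = 16000
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(None, 1)),              # raw waveform samples
    # ~25 ms receptive field (400 samples), ~10 ms hop (stride 160)
    tf.keras.layers.Conv1D(32, kernel_size=400, strides=160, activation="relu"),
    tf.keras.layers.Conv1D(64, kernel_size=3, strides=2, activation="relu"),
    tf.keras.layers.LSTM(64),                            # RNN over learned features
    tf.keras.layers.Dense(10, activation="softmax"),     # assumption: 10 classes
])
out = model.predict(np.zeros((1, SR, 1)))                # one second of silence
```

Here the convolutions replace the hand-designed log-mel features, and the per-clip softmax avoids CTC entirely, which only becomes necessary when the output is an unaligned label *sequence*.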
These methods all use Connectionist Temporal Classification (CTC). I don't quite understand it yet but will try to understand it as soon as possible. Here's the paper: https://www.cs.toronto.edu/~graves/icml_2006.pdf (there is also a Chinese-language summary of the paper).
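One core piece of CTC from the Graves et al. paper is easy to show in isolation: the many-to-one map B that turns a per-timestep path (which may contain a special blank symbol) into a label sequence by collapsing consecutive repeats and then removing blanks. A minimal sketch, using "-" as the blank:

```python
BLANK = "-"

def ctc_collapse(path):
    """CTC's map B: collapse consecutive repeats, then drop blank symbols."""
    out, prev = [], None
    for sym in path:
        if sym != prev:            # keep only the first of each run of repeats
            out.append(sym)
        prev = sym
    return [s for s in out if s != BLANK]

# Many alignments map to the same label sequence:
print(ctc_collapse(list("hh-e-ll-ll-oo")))  # -> ['h', 'e', 'l', 'l', 'o']
print(ctc_collapse(list("a--a")))           # -> ['a', 'a']
```

The blank is what lets CTC emit the same label twice in a row ("ll" needs a blank between the two l-runs); the loss itself then sums the probability of every path that collapses to the target sequence.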