Note to self: it would also help to generate a confusion matrix. Flag: add this feature.
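A minimal sketch of what that feature could look like, assuming scikit-learn is available and that y_true/y_pred are collected from the evaluation loop (all names here are illustrative, not from the repo):

```python
# Hypothetical sketch: confusion matrix over the RAVDESS emotion classes.
# y_true and y_pred are assumed to be lists of label strings gathered
# during evaluation; none of these names come from the actual scripts.
from sklearn.metrics import confusion_matrix

RAVDESS_CLASSES = ["neutral", "calm", "happy", "sad",
                   "angry", "fearful", "disgust", "surprised"]

def report_confusion(y_true, y_pred):
    # Rows are true classes, columns are predicted classes.
    cm = confusion_matrix(y_true, y_pred, labels=RAVDESS_CLASSES)
    for cls, row in zip(RAVDESS_CLASSES, cm):
        print(cls.ljust(10), row)
    return cm
```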
Experimented with transfer learning on the Inception v3 model (https://www.cs.unc.edu/~wliu/papers/GoogLeNet.pdf). It has a really nice architecture that takes away the conundrum of choosing sizes for convolution kernels. Test accuracy is around 50%, and training accuracy hikes up to almost 90% when retraining on top of the second-to-last layer of the model. A bit more decent than the MobileNet experiment. Legends (other research papers) say they were able to get around ~80% test accuracy with other architectures such as VGG16/19, ResNet, etc.
Then, to test the hypothesis that using fewer layers prevents irrelevant high-level feature learning, I conducted another experiment that stacked our custom NN on top of the first pooling layer. This actually worsened the prediction accuracy; the cross-entropy loss was not converging. This can be attributed to the gargantuan number of features coming out of the lower-level convolution/pooling layers: the first pooling layer unravels to around 55k values (192 feature maps of 17x17 pixels), so a dense layer on top of it has a huge number of weights, which is rather demanding for our dataset of around 1k samples. So this approach leads us into overfitting.
(The above experiment was a nice coding challenge nonetheless. The Python script provided by TensorFlow contains some ~1k lines of code, so it was cool to get synced with it. It then turned out I only needed to add two additional lines of code to achieve my purpose.)
Sidenote on this experiment: the bottleneck layer I tapped into is mixed/tower_2/pool.
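For reference, a minimal sketch of cutting the network at an earlier layer, written against tf.keras rather than the TF1 retrain.py graph; the layer name "mixed2" and the head sizes are assumptions standing in for the mixed/tower_2/pool setup:

```python
# Sketch under assumptions: retrain a small head on an early Inception v3
# block instead of the usual 2048-d bottleneck. Layer "mixed2" is a stand-in.
import tensorflow as tf

base = tf.keras.applications.InceptionV3(
    include_top=False, weights="imagenet", input_shape=(299, 299, 3))
cut = tf.keras.Model(base.input, base.get_layer("mixed2").output)
cut.trainable = False  # freeze the pretrained convolutional layers

model = tf.keras.Sequential([
    cut,
    tf.keras.layers.Flatten(),  # early layers unravel to a very large vector
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(8, activation="softmax"),  # 8 RAVDESS emotions
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy", metrics=["accuracy"])
```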
While seeking improvements from other avenues, I ran into a few new papers:
And it caught my attention that, apparently, the colors in which our spectrograms are saved have an impact on the accuracy of CNN models. Paper 1 actually found that RGB produced higher accuracy than these colors individually or than the grayscale image. This is curious; maybe my grayscale spectrograms are the culprit. I will conduct another experiment to verify whether this is true.
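A quick sketch of how that verification could go, assuming the spectrograms exist as 2-D arrays and matplotlib does the rendering (the filenames and the viridis colormap are illustrative choices):

```python
# Sketch: save the same spectrogram in grayscale and with an RGB colormap,
# then retrain the model on each variant and compare test accuracy.
import numpy as np
import matplotlib.pyplot as plt

def save_spectrogram(spec_db, path, cmap):
    # spec_db: 2-D array of spectrogram magnitudes (e.g. in dB)
    plt.imsave(path, spec_db, cmap=cmap, origin="lower")

spec = np.random.rand(128, 256)               # stand-in for a real spectrogram
save_spectrogram(spec, "gray.png", "gray")    # current grayscale pipeline
save_spectrogram(spec, "rgb.png", "viridis")  # RGB colormap variant
```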
(Sidenote: the discussion in the Inception v3 paper has lots to say about the sparsity of features and how convolution reduces it. Reminds me a lot of 367. Perhaps it's time to build a good foundation.)
Meeting minutes from Oct 30
Oct 30 ToDo:
As discussed in the Oct 30 meeting, the current CNN model has an overfitting problem: validation accuracy plateaus while training accuracy keeps climbing. I will:
INFO:tensorflow:2018-11-12 22:41:57.811166: Step 3990: Train accuracy = 47.0%
INFO:tensorflow:2018-11-12 22:41:57.811314: Step 3990: Cross entropy = 1.297977
INFO:tensorflow:2018-11-12 22:41:57.895010: Step 3990: Validation accuracy = 54.0% (N=100)
INFO:tensorflow:2018-11-12 22:41:58.691099: Step 3999: Train accuracy = 47.0%
INFO:tensorflow:2018-11-12 22:41:58.691244: Step 3999: Cross entropy = 1.309510
INFO:tensorflow:2018-11-12 22:41:58.774995: Step 3999: Validation accuracy = 45.0% (N=100)
INFO:tensorflow:Final test accuracy = 48.2% (N=139)
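One standard remedy for this, and the one the prevent-overfit-w-dropout branch below is named after, is dropout before the final fully connected layer. A hedged TF1-style sketch of that idea (the placeholder names, the 2048-d bottleneck size, and the 0.5 rate are assumptions, not the branch's actual code):

```python
# Sketch under assumptions: drop bottleneck activations before the final
# fully connected layer so the head cannot memorize the small training set.
import tensorflow as tf

bottlenecks = tf.placeholder(tf.float32, [None, 2048], name="BottleneckInput")
is_training = tf.placeholder_with_default(False, [], name="is_training")

dropped = tf.layers.dropout(bottlenecks, rate=0.5, training=is_training)
logits = tf.layers.dense(dropped, units=8, name="final_layer")  # 8 emotions
```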
Also discussed was how much data is needed for the CNN architecture. I will try to:
Experiment with different hyper-parameters of the current CNN model and analyze their impact on classification accuracy
https://arxiv.org/ftp/arxiv/papers/1707/1707.09917.pdf Check this paper, I think it's worth a read.
CNN code is on https://github.com/KatJHuang/tensorflow-for-poets-2/tree/prevent-overfit-w-dropout
You can git clone https://github.com/KatJHuang/tensorflow-for-poets-2.git and then git checkout prevent-overfit-w-dropout to get all the relevant source files. The branch contains scripts to serialize spectrogram data, run training, and make a prediction.
Before you run training, make sure your spectrograms are already serialized into a TFRecords file. You can serialize the labelled spectrogram images using the image2tfrecords.py script as follows:
python image2tfrecords.py $LABELLED_SPECTROGRAM_DIR $SERIALIZATION_OUTPUT_DIR
For obtaining a folder of labelled spectrogram images, please consult this comment from above.
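For a sense of what that serialization step amounts to, here is a minimal sketch (the feature keys and the one-subfolder-per-class layout are assumptions; image2tfrecords.py is the authoritative format):

```python
# Sketch under assumptions: encode each labelled image into a tf.train.Example
# and append it to a single TFRecord file (TF1-era API).
import os
import tensorflow as tf

def serialize_dir(labelled_dir, out_path):
    with tf.python_io.TFRecordWriter(out_path) as writer:
        for label in sorted(os.listdir(labelled_dir)):  # one subfolder per class
            class_dir = os.path.join(labelled_dir, label)
            for fname in sorted(os.listdir(class_dir)):
                with open(os.path.join(class_dir, fname), "rb") as f:
                    img_bytes = f.read()
                example = tf.train.Example(features=tf.train.Features(feature={
                    "image": tf.train.Feature(
                        bytes_list=tf.train.BytesList(value=[img_bytes])),
                    "label": tf.train.Feature(
                        bytes_list=tf.train.BytesList(value=[label.encode()])),
                }))
                writer.write(example.SerializeToString())
```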
retrain.py script
Input: Spectrograms serialized into the TFRecords format
Output: the aud_emo/models/ folder
Command: Assuming your pwd is the tensorflow-for-poets-2/scripts/ folder, where the retrain.py script is located, you can run:
python retrain.py \
  --bottleneck_dir=aud_emo/bottlenecks \
  --model_dir=aud_emo/models/ \
  --summaries_dir=aud_emo/training_summaries/inception_v3_standard \
  --output_graph=aud_emo/retrained_graph.pb \
  --output_labels=aud_emo/retrained_labels.txt \
  --architecture=inception_v3 \
  --image_records_dir=$SERIALIZATION_OUTPUT_DIR \
  --how_many_training_steps=1000
Training should then start. If you are interested in the internals of the training process, you can delve into the retrain.py code. The code adapted to our purpose is in the function where the extra fully connected layer is added.
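To sanity-check the outputs afterwards, here is a hedged TF1 sketch of loading the retrained graph and classifying one spectrogram; the tensor names below are the usual retrain.py defaults and should be verified against the actual graph:

```python
# Sketch under assumptions: run one image through the exported graph.
# "DecodeJpeg/contents:0" and "final_result:0" are retrain.py conventions.
import tensorflow as tf

def predict(image_path,
            graph_path="aud_emo/retrained_graph.pb",
            labels_path="aud_emo/retrained_labels.txt"):
    graph_def = tf.GraphDef()
    with tf.gfile.GFile(graph_path, "rb") as f:
        graph_def.ParseFromString(f.read())
    with tf.Graph().as_default() as graph:
        tf.import_graph_def(graph_def, name="")
    with open(labels_path) as f:
        labels = [line.strip() for line in f]
    with tf.Session(graph=graph) as sess:
        image_data = tf.gfile.GFile(image_path, "rb").read()
        probs = sess.run("final_result:0",
                         {"DecodeJpeg/contents:0": image_data})[0]
    return labels[probs.argmax()], float(probs.max())
```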
This thread was created to track progress on the CNN/transfer learning research.
Found an implementation/tutorial on CNN transfer learning and experimented with it on our RAVDESS dataset. With no modification to their implementation, we can achieve an accuracy of 42.4%.
As a quick test, this accuracy is decent, but further tuning of the model would definitely improve the result. The current model uses all conv/pooling layers of the MobileNet model, which are better suited to real-world images as opposed to our B&W images of wavy lines. That many layers are good at modeling the complicated structures in real images but would be too much for our purposes. We can try using fewer layers and get better modeling behavior.
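One way to explore that idea, sketched with tf.keras MobileNet; the cut point "conv_pw_6_relu" is just an example layer name, not a recommendation from any experiment:

```python
# Sketch: list MobileNet's layers to pick a shallower cut point, then
# rebuild the feature extractor up to the chosen layer.
import tensorflow as tf

base = tf.keras.applications.MobileNet(
    include_top=False, weights="imagenet", input_shape=(224, 224, 3))
for layer in base.layers:
    print(layer.name, layer.output_shape)  # inspect candidate cut points

# Keep only the earlier, more generic feature detectors.
shallow = tf.keras.Model(base.input, base.get_layer("conv_pw_6_relu").output)
```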
Tutorial link: https://codelabs.developers.google.com/codelabs/tensorflow-for-poets/#0 PLEASE TAKE A LOOK AT IT^ It's very helpful for understanding what is happening here :)
Result
Dataset: RAVDESS
Test accuracy: 42.4% (on 4000 iterations)
INFO:tensorflow:Final test accuracy = 42.4% (N=139)
Full training report: train_report_no_mod.pdf
Data preparation
Steps to reproduce (assuming the RAVDESS speech dataset is already downloaded):
python voiced_segment_extractor.py $aggressive_lvl --mode dir $input_dir --output $output_dir
python waveform2img.py $input_dir $output_dir
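For context, a minimal sketch of what the waveform2img.py step amounts to, assuming librosa is used for the spectrograms (the actual script's parameters may differ):

```python
# Sketch under assumptions: turn a .wav clip into a grayscale
# mel-spectrogram image, one image per input file.
import numpy as np
import librosa
import matplotlib.pyplot as plt

def wav_to_spectrogram_image(wav_path, out_path):
    y, sr = librosa.load(wav_path, sr=None)        # keep native sample rate
    mel = librosa.feature.melspectrogram(y=y, sr=sr)
    mel_db = librosa.power_to_db(mel, ref=np.max)  # log-scale magnitudes
    plt.imsave(out_path, mel_db, cmap="gray", origin="lower")
```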