capstone496 / SpeechSentiments


[Progress Report] Emotion Detection with CNN transfer learning approach #5

Open KatJHuang opened 5 years ago

KatJHuang commented 5 years ago

This thread is created to track the progress in the CNN/transfer learning research.


Found an implementation/tutorial on CNN transfer learning and experimented with it on our RAVDESS dataset. With no modification to their implementation, we can achieve an accuracy of 42.4%.

As a quick test, this accuracy is decent, but further tuning of the model should improve the result. The current model uses all of the conv/pooling layers of MobileNet, which are better suited to real-world images than to our black-and-white images of wavy lines. That many layers are good at modeling the complicated structures in real images but are overkill for our purposes. We can try using fewer layers and get better modeling behavior.
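For illustration, here is a minimal tf.keras sketch of the "fewer layers" idea: load the same MobileNet backbone (width multiplier 0.5, 224x224 input) without its classifier, cut it at an intermediate layer, and train a small head on top. The cut-point layer name and the 8-class head are assumptions for illustration only; the actual experiment ran the codelab's retrain.py unmodified.

    import tensorflow as tf

    # MobileNet backbone used by the codelab (alpha=0.5, 224x224 input),
    # without its ImageNet classification head.
    base = tf.keras.applications.MobileNet(
        input_shape=(224, 224, 3), alpha=0.5, include_top=False, weights="imagenet")

    # Hypothetical cut point roughly halfway through the conv/pooling stack;
    # the layer name is illustrative and depends on the Keras version.
    cut = base.get_layer("conv_pw_7_relu").output
    truncated = tf.keras.Model(inputs=base.input, outputs=cut)
    truncated.trainable = False  # keep the pretrained filters frozen

    # Small trainable head for the 8 RAVDESS emotion classes.
    model = tf.keras.Sequential([
        truncated,
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.Dense(8, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])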

Tutorial link: https://codelabs.developers.google.com/codelabs/tensorflow-for-poets/#0 Please take a look at it. It's very helpful for understanding what's happening here :)

Result

Dataset: RAVDESS
Test accuracy: 42.4% after 4,000 iterations
Log line: INFO:tensorflow:Final test accuracy = 42.4% (N=139)


train_report_no_mod.pdf

Data preparation

  1. Trim silence with voiced_segment_extractor.py
  2. Convert the waveforms to an image representation (mel spectrograms in this case) with waveform2img.py and save them to the directory structure expected by the implementation in the tutorial above. Both scripts can be found in my experimentation repo; a rough sketch of these two steps is shown right after this list.
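The two preparation steps roughly amount to the sketch below (using librosa's energy-based trim and matplotlib; the actual voiced_segment_extractor.py / waveform2img.py scripts may differ in details such as the silence-detection method and image size):

    import librosa
    import librosa.display
    import matplotlib.pyplot as plt
    import numpy as np

    def wav_to_melspectrogram_image(wav_path, png_path):
        # 1. Load audio and trim leading/trailing silence (simple energy-based
        #    trim here; the real pipeline uses voiced_segment_extractor.py).
        y, sr = librosa.load(wav_path, sr=None)
        y, _ = librosa.effects.trim(y, top_db=30)

        # 2. Compute a mel spectrogram in dB and save it as a PNG.
        mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
        mel_db = librosa.power_to_db(mel, ref=np.max)

        plt.figure(figsize=(2.24, 2.24), dpi=100)  # ~224x224 px for MobileNet
        librosa.display.specshow(mel_db, sr=sr)
        plt.axis("off")
        plt.savefig(png_path, bbox_inches="tight", pad_inches=0)
        plt.close()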

Steps to reproduce

(assuming RAVDESS speech dataset is already downloaded):

  1. Trim silence and save the shortened audio file: python voiced_segment_extractor.py $aggressive_lvl --mode dir $input_dir --output $output_dir
  2. Convert trimmed audio waveforms into spectrograms and group them by their emotion labels: python waveform2img.py $input_dir $output_dir
  3. Get the transfer learning implementation from the TensorFlow repo, make NO modification to the Python code, and run the training commands verbatim (except using my own data folders from step 2, of course).
    IMAGE_SIZE=224
    ARCHITECTURE="mobilenet_0.50_${IMAGE_SIZE}"
    python -m scripts.retrain \
    --bottleneck_dir=aud_emo/bottlenecks \
    --model_dir=aud_emo/models/ \
    --summaries_dir=aud_emo/training_summaries/"${ARCHITECTURE}" \
    --output_graph=aud_emo/retrained_graph.pb \
    --output_labels=aud_emo/retrained_labels.txt \
    --architecture="${ARCHITECTURE}" \
    --image_dir=/Users/catherinehuang/Desktop/CS/ece496/audio_emotion_analysis/img_rep_lv2/
KatJHuang commented 5 years ago

Note to self: it would also help to generate a confusion matrix. Flag: add this feature.
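A minimal sketch of what this could look like once per-sample predictions are collected from the test split (the label names and predictions below are placeholders):

    from sklearn.metrics import confusion_matrix

    # Placeholder predictions; in practice these would be collected while
    # evaluating the retrained graph on the RAVDESS test split.
    emotion_labels = ["neutral", "calm", "happy", "sad",
                      "angry", "fearful", "disgust", "surprised"]
    y_true = ["happy", "sad", "angry", "happy"]
    y_pred = ["happy", "angry", "angry", "sad"]

    cm = confusion_matrix(y_true, y_pred, labels=emotion_labels)
    print(cm)  # rows = true emotion, columns = predicted emotion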

KatJHuang commented 5 years ago

Experimented with transfer learning on the Inception v3 model (https://www.cs.unc.edu/~wliu/papers/GoogLeNet.pdf). It has a really nice architecture that takes away the conundrum of choosing sizes for convolution kernels. Test accuracy is around 50%, and training accuracy climbs to almost 90% when using everything up to the second-to-last layer of the model. Somewhat better than the MobileNet experiment. Legend (i.e., other research papers) has it that around 80% test accuracy is achievable with other architectures such as VGG16/19, ResNet, etc.

Then, to test the hypothesis that using fewer layers prevents irrelevant high-level feature learning, I conducted another experiment that put our custom NN on top of the first pooling layer. This actually worsened the prediction accuracy; the cross-entropy loss was not converging. This can be attributed to the gargantuan number of features coming out of the low-level convolution/pooling layers: the first pooling layer unravels to around 55k features (192 feature maps of 17x17), which is rather demanding for our dataset of roughly 1k samples. So this approach leads us into underfitting.
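Back-of-the-envelope check of that dimensionality (the ~1k sample count is the rough figure quoted above):

    # Features exposed at the early pooling layer the custom head was attached to.
    feature_maps, height, width = 192, 17, 17
    n_features = feature_maps * height * width
    print(n_features)              # 55488 inputs per spectrogram

    # Compared against the (rough) number of training samples.
    n_samples = 1000
    print(n_features / n_samples)  # ~55 features per sample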

(The above experiment was a nice coding challenge nonetheless. The Python script provided by TensorFlow contains some ~1k lines of code, so it was cool to get synced with it. Then it turned out I only needed to add two additional lines of code to achieve my purpose.)

Sidenotes on this experiment:

While seeking improvement from other avenues, I came across a few new papers:

And it caught my attention that the colors in which our spectrograms are saved apparently have an impact on the accuracy of CNN models. Paper 1 actually found that RGB produced higher accuracy than these colors individually or the grayscale image. This is curious. Maybe my grayscale spectrograms are the culprit. I will conduct another experiment to verify whether this is true.
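If I test the grayscale hypothesis, the change would be roughly this (a sketch assuming matplotlib colormaps; "magma" is just one possible RGB colormap):

    import matplotlib.pyplot as plt

    def save_spectrogram(mel_db, png_path, rgb=True):
        # Render the same mel spectrogram (in dB) with an RGB colormap or in grayscale.
        cmap = "magma" if rgb else "gray"
        plt.figure(figsize=(2.24, 2.24), dpi=100)
        plt.imshow(mel_db, origin="lower", aspect="auto", cmap=cmap)
        plt.axis("off")
        plt.savefig(png_path, bbox_inches="tight", pad_inches=0)
        plt.close()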

KatJHuang commented 5 years ago

(Sidenote: the discussion in the Inception v3 paper has a lot to say about the sparsity of features and how convolution reduces it. Reminds me of 367, a lot. Perhaps it's time to get a good foundation.)

KatJHuang commented 5 years ago

Meeting minutes from Oct 30

KatJHuang commented 5 years ago

Oct 30 ToDo:

rightnknow commented 5 years ago

https://arxiv.org/ftp/arxiv/papers/1707/1707.09917.pdf Check this paper, I think it's worth a read.

KatJHuang commented 5 years ago

CNN code is on https://github.com/KatJHuang/tensorflow-for-poets-2/tree/prevent-overfit-w-dropout. You can git clone https://github.com/KatJHuang/tensorflow-for-poets-2.git and git checkout prevent-overfit-w-dropout to get all the relevant source files. The branch contains scripts to serialize spectrogram data, run training, and make a prediction.

Before you run training, make sure your spectrograms are already serialized into a TFRecords file. You can serialize the labelled spectrogram images using the image2tfrecords.py script as follows: python image2tfrecords.py $LABELLED_SPECTROGRAM_DIR $SERIALIZATION_OUTPUT_DIR

To obtain a folder of labelled spectrogram images, please consult the comment above.
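For reference, the serialization boils down to something like the sketch below (a generic example of writing image/label pairs into a TFRecord; the feature keys are illustrative and may not match what image2tfrecords.py actually uses):

    import os
    import tensorflow as tf

    def serialize_dir(labelled_dir, output_path):
        # labelled_dir is expected to contain one sub-folder per emotion label,
        # as produced by waveform2img.py.
        labels = sorted(os.listdir(labelled_dir))
        with tf.io.TFRecordWriter(output_path) as writer:
            for label_idx, label in enumerate(labels):
                label_dir = os.path.join(labelled_dir, label)
                for fname in sorted(os.listdir(label_dir)):
                    with open(os.path.join(label_dir, fname), "rb") as f:
                        png_bytes = f.read()
                    example = tf.train.Example(features=tf.train.Features(feature={
                        "image/encoded": tf.train.Feature(
                            bytes_list=tf.train.BytesList(value=[png_bytes])),
                        "image/label": tf.train.Feature(
                            int64_list=tf.train.Int64List(value=[label_idx])),
                    }))
                    writer.write(example.SerializeToString())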

To run training, use the retrain.py script.

Input: Spectrograms serialized into the TFRecords format

Output: A retrained graph (retrained_graph.pb) and a labels file (retrained_labels.txt), at the paths set by the flags below

Command: Assuming your pwd is the tensorflow-for-poets-2/scripts/ folder, where the retrain.py script is located, you can run

    python retrain.py \
    --bottleneck_dir=aud_emo/bottlenecks \
    --model_dir=aud_emo/models/ \
    --summaries_dir=aud_emo/training_summaries/inception_v3_standard \
    --output_graph=aud_emo/retrained_graph.pb \
    --output_labels=aud_emo/retrained_labels.txt \
    --architecture=inception_v3 \
    --image_records_dir=$SERIALIZATION_OUTPUT_DIR \
    --how_many_training_steps=1000

And training should start. If you are interested in the internals of the training process, you can delve into the retrain.py code. The code adapted to our purpose is in the function where the extra fully connected layer is added.
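Conceptually, that part of the script does something like the following (a simplified TF1-style sketch: dropout on the Inception v3 bottleneck features followed by a single fully connected softmax layer; variable and tensor names are illustrative, not necessarily the ones in retrain.py):

    import tensorflow as tf

    BOTTLENECK_SIZE = 2048   # Inception v3 bottleneck width
    NUM_CLASSES = 8          # RAVDESS emotion labels

    # Bottleneck features computed by the frozen Inception v3 graph.
    bottleneck_input = tf.placeholder(tf.float32, [None, BOTTLENECK_SIZE])
    ground_truth = tf.placeholder(tf.int64, [None])
    keep_prob = tf.placeholder(tf.float32)

    # Dropout on the bottlenecks (the "prevent-overfit-w-dropout" idea),
    # followed by one fully connected softmax layer.
    dropped = tf.nn.dropout(bottleneck_input, keep_prob=keep_prob)
    weights = tf.Variable(
        tf.truncated_normal([BOTTLENECK_SIZE, NUM_CLASSES], stddev=0.001))
    biases = tf.Variable(tf.zeros([NUM_CLASSES]))
    logits = tf.matmul(dropped, weights) + biases
    final_tensor = tf.nn.softmax(logits, name="final_result")

    loss = tf.reduce_mean(
        tf.nn.sparse_softmax_cross_entropy_with_logits(
            labels=ground_truth, logits=logits))
    train_step = tf.train.GradientDescentOptimizer(0.01).minimize(loss)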