capstone496 / SpeechSentiments


[Progress Report] Emotion Detection with CNN transfer learning approach #5

Open KatJHuang opened 5 years ago

KatJHuang commented 5 years ago

This thread is created to track the progress in the CNN/transfer learning research.


Found an implementation/tutorial on CNN transfer learning and experimented with it on our RAVDESS dataset. With no modification to their implementation, we can achieve an accuracy of 42.4%.

As a quick test, this accuracy is decent, but further tuning of the model should improve the result. The current model uses all of the conv/pooling layers of MobileNet, which are better suited to real-world images than to our black-and-white images of wavy lines. That many layers are good at modeling the complicated structures in real images but are overkill for our purposes. We can try using fewer layers and get better modeling behavior.
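For illustration, here is a minimal tf.keras sketch of the "fewer layers" idea: load the same MobileNet backbone (width multiplier 0.5, 224x224 input) without its classifier, cut it at an intermediate layer, and train a small head on top. The cut-point layer name and the 8-class head are assumptions for illustration only; the actual experiment ran the codelab's retrain.py unmodified.

    import tensorflow as tf

    # MobileNet backbone used by the codelab (alpha=0.5, 224x224 input),
    # without its ImageNet classification head.
    base = tf.keras.applications.MobileNet(
        input_shape=(224, 224, 3), alpha=0.5, include_top=False, weights="imagenet")

    # Hypothetical cut point roughly halfway through the conv/pooling stack;
    # the layer name is illustrative and depends on the Keras version.
    cut = base.get_layer("conv_pw_7_relu").output
    truncated = tf.keras.Model(inputs=base.input, outputs=cut)
    truncated.trainable = False  # keep the pretrained filters frozen

    # Small trainable head for the 8 RAVDESS emotion classes.
    model = tf.keras.Sequential([
        truncated,
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.Dense(8, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])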

Tutorial link: https://codelabs.developers.google.com/codelabs/tensorflow-for-poets/#0 Please take a look at it. It's very helpful for understanding what's happening here :)

Result

Dataset: RAVDESS
Test accuracy: 42.4% after 4,000 iterations
Log line: INFO:tensorflow:Final test accuracy = 42.4% (N=139)


train_report_no_mod.pdf

Data preparation

  1. Trim silence with voiced_segment_extractor.py
  2. Convert the waveforms to an image representation (mel spectrograms in this case) with waveform2img.py and save them to the directory structure expected by the implementation in the tutorial above. Both scripts can be found in my experimentation repo; a rough sketch of these two steps is shown right after this list.
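The two preparation steps roughly amount to the sketch below (using librosa's energy-based trim and matplotlib; the actual voiced_segment_extractor.py / waveform2img.py scripts may differ in details such as the silence-detection method and image size):

    import librosa
    import librosa.display
    import matplotlib.pyplot as plt
    import numpy as np

    def wav_to_melspectrogram_image(wav_path, png_path):
        # 1. Load audio and trim leading/trailing silence (simple energy-based
        #    trim here; the real pipeline uses voiced_segment_extractor.py).
        y, sr = librosa.load(wav_path, sr=None)
        y, _ = librosa.effects.trim(y, top_db=30)

        # 2. Compute a mel spectrogram in dB and save it as a PNG.
        mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
        mel_db = librosa.power_to_db(mel, ref=np.max)

        plt.figure(figsize=(2.24, 2.24), dpi=100)  # ~224x224 px for MobileNet
        librosa.display.specshow(mel_db, sr=sr)
        plt.axis("off")
        plt.savefig(png_path, bbox_inches="tight", pad_inches=0)
        plt.close()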

Steps to reproduce

(assuming RAVDESS speech dataset is already downloaded):

  1. Trim silence and save the shortened audio file: python voiced_segment_extractor.py $aggressive_lvl --mode dir $input_dir --output $output_dir
  2. Convert trimmed audio waveforms into spectrograms and group them by their emotion labels: python waveform2img.py $input_dir $output_dir
  3. Get the transfer learning implementation from the TensorFlow repo, make NO modification to the Python code, and run the training commands verbatim (except using my own data folders from step 2, of course).
    IMAGE_SIZE=224
    ARCHITECTURE="mobilenet_0.50_${IMAGE_SIZE}"
    python -m scripts.retrain \
    --bottleneck_dir=aud_emo/bottlenecks \
    --model_dir=aud_emo/models/ \
    --summaries_dir=aud_emo/training_summaries/"${ARCHITECTURE}" \
    --output_graph=aud_emo/retrained_graph.pb \
    --output_labels=aud_emo/retrained_labels.txt \
    --architecture="${ARCHITECTURE}" \
    --image_dir=/Users/catherinehuang/Desktop/CS/ece496/audio_emotion_analysis/img_rep_lv2/
KatJHuang commented 5 years ago

Note to self: it would also help to generate a confusion matrix. Flag: add this feature.
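A minimal sketch of what this could look like once per-sample predictions are collected from the test split (the label names and predictions below are placeholders):

    from sklearn.metrics import confusion_matrix

    # Placeholder predictions; in practice these would be collected while
    # evaluating the retrained graph on the RAVDESS test split.
    emotion_labels = ["neutral", "calm", "happy", "sad",
                      "angry", "fearful", "disgust", "surprised"]
    y_true = ["happy", "sad", "angry", "happy"]
    y_pred = ["happy", "angry", "angry", "sad"]

    cm = confusion_matrix(y_true, y_pred, labels=emotion_labels)
    print(cm)  # rows = true emotion, columns = predicted emotion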

KatJHuang commented 5 years ago

Experimented with transfer learning on the Inception v3 model (https://www.cs.unc.edu/~wliu/papers/GoogLeNet.pdf). It has a really nice architecture that takes away the conundrum of choosing sizes for convolution kernels. Test accuracy is around 50%, and training accuracy climbs to almost 90% when using everything up to the second-to-last layer of the model. Somewhat better than the MobileNet experiment. Legend (i.e., other research papers) has it that around 80% test accuracy is achievable with other architectures such as VGG16/19, ResNet, etc.

Then, to test the hypothesis that using fewer layers prevents irrelevant high-level feature learning, I conducted another experiment that put our custom NN on top of the first pooling layer. This actually worsened the prediction accuracy; the cross-entropy loss was not converging. This can be attributed to the gargantuan number of features coming out of the low-level convolution/pooling layers: the first pooling layer unravels to around 55k features (192 feature maps of 17x17), which is rather demanding for our dataset of roughly 1k samples. So this approach leads us into underfitting.
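Back-of-the-envelope check of that dimensionality (the ~1k sample count is the rough figure quoted above):

    # Features exposed at the early pooling layer the custom head was attached to.
    feature_maps, height, width = 192, 17, 17
    n_features = feature_maps * height * width
    print(n_features)              # 55488 inputs per spectrogram

    # Compared against the (rough) number of training samples.
    n_samples = 1000
    print(n_features / n_samples)  # ~55 features per sample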

(The above experiment was a nice coding challenge nonetheless. The Python script provided by TensorFlow contains some ~1k lines of code, so it was cool to get synced with it. Then it turned out I only needed to add two additional lines of code to achieve my purpose.)

Sidenotes on this experiment:

While seeking improvement from other avenues, I came across a few new papers:

And it caught my attention that the colors in which our spectrograms are saved apparently have an impact on the accuracy of CNN models. Paper 1 actually found that RGB produced higher accuracy than these colors individually or the grayscale image. This is curious. Maybe my grayscale spectrograms are the culprit. I will conduct another experiment to verify whether this is true.
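If I test the grayscale hypothesis, the change would be roughly this (a sketch assuming matplotlib colormaps; "magma" is just one possible RGB colormap):

    import matplotlib.pyplot as plt

    def save_spectrogram(mel_db, png_path, rgb=True):
        # Render the same mel spectrogram (in dB) with an RGB colormap or in grayscale.
        cmap = "magma" if rgb else "gray"
        plt.figure(figsize=(2.24, 2.24), dpi=100)
        plt.imshow(mel_db, origin="lower", aspect="auto", cmap=cmap)
        plt.axis("off")
        plt.savefig(png_path, bbox_inches="tight", pad_inches=0)
        plt.close()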

KatJHuang commented 5 years ago

(Sidenote: the discussion in the Inception v3 paper has a lot to say about the sparsity of features and how convolution reduces it. Reminds me of 367, a lot. Perhaps it's time to get a good foundation.)

KatJHuang commented 5 years ago

Meeting minutes from Oct 30

KatJHuang commented 5 years ago

Oct 30 ToDo:

rightnknow commented 5 years ago

https://arxiv.org/ftp/arxiv/papers/1707/1707.09917.pdf Check this paper, I think it's worth a read.

KatJHuang commented 5 years ago

CNN code is on https://github.com/KatJHuang/tensorflow-for-poets-2/tree/prevent-overfit-w-dropout. You can git clone https://github.com/KatJHuang/tensorflow-for-poets-2.git and git checkout prevent-overfit-w-dropout to get all the relevant source files. The branch contains scripts to serialize spectrogram data, run training, and make a prediction.

Before you run training, make sure your spectrograms are already serialized into a TFRecords file. You can serialize the labelled spectrogram images using the image2tfrecords.py script as follows: python image2tfrecords.py $LABELLED_SPECTROGRAM_DIR $SERIALIZATION_OUTPUT_DIR

To obtain a folder of labelled spectrogram images, please consult the comment above.
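For reference, the serialization boils down to something like the sketch below (a generic example of writing image/label pairs into a TFRecord; the feature keys are illustrative and may not match what image2tfrecords.py actually uses):

    import os
    import tensorflow as tf

    def serialize_dir(labelled_dir, output_path):
        # labelled_dir is expected to contain one sub-folder per emotion label,
        # as produced by waveform2img.py.
        labels = sorted(os.listdir(labelled_dir))
        with tf.io.TFRecordWriter(output_path) as writer:
            for label_idx, label in enumerate(labels):
                label_dir = os.path.join(labelled_dir, label)
                for fname in sorted(os.listdir(label_dir)):
                    with open(os.path.join(label_dir, fname), "rb") as f:
                        png_bytes = f.read()
                    example = tf.train.Example(features=tf.train.Features(feature={
                        "image/encoded": tf.train.Feature(
                            bytes_list=tf.train.BytesList(value=[png_bytes])),
                        "image/label": tf.train.Feature(
                            int64_list=tf.train.Int64List(value=[label_idx])),
                    }))
                    writer.write(example.SerializeToString())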

To run training, use the retrain.py script.

Input: Spectrograms serialized into the TFRecords format

Output: A retrained graph (retrained_graph.pb) and a labels file (retrained_labels.txt), at the paths set by the flags below

Command: Assuming your pwd is the tensorflow-for-poets-2/scripts/ folder, where the retrain.py script is located, you can run

    python retrain.py \
    --bottleneck_dir=aud_emo/bottlenecks \
    --model_dir=aud_emo/models/ \
    --summaries_dir=aud_emo/training_summaries/inception_v3_standard \
    --output_graph=aud_emo/retrained_graph.pb \
    --output_labels=aud_emo/retrained_labels.txt \
    --architecture=inception_v3 \
    --image_records_dir=$SERIALIZATION_OUTPUT_DIR \
    --how_many_training_steps=1000

And training should start. If you are interested in the internals of the training process, you can delve into the retrain.py code. The code adapted to our purpose is in the function where the extra fully connected layer is added.
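Conceptually, that part of the script does something like the following (a simplified TF1-style sketch: dropout on the Inception v3 bottleneck features followed by a single fully connected softmax layer; variable and tensor names are illustrative, not necessarily the ones in retrain.py):

    import tensorflow as tf

    BOTTLENECK_SIZE = 2048   # Inception v3 bottleneck width
    NUM_CLASSES = 8          # RAVDESS emotion labels

    # Bottleneck features computed by the frozen Inception v3 graph.
    bottleneck_input = tf.placeholder(tf.float32, [None, BOTTLENECK_SIZE])
    ground_truth = tf.placeholder(tf.int64, [None])
    keep_prob = tf.placeholder(tf.float32)

    # Dropout on the bottlenecks (the "prevent-overfit-w-dropout" idea),
    # followed by one fully connected softmax layer.
    dropped = tf.nn.dropout(bottleneck_input, keep_prob=keep_prob)
    weights = tf.Variable(
        tf.truncated_normal([BOTTLENECK_SIZE, NUM_CLASSES], stddev=0.001))
    biases = tf.Variable(tf.zeros([NUM_CLASSES]))
    logits = tf.matmul(dropped, weights) + biases
    final_tensor = tf.nn.softmax(logits, name="final_result")

    loss = tf.reduce_mean(
        tf.nn.sparse_softmax_cross_entropy_with_logits(
            labels=ground_truth, logits=logits))
    train_step = tf.train.GradientDescentOptimizer(0.01).minimize(loss)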