This project presents a new approach to the long-standing problem of Speech-Music Discrimination. To the best of our knowledge, the proposed method provides state-of-the-art results on the task. We employ a Deep Convolutional Neural Network (CNN) and offer a compact framework for segmentation and binary (Speech/Music) classification, exploiting the benefits of transferring knowledge from architectures pretrained on ImageNet. Our method is unchained from traditional audio features, which yield inferior results on this task. Instead, it exploits the highly invariant features produced by CNNs and operates on pseudocolored RGB or grayscale frequency images that represent audio segments.
*The dataset includes speech-only, music-only, and overlapping speech-music audio samples - for further details see the paper
The repository consists of the following modules:
Dependencies
* Detailed installation instructions are provided in the links above
Add Caffe to your working directory,
or add pycaffe to your .bashrc for directory-independent access:
To edit the .bashrc file located in your home directory, type the following in a terminal:
cd ~
to navigate to your home directory,
ls -a
to see the file listed, and
nano .bashrc
to open the file in the terminal. Then append:
export PYTHONPATH=$PYTHONPATH:"/home/--myPathToCaffe--/caffe/python"
, where --myPathToCaffe-- is the path to the Caffe library on your local machine,
e.g.: export PYTHONPATH=$PYTHONPATH:"/home/michalis/Libraries/caffe/python"
Finally, run
source ~/.bashrc
to update your shell environment.
Convert your audio files into pseudocolored RGB or grayscale spectrogram images using generateSpectrograms.py. TO BE UPDATED: a) how to run, b) how to set segmentation parameters, c) what the output looks like.
Split the spectrogram images into train and test as shown in Fig1:
Data should be pseudo-colored RGB spectrogram images of size 227x227 as shown in Fig2
or grayscale spectrogram images of size 200x200 as shown in Fig3
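Since the usage of generateSpectrograms.py is still marked TO BE UPDATED, the following is a minimal sketch of the general idea, assuming a raw audio signal as a NumPy array: compute the magnitude STFT, apply log compression, resize to the CNN input size, and normalize to 8-bit grayscale pixels. The function name `wav_to_spectrogram` and all parameter values are illustrative, not the script's actual interface.

```python
import numpy as np

def wav_to_spectrogram(signal, n_fft=512, hop=160, out_size=(227, 227)):
    """Sketch: turn a 1-D audio signal into a grayscale spectrogram image."""
    # frame the signal and compute the magnitude STFT
    n_frames = 1 + (len(signal) - n_fft) // hop
    window = np.hanning(n_fft)
    frames = np.stack([signal[i * hop : i * hop + n_fft] for i in range(n_frames)])
    spec = np.abs(np.fft.rfft(frames * window, axis=1)).T  # (freq, time)
    # log compression improves the dynamic range of the image
    spec = np.log1p(spec)
    # nearest-neighbour resize to the CNN input size
    rows = np.linspace(0, spec.shape[0] - 1, out_size[0]).astype(int)
    cols = np.linspace(0, spec.shape[1] - 1, out_size[1]).astype(int)
    img = spec[np.ix_(rows, cols)]
    # normalise to 0-255 grayscale pixel values
    img = (255 * (img - img.min()) / max(np.ptp(img), 1e-9)).astype(np.uint8)
    return img
```

For the RGB variant, the grayscale image would additionally be mapped through a pseudocolor colormap to three channels before being fed to the 227x227 network input.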
Train a CNN
Train
Training can be done either by training a new network from scratch or by fine-tuning a pretrained architecture.
The pretrained model used in the paper for fine-tuning is the caffe_imagenet_hyb2_wr_rc_solver_sqrt_iter_310000 initially proposed in Donahue, Jeffrey, et al. "Long-term recurrent convolutional networks for visual recognition and description." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015. To exploit the weight initialization of the pretrained model use the CNN architecture shown in SpeechMusic_RGB.prototxt.
If you wish to deploy the smaller CNN architecture that operates on grayscale images you should use the CNN architecture shown in SpeechMusic_GRAY.prototxt. This model was trained from scratch without weight initialization.
Train from scratch:
python trainCNN.py <architecture_file>.prototxt <path_to_train_data_root_folder> <path_to_test_data_root_folder> <snapshot_prefix> <total_number_of_iterations>
Finetune pretrained network:
python trainCNN.py <architecture_file>.prototxt <path_to_train_data_root_folder> <path_to_test_data_root_folder> <snapshot_prefix> <total_number_of_iterations> --init <pretrained_network>.caffemodel --init_type fin
Resume Training:
python trainCNN.py <architecture_file>.prototxt <path_to_train_data_root_folder> <path_to_test_data_root_folder> <snapshot_prefix> <total_number_of_iterations> --init <pretrained_network>.solverstate --init_type res
For more details about modifying other learning parameters (e.g. learning rate, step size, etc.) type:
python trainCNN.py -h
Train HMM
python ClassifyWav.py trainHMM <path_to_test_data> <hmm_model_name> <core_classification_method> <trained_network> <classification_type_flag>
*This step requires an already trained CNN
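The HMM internals of ClassifyWav.py are not documented here, but the idea behind HMM-based post-processing is to smooth the per-frame CNN posteriors by penalizing class switches. Below is a minimal 2-state Viterbi sketch of that idea; the function name, the uniform prior, and the transition probabilities are illustrative assumptions, not the script's actual code.

```python
import numpy as np

def viterbi_binary(post, stay=0.99):
    """Decode the most likely Speech/Music state path from frame posteriors.

    post: (T, 2) array of per-frame class probabilities from the CNN.
    stay: self-transition probability; higher values smooth more.
    """
    log_trans = np.log(np.array([[stay, 1 - stay], [1 - stay, stay]]))
    log_emit = np.log(post + 1e-12)
    T = len(post)
    delta = np.zeros((T, 2))                # best log-score ending in each state
    back = np.zeros((T, 2), dtype=int)      # backpointers
    delta[0] = np.log(0.5) + log_emit[0]    # uniform prior over the two states
    for t in range(1, T):
        for j in range(2):
            scores = delta[t - 1] + log_trans[:, j]
            back[t, j] = np.argmax(scores)
            delta[t, j] = scores[back[t, j]] + log_emit[t, j]
    # backtrack the optimal state sequence
    path = np.zeros(T, dtype=int)
    path[-1] = np.argmax(delta[-1])
    for t in range(T - 2, -1, -1):
        path[t] = back[t + 1, path[t + 1]]
    return path
```

With stay=0.99 a single-frame outlier cannot outweigh the two transition penalties it would incur, so isolated misclassifications are absorbed into the surrounding class.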
To support a GPU implementation, change line 9 of trainCNN.py to caffe.set_mode_gpu()
python ClassifyWav.py evaluate <path_to_test_wav_files> <trained_network>.caffemodel <classification_method> <classification_type_flag> ""
python ClassifyWav.py evaluate <path_to_test_wav_files> <trained_network>-5000.caffemodel <core_classification_method> <classification_type_flag> <hmm_model_name>
To support a GPU implementation, change line 17 of ClassifyWav.py to caffe.set_mode_gpu()
Generate Spectrogram Images:
Train from scratch:
python trainCNN.py SpeechMusic_RGB.prototxt Train Test myOutput 4000
Finetune pretrained network (train and test paths are according to Fig1):
python trainCNN.py SpeechMusic_RGB.prototxt Train Test myOutput 1000 --init caffe_imagenet_hyb2_wr_rc_solver_sqrt_iter_310000.caffemodel --init_type fin
Resume training from pretrained network (train and test paths are according to Fig1):
python trainCNN.py SpeechMusic_RGB.prototxt Train Test my_new_Output 2000 --init myOutput.solverstate --init_type res
Evaluate a trained CNN on .wav file(s) without preprocessing:
python ClassifyWav.py evaluate Data/testWavs CNN-SM-5000.caffemodel cnn 0 ""
Evaluate a trained CNN on .wav file(s) with preprocessing:
python ClassifyWav.py evaluate Data/testWavs CNN-SM-5000.caffemodel cnn 1 ""
Train an HMM after applying median filtering:
python ClassifyWav.py trainHMM Data/testWavs hmm1 cnn CNN-SM-5000.caffemodel 1
Test using pretrained HMM:
python ClassifyWav.py evaluate Data/testWavs CNN-SM-5000.caffemodel cnn 2 hmm1
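The median-filtering step applied before HMM training (classification_type_flag set to 1 above) can be illustrated with a short sketch. `median_smooth` is a hypothetical helper written for this README, not the actual code in ClassifyWav.py:

```python
import numpy as np

def median_smooth(labels, win=5):
    """Median-filter a sequence of frame-level class labels (0/1)."""
    half = win // 2
    # pad with edge values so the output length matches the input
    padded = np.pad(labels, half, mode="edge")
    return np.array([int(np.median(padded[i:i + win])) for i in range(len(labels))])

# noisy per-frame predictions (0 = speech, 1 = music)
raw = np.array([0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1])
print(median_smooth(raw))  # -> [0 0 0 0 1 1 1 1 1 1 1]
```

Isolated flips are removed before the smoothed sequence is used to fit the HMM.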
A model pretrained on this task using pseudo-colored RGB images, along with its solverstate, can be found here
We provide a new method for the task of Speech/Music Discrimination using Convolutional Neural Networks. The main contributions of this work are the following:
A compact framework for audio segmentation and binary (Speech/Music) classification.
A large dataset of long audio streams (more than 10 hours) for the task of speech-music discrimination. The dataset is provided in the form of spectrograms.
Two different pretrained CNN architectures that can be used for weight initialization for other binary classification tasks.
To the best of our knowledge, our method provides state-of-the-art results on the task.
If you found our project useful, please cite the following publications:
CNNs: Speech-Music-Discrimination
@article{papakostas2018speech,
  title={Speech-Music Discrimination Using Deep Visual Feature Extractors},
  author={Papakostas, Michalis and Giannakopoulos, Theodoros},
  journal={Expert Systems with Applications},
  year={2018},
  publisher={Elsevier}
}
pyAudioAnalysis
@article{giannakopoulos2015pyaudioanalysis,
  title={pyAudioAnalysis: An Open-Source Python Library for Audio Signal Analysis},
  author={Giannakopoulos, Theodoros},
  journal={PloS one},
  volume={10},
  number={12},
  year={2015},
  publisher={Public Library of Science}
}
Caffe Framework
@article{jia2014caffe,
  title={Caffe: Convolutional Architecture for Fast Feature Embedding},
  author={Jia, Yangqing and Shelhamer, Evan and Donahue, Jeff and Karayev, Sergey and Long, Jonathan and Girshick, Ross and Guadarrama, Sergio and Darrell, Trevor},
  journal={arXiv preprint arXiv:1408.5093},
  year={2014}
}
If you used the pretrained network caffe_imagenet_hyb2_wr_rc_solver_sqrt_iter_310000 for your experiments, please also cite:
@inproceedings{donahue2015long,
  title={Long-term recurrent convolutional networks for visual recognition and description},
  author={Donahue, Jeffrey and Anne Hendricks, Lisa and Guadarrama, Sergio and Rohrbach, Marcus and Venugopalan, Subhashini and Saenko, Kate and Darrell, Trevor},
  booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},
  pages={2625--2634},
  year={2015}
}