adbailey1 / DepAudioNet_reproduction

Reproduction of DepAudioNet by Ma et al. (DepAudioNet: An Efficient Deep Model for Audio based Depression Classification, https://dl.acm.org/doi/10.1145/2988257.2988267, AVEC 2016)

Will the training set be too small when CROP is set to true? #5

Open kingformatty opened 3 years ago

kingformatty commented 3 years ago

Hi Andrew, I have played with this code for a while and made some modifications myself. I have a question about the training set size for one experiment run-through. If we set config.EXPERIMENT_DETAILS['CROP'] to True, there are only 1926 segments in the training set for each model, where 1926 = 107 (the number of training folders/utterances) * 18 (the smallest number of segments among all utterances). However, there are actually over 10000 segments available in the entire training set. And if we also set config.EXPERIMENT_DETAILS['SUB_SAMPLE_ND_CLASS'] to True, the training set shrinks to 468 + 468, where 468 / 2 = 234 is the smallest count among zero-male, zero-female, one-male, and one-female. At this point the training set has already shrunk to less than 1/10 of the available segments. Is there any reason to do so?

By the way, I tried using all depressed segments and sub-sampling the non-depressed class without cropping, which results in 4249 + 4249 samples, but the performance on the depression class seems extremely weak. I haven't figured out whether it's a problem with my implementation or whether it's inherent.

Looking forward to your reply. Thank you very much.

Best Jinhan

adbailey1 commented 3 years ago

Hi Jinhan,

Thanks for the question. In short, I believe that even with all the samples included, this is a small dataset, which is a definite challenge.

The reasoning behind cropping and sub-sampling was mainly to follow the approach of Ma et al. in their model DepAudioNet. We can see that every interview file has a different length. If we don't tackle this in some way, the model may learn from the overrepresentation of certain individuals in the dataset. Ma et al. chose to crop every file to the same length.

We can also see that there is a class imbalance in the dataset (more non-depressed than depressed). Again, there are several ways of handling this problem, but Ma et al. chose to go with random sub-sampling of the non-depressed files in the dataset. After doing this, we are left with an equal number of depressed and non-depressed instances, and every instance is the same length. I have already included an alternative way to handle the class imbalance, by using class weights. So, if you choose not to sub-sample, you will want to use class weights to reduce the impact of each non-depressed instance.
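For illustration, here is a minimal PyTorch sketch of the class-weight alternative (this is not the exact code in this repo; the segment counts and the choice of loss are placeholder assumptions):

import torch
import torch.nn as nn

# Hypothetical per-class segment counts for the training split
# (non-depressed, depressed) - the real numbers come from the logger.
counts = torch.tensor([7000.0, 3000.0])

# Inverse-frequency weights so each class contributes equally to the loss.
weights = counts.sum() / (2 * counts)
criterion = nn.CrossEntropyLoss(weight=weights)

# logits: (batch, 2); labels: 0 = non-depressed, 1 = depressed
logits = torch.randn(20, 2)
labels = torch.randint(0, 2, (20,))
loss = criterion(logits, labels)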

As an aside, I believe you may be working with a slightly out-of-date implementation. I have included a sample from one of my log files below, and you can see that setting config.EXPERIMENT_DETAILS['SUB_SAMPLE_ND_CLASS']=True and config.EXPERIMENT_DETAILS['CROP']=True gives 504 non-depressed and depressed instances, not 468.

"Crop Partitions: {'fndep': [2084, 27], 'fdep': [2084, 17], 'mndep': [2084, 49], 'mdep': [2084, 14]} The dimensions of the train features are: (1926, 40, 120) The number of class zero files in the train split after segmentation are 504 The number of class one files in the train split after segmentation are 504"

For your last point, this is interesting. Initially, I would think that even though we now have more data for the depressed instances, the model is overfitting to the individuals in the training set who are depressed. From memory, I think that on average the depressed files are actually longer than the non-depressed files as well. When it comes to evaluation, the model performs worse than expected due to this overfitting through overrepresentation. This would be an interesting avenue to explore though, as I remember one paper (I can't remember the title) made a one-off comment that using more data with DAIC-WOZ didn't result in better performance.

Hope this helps, any other questions let me know.

Kind regards,

Andrew

kingformatty commented 3 years ago

Hi Andrew, thank you for your feedback, and I apologize for my late response; I have been trying to work out why the numbers of class zero and class one samples differ between your implementation and my modified one. I couldn't find the sample log file you mentioned with 504 segments each for class zero and class one. Would you mind clarifying that?

Have you ever obtained the 468 figure in a previous experiment? The 468 comes from min(male_dep, male_nondep, female_dep, female_nondep) * 2, and previously the minimum over those four subclasses was 234. A sample log file will definitely help. I appreciate it.

Regarding the imbalance issue when I set config.EXPERIMENT_DETAILS['CROP'] to False, I think the following is the reason, though it is just a guess rather than anything concrete. When we set CROP to False, we are actually using all depressed segments as the positive class, so multiple segments (samples) come from the same speaker. As stated in the original DepAudioNet paper, "longer signal of an individual may emphasize some characteristics that are person specific, which tends to deteriorate the situation". My guess is that, with CROP set to False, the model is highly likely to latch onto a specific individual's identity information instead of the dep/non-dep status. Thus, when applying the model to the validation and test sets, where none of the speakers have been seen before, it suffers from weak performance on the depression class because it is actually looking for speaker-identity information learned from the depressed speakers in the training set.

Best Jinhan

adbailey1 commented 3 years ago

Hi Jinhan,

I can't remember if I ever had 468, but I do remember that a previous iteration of my tool was incorrect. I will include the sample log and config file below, but first, to explain where the 252 comes from: when using min_crop and mel spectrograms, we have the following dictionary for min_samples, where each value is a list containing the feature length and the number of instances of that key in the data.

{'fndep': [2084, 27], 'fdep': [2084, 17], 'mndep': [2084, 49], 'mdep': [2084, 14]}

From here we can see that the subsection with the smallest number of instances is 'mdep'. Using our segment length of 120, the feature length, 2084, can be split (rounded up) into 18 segments, and 18 * 14 = 252. When we are not using gender balance, this is simply doubled -> 504, and we randomly sample from 'mdep' and 'fdep' to make up the depressed class, and do the same for the non-depressed class. When we are using gender balance, we sample 252 from every subsection to obtain 4 * 252.
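To make that arithmetic concrete, here is a small sketch using the crop dictionary above (math.ceil stands in for the rounding-up step; this is illustrative, not the repo's exact code):

import math

crops = {'fndep': [2084, 27], 'fdep': [2084, 17], 'mndep': [2084, 49], 'mdep': [2084, 14]}
segment_length = 120

# Segments obtainable from one cropped file: ceil(2084 / 120) = 18
segs_per_file = math.ceil(crops['mdep'][0] / segment_length)

# Smallest subsection is 'mdep' with 14 instances: 18 * 14 = 252 segments
smallest = min(num_files for _, num_files in crops.values())
per_subsection = segs_per_file * smallest      # 252

per_class = 2 * per_subsection                 # 504 per class without gender balance
gender_balanced_total = 4 * per_subsection     # 4 * 252 when gender balancing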

As a side note, we are actually sub-sampling the minority class here (as well as the majority), which is not done in the DepAudioNet paper. However, for us to be able to apply gender balancing fairly, we must sub-sample the depressed class as well; even so, we still retain 92% of the available samples, and we did not see any reduction in performance.

Finally, regarding your last point. Yes, I agree with your hypothesis, I was saying the same thing as you in my previous message so sorry if I wasn't being clear. However, I still think that if this sounds interesting, it would be good to make this more than just a hypothesis by testing it.

Kind regards,

Andrew

SAMPLE LOG FILE

`############################################################### EXPERIMENT DETAILS FEATURE_EXP: mel CLASS_WEIGHTS: False USE_GENDER_WEIGHTS: False SUB_SAMPLE_ND_CLASS: True CROP: True OVERSAMPLE: False SPLIT_BY_GENDER: True FEATURE_DIMENSIONS: 120 FREQ_BINS: 40 BATCH_SIZE: 20 SVN: True LEARNING_RATE: 0.001 Starting SEED: 1000 TOTAL_EPOCHS: 100 TOTAL_ITERATIONS: 3280 ITERATION_EPOCH: 1 SUB_DIR: TESTER EXP_RUNTHROUGH: 5 Current Seed: 1000 Logged into: andrew-ubuntu Experiment details: ############################################################### Optimiser: ADAM. Learning Rate: 0.001 The dimensions of the logmel features before segmentation are: [5660, 16832, 8779, 21555, 12445, 42031, 16727, 29119, 16375, 7459, 12714, 8918, 12423, 11325, 38291, 17682, 9264, 11029, 7783, 8968, 10352, 10511, 20056, 14697, 10831, 17557, 6411, 8661, 22681, 11915, 10530, 14786, 12947, 16976, 16469, 16689, 15108, 49061, 9321, 15668, 5919, 15364, 10003, 21620, 16627, 29256, 4169, 10727, 23166, 16756, 14633, 16886, 15347, 5412, 12238, 20519, 6281, 11044, 21883, 6535, 13312, 9445, 25058, 42644, 28962, 31434, 38525, 33049, 25710, 28338, 12193, 33481, 26009, 30547, 5851, 21016, 31691, 15825, 23003, 49111, 22034, 13095, 33130, 15705, 2084, 23466, 7433, 7744, 11779, 30103, 8712, 8922, 8895, 15683, 12279, 17840, 11949, 15688, 17520, 18950, 15618, 22127, 38126, 12837, 31666, 10933, 24210, 21126, 34634, 13083, 20161, 20642, 13701, 15972, 18063, 23260, 19976, 18185, 14783, 33339, 17253, 25329, 26334, 15586, 15831, 10590, 18406, 15994, 13874, 16604, 12364, 31531, 22089, 11717, 16101, 36709, 32400, 30167, 12743, 18918, 9645, 24539, 11213, 22579, 15855, 29918, 24954, 30161, 36001, 14684, 19131, 8778, 15046, 16360, 19534, 28496, 15846, 13189, 21673, 12573, 22294, 25795, 37357, 19653, 19596, 24197, 18550, 20603, 16933, 3483, 18334, 7879, 8103, 29806, 16237, 11706, 25361, 24439, 22970, 38021, 21847, 7568, 10244, 22260, 17013, 7668, 8288, 15911, 19360] Crop Partitions: {'fndep': [2084, 27], 'fdep': [2084, 17], 'mndep': [2084, 49], 'mdep': [2084, 14]} The dimensions of the train features are: (1926, 40, 120) The number of class zero files in the train split after segmentation are 504 The number of class one files in the train split after segmentation are 504 instance Weights: [1, 1] The dimensions of the dev features are: (6225, 40, 120) The number of class zero files in the dev split after segmentation are 3696 The number of class one files in the dev split after segmentation are 2529 instance Weights: [1, 1]

The per class class weights (Non-Depresed vs Depressed) are: [1, 1] Time taken for train: 0.33s at iteration: 50, epoch: 1 Time taken to evaluate dev: 0.44s`

SAMPLE CONFIG FILE



# Use this string to write a brief detail about the current experiment. This
# string will be saved in a logger for this particular experiment
EXPERIMENT_BRIEF = ''

# Set to complete to use all the data
# Set to sub to use training/dev sets only
# Network options: custom or custom_att (to use the attention mechanism)
EXPERIMENT_DETAILS = {'FEATURE_EXP': 'mel',
                      'CLASS_WEIGHTS': False,
                      'USE_GENDER_WEIGHTS': False,
                      'SUB_SAMPLE_ND_CLASS': True,  # Make len(dep) == len(
                      # ndep)
                      'CROP': True,
                      'OVERSAMPLE': False,
                      'SPLIT_BY_GENDER': True,  # Only for use in test mode
                      'FEATURE_DIMENSIONS': 120,
                      'FREQ_BINS': 40,
                      'BATCH_SIZE': 20,
                      'SVN': True,
                      'LEARNING_RATE': 1e-3,
                      'SEED': 1000,
                      'TOTAL_EPOCHS': 100,
                      'TOTAL_ITERATIONS': 3280,
                      'ITERATION_EPOCH': 1,
                      'SUB_DIR': 'this_is_a_test',
                      'EXP_RUNTHROUGH': 5}
# Determine the level of crop, min file found in training set or maximum file
# per set (ND / D) or (FND, MND, FD, MD)
MIN_CROP = True
# Determine whether the experiment is run in terms of 'epoch' or 'iteration'
ANALYSIS_MODE = 'epoch'

# How to calculate the weights: 'macro' uses the number of individual
# interviews in the training set (e.g. 31 dep / 76 non-dep), 'micro' uses the
# minimum number of segments of both classes (e.g. min_num_seg_dep=35,
# therefore every interview in depressed class will be normalised according
# to 35), 'both' combines the macro and micro via the product, 'instance'
# uses the total number of segments for each class to determine the weights (
# e.g. there could be 558 dep segs and 440 non-dep segs).
WEIGHT_TYPE = 'instance'

# Set to 'm' or 'f' to split into male or female respectively
# Otherwise set to '-' to keep both genders in the database
GENDER = '-'

# These values should be the same as those used to create the database
# If raw audio is used, you might want to set these to the conv kernel and
# stride values
WINDOW_SIZE = 1024
HOP_SIZE = 512
OVERLAP = int((HOP_SIZE / WINDOW_SIZE) * 100)
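As a rough worked example of what these settings mean in time (assuming the 16 kHz sample rate of the DAIC-WOZ audio; this is just arithmetic, not part of the config):

SAMPLE_RATE = 16000           # assumed DAIC-WOZ sample rate
HOP_SIZE = 512                # same as the config above
FEATURE_DIMENSIONS = 120      # frames per segment

# Each spectrogram frame advances by HOP_SIZE samples, so a 120-frame
# segment covers roughly 120 * 512 / 16000 ~= 3.84 seconds of audio.
segment_seconds = FEATURE_DIMENSIONS * HOP_SIZE / SAMPLE_RATE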
kingformatty commented 3 years ago

Thank you for the detailed feedback. Unfortunately, we still cannot reproduce your numbers: we get 468 samples for each class, even with the up-to-date code. I guess it might be a problem with the pre-processing; I have seen a case pointing at the database preparation part. Using the same code as in your answer:

import h5py

# Quick sanity check: print the first value of the first feature array
f1 = h5py.File('path/to/database', 'r')
feat = f1['features'][:]
print(feat[0][0][0])

Mine is also "-0.023413863". Would you mind providing any other intermediate values, just for a sanity check?

I appreciate that.

Thank you very much.

adbailey1 commented 3 years ago

This is strange. I will have a look tomorrow and get back to you. In the meantime, what dataset are you using? The original DAIC-WOZ or the extended version? Everything here is on the original DAIC-WOZ 2016 dataset, not the newer 2019.

Also, by running the following code, what is your value of b?

NOTE: I am assuming the database file is saved as 'complete_database.h5' and that we are using mel spectrogram features with 40 mel bins.

import numpy as np
import h5py

mel_bins = 40
h5 = h5py.File('complete_database.h5', 'r')
f = h5['features'][:]

# Find the shortest flattened feature array across all files...
a = 1e12
for i in f:
    if i[0].shape[0] < a:
        a = i[0].shape[0]

# ...then divide by the number of mel bins to get the number of frames
b = a / mel_bins
print(b)
I get 2084.
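As an aside, an equivalent way to do the same check, written as a small self-contained sketch:

import h5py

mel_bins = 40
with h5py.File('complete_database.h5', 'r') as h5:
    b = min(x[0].shape[0] for x in h5['features'][:]) / mel_bins
print(b)   # expected: 2084.0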

kingformatty commented 3 years ago

Thank you for your feedback. I was using the old DAIC-WOZ dataset, not the new one, and after running this code my output is also 2084. Actually, in the log file, everything before "The number of class zero files in the train split after segmentation are..." is the same.

adbailey1 commented 3 years ago

This is very strange. I have printed out some values as I debugged the code, have a look and see if your values match. Hopefully we can find something that doesn't and look around there for the error!

Organiser.py

Line 821 - organise_data()

features.shape => (107, 1)

Line 721 - determine_crops()

lengths:

{'fndep': [21555, 12445, 8661, 15364, 6535, 25710, 30547, 8712, 8922, 17520, 13701, 19976, 17253, 26334, 15831, 24539, 22579, 24954, 8778, 15046, 16360, 3483, 16237, 10244, 22260, 17013, 15911], 'mndep': [42031, 12714, 12423, 11325, 17682, 9264, 11029, 7783, 20056, 10831, 6411, 22681, 16976, 15108, 5919, 10003, 6281, 11044, 25058, 42644, 31434, 33049, 28338, 12193, 5851, 23003, 33130, 2084, 8895, 17840, 15688, 15972, 10590, 18406, 15994, 31531, 16101, 9645, 11213, 15855, 19534, 12573, 22294, 19596, 20603, 18334, 7879, 11706, 7568], 'mdep': [8968, 10530, 15668, 21620, 4169, 12238, 20519, 49111, 24210, 13083, 15586, 12364, 29918, 15846], 'fdep': [10352, 10511, 17557, 9321, 16627, 10727, 16756, 14633, 16886, 15347, 9445, 33481, 21016, 23466, 18950, 20642, 12743]}

Line 675 - crop_sections()

fndep_crop = 3483
mndep_crop = 2084
fdep_crop = 9321
mdep_crop = 4169

Line 702 - crop_sections()

min_crop_val = 2084

Line 716 - crop_sections()

crops = {'fndep': [2084, 27], 'fdep': [2084, 17], 'mndep': [2084, 49], 'mdep': [2084, 14]}

Line 139 - process_data()

IN FOR LOOP

temp_length = 486 ... 306 ... 882 ... 252
length = 1926

Line 201 - process_data()

IN FOR LOOP

new_features.shape = 18, 40, 120

Line 244 - process_data()

update_features.shape = 1926, 40, 120

Line 409 - data_info()

len(zeros_index_f) = 486
len(zeros_index_m) = 882
len(ones_index_f) = 306
len(ones_index_m) = 252

Line 412 - data_info()

min_set = 504
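For reference, a short sketch that reproduces these counts from the crop dictionary (the 18 is ceil(2084 / 120), as explained earlier; illustrative only, not the repo's exact code):

import math

crops = {'fndep': [2084, 27], 'fdep': [2084, 17], 'mndep': [2084, 49], 'mdep': [2084, 14]}
segs = math.ceil(2084 / 120)               # 18 segments per cropped file

zeros_index_f = crops['fndep'][1] * segs   # 27 * 18 = 486
zeros_index_m = crops['mndep'][1] * segs   # 49 * 18 = 882
ones_index_f = crops['fdep'][1] * segs     # 17 * 18 = 306
ones_index_m = crops['mdep'][1] * segs     # 14 * 18 = 252

min_set = 2 * min(zeros_index_f, zeros_index_m,
                  ones_index_f, ones_index_m)   # 2 * 252 = 504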

FromNature commented 3 years ago

Hello Andrew, I've been working on speech-based depression recognition, in line with your work. By splitting each person's speech into small samples, I extracted 39-dimensional MFCC features and used a 1D-CNN for classification. However, my classification results are not ideal so far, and I don't know whether the problem is the network or the MFCCs. I looked at your code and saw that you also extract MFCC features, but why didn't you end up using them? Do MFCC features not work well? Thank you for your reply.

adbailey1 commented 3 years ago

Hi Evergreen0,

I added the MFCC implementation section recently as it is quite commonly used in emotion recognition, which is my current topic of interest. So yes, feel free to try using MFCC and play around with the hyperparameters/network architecture; this tool is here to help people start from a baseline as I created it to try and emulate the DepAudioNet paper (which used mel spectrogram features). I explored the dataset with variations on this type of feature along with raw-audio.

Some features may not work at all, others may require architecture changes to get them to perform well, the dataset organisers also extract a number of features that may perform well, so there is a lot of scope to explore and research.

Hope that answers your question and gives you an insight into my design choices!

FromNature commented 3 years ago

Thank you for your reply! As you said, I am trying to use MFCCs for depression recognition at the moment. The specific approach is to remove the silence and Ellie's voice, split the audio into 7 s samples, extract MFCCs, and use a 1D-CNN for classification. However, the current results are unfortunately not ideal. I don't know how to solve this problem; should I give up on MFCCs and use spectrograms instead? Do you have any good suggestions? Thanks for your reply! From someone who is just starting out in research.

adbailey1 commented 2 years ago

You are welcome to use my pre-processing tool to remove Ellie and resolve several issues surrounding the dataset: https://github.com/adbailey1/daic_woz_process.

I would suggest trying different lengths of audio (3 s, 5 s, 7 s, 9 s) as input to the network and a different number of MFCC coefficients, and try adding the deltas and double deltas as well. Try different learning rates (and when to update the learning rate), and have a play around with the architecture.
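For example, a minimal librosa sketch of MFCCs with deltas and double deltas (the file path, 16 kHz sample rate, and 40 coefficients are placeholder assumptions, not values from this repo):

import librosa
import numpy as np

y, sr = librosa.load('participant_audio.wav', sr=16000)   # hypothetical file

# 40 MFCCs using the same window/hop sizes as the config above
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40, n_fft=1024, hop_length=512)

delta = librosa.feature.delta(mfcc)             # first-order deltas
delta2 = librosa.feature.delta(mfcc, order=2)   # double deltas

features = np.concatenate([mfcc, delta, delta2], axis=0)   # (120, n_frames)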

I would also recommend reading some papers about other types of audio classification and see what they do in order to implement their techniques to help train the model.

It's really up to you how long you try to get things to work, sometimes spending a lot of time trying different experiments is beneficial, other times it's better to move on (for example you could look at the textual or visual data for the dataset).

FromNature commented 2 years ago

Thanks for your reply! It's very kind of you to share your code. I will study it carefully, and I look forward to more academic achievements from you.

杜铭浩

b5y commented 2 years ago

Hi Andrew,

I am also getting over-fitting (working with the audio files for now). Here are the steps:

1) fetch the data (the old version, 2017) from https://dcapswoz.ict.usc.edu/wwwdaicwoz/
2) move audio_feature_extractor.py to daic_woz_process
3) run python -m run under the daic_woz_process folder (code is here)
4) run the Mel models (CustomMel7 or CustomMel15)

The difference between the validation loss and the training loss is huge (see the attached picture).

Is there anything I am missing? Please correct me if something is wrong.

Key-value pairs in configuration file:

EXPERIMENT_DETAILS = {'FEATURE_EXP': 'logmel',
                      'CLASS_WEIGHTS': False,
                      'USE_GENDER_WEIGHTS': False,
                      'SUB_SAMPLE_ND_CLASS': True,  # Make len(dep) == len(
                      # ndep)
                      'CROP': True,
                      'OVERSAMPLE': False,
                      'SPLIT_BY_GENDER': False,  # Only for use in test mode
                      'DATASET_IS_BACKGROUND': False,
                      'FEATURE_DIMENSIONS': 120,
                      'FREQ_BINS': 40,
                      'BATCH_SIZE': 20,
                      'SNV': True,
                      'LEARNING_RATE': 1e-3,
                      'SEED': 1000,
                      'TOTAL_EPOCHS': 100,  # TODO: Change this once finish the code
                      'TOTAL_ITERATIONS': 3280,
                      'ITERATION_EPOCH': 1,
                      'SUB_DIR': 'model',
                      'EXP_RUNTHROUGH': 5,
                      'REMOVE_BACKGROUND': True}

The output from the database file:

import h5py

f1 = h5py.File('path/to/database', 'r')
feat = f1['features'][:]
print(feat[0][0][0])

is -16.01733. It seems to be different from the value mentioned above.

And the other output from the database file:

import numpy as np
import h5py

mel_bins = 40
h5 = h5py.File('complete_database.h5', 'r')
f = h5['features'][:]
a = 1e12
for i in f:
    if i[0].shape[0] < a:
        a = i[0].shape[0] 
b = a / mel_bins

is 2084.

Thank you in advance.

Metrics output plot: [attached image]

BR, Mehti

adbailey1 commented 2 years ago

Hi Mehti,

The depression dataset is very hard, so this is to be expected. Many researchers have reported their results on the development set only (which is also what I did, as the test set labels are meant to be hidden for competition reasons), so our focus is on trying to reduce the difference, as you say, between the training and development sets.

Your methodology looks fine and so does the config file; the only difference I see, which might account for the differing values, is that I used mel spectrograms, not logmel.
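To illustrate the mel vs. logmel distinction, here is a librosa sketch (the parameters mirror the config above, but this is not the repo's exact extraction code):

import librosa

y, sr = librosa.load('participant_audio.wav', sr=16000)   # hypothetical file

mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024,
                                     hop_length=512, n_mels=40)

# logmel applies a dB (log) compression on top of the mel spectrogram,
# which is why raw values such as feat[0][0][0] differ between the two.
logmel = librosa.power_to_db(mel)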

One way to potentially improve performance is data augmentation, applied to the training data only. You could try Gaussian noise, pitch shifting, and tempo changes (although I would be careful with the last one, as it might alter the training set too much with respect to depression characteristics).
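A minimal augmentation sketch along those lines (librosa-based; the noise level, pitch shift, and stretch rate are arbitrary examples, not tuned values):

import numpy as np
import librosa

y, sr = librosa.load('participant_audio.wav', sr=16000)   # hypothetical file

# Additive Gaussian noise at a small relative amplitude
noisy = y + 0.005 * np.random.randn(len(y)).astype(np.float32)

# Pitch shift by +2 semitones
pitched = librosa.effects.pitch_shift(y, sr=sr, n_steps=2)

# Tempo change (time stretch) by 10% - use with care, as noted above
stretched = librosa.effects.time_stretch(y, rate=1.1)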

Another thing to try would be training your model with a different learning regime; you could use SGD instead of Adam, for example. As I mentioned above, you could also experiment with increasing the length of the audio segments; even if the initial performance is worse, you could then adapt the model or the training regime, which might give better results.

Hope this helps.

Kind regards,

Andrew