b04901014 / FT-w2v2-ser

Official implementation for the paper "Exploring Wav2vec 2.0 fine-tuning for improved speech emotion recognition"

The result on MELD dataset is not as good as IEMOCAP #6

Open Gpwner opened 2 years ago

Gpwner commented 2 years ago
  1. I downloaded the MELD dataset from https://affective-meld.github.io/
  2. Then I did some label mapping, which maps 'joy' to 'happy' and 'sadness' to 'sad', with this script:

    def map_label(x):
        if x == 'neutral':
            return 'neutral'
        elif x == 'joy':
            return 'happy'
        elif x == 'anger':
            return 'anger'
        elif x == 'sadness':
            return 'sad'
        else:
            return '-1'

  3. Then I extracted the audio from each mp4 as a 16 kHz mono wav with this shell script:

    #!/bin/bash
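    # convert every .mp4 in the input directory ($1) to a 16 kHz mono .wav in the output directory ($2)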
    files=$(ls $1)
    for filename in $files
    do
    echo ${filename%.*}
    ffmpeg -i $1/${filename%.*}.mp4 -f wav -ar 16000 -ac 1  $2/${filename%.*}.wav
    done

  4. I selected samples to train the model. Here are my sample statistics:

    
    Statistics of training splits:
    ----Involved Emotions----
    sad: 450 examples
    anger: 450 examples
    neutral: 450 examples
    happy: 450 examples
    Total 1800 examples
    ----Examples Involved----

    Statistics of testing splits:
    ----Involved Emotions----
    sad: 233 examples
    anger: 233 examples
    happy: 233 examples
    neutral: 233 examples
    Total 932 examples
    ----Examples Involved----

  5. Then I ran https://github.com/b04901014/FT-w2v2-ser/blob/main/bin/run_exp_iemocap.sh, but unfortunately I got an error like this:

f"mask_length has to be smaller than sequence_length, but got mask_length: {mask_length} and sequence_length: {sequence_length}"

So I modified the code in https://github.com/b04901014/FT-w2v2-ser/blob/main/modules/FeatureFuser.py from 
            # apply SpecAugment along time axis
            batch_size, sequence_length, hidden_size = wav2vec_z.size()
            mask_time_indices = _compute_mask_indices(
                (batch_size, sequence_length),
                self.mask_time_prob,
                self.mask_time_length,
                min_masks=2,
                device=x.device
            )
to
            # apply SpecAugment along time axis
            batch_size, sequence_length, hidden_size = wav2vec_z.size()
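            # clamp the mask length so it never exceeds the sequence length of short utterances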
            mask_length = min(self.mask_time_length, sequence_length)
            mask_time_indices = _compute_mask_indices(
                (batch_size, sequence_length),
                self.mask_time_prob,
                mask_length,
                min_masks=2,
                device=x.device
            )
  6. Finally I got this result:

![image](https://user-images.githubusercontent.com/19349207/145134161-5a1febb8-e146-4eb7-9b1b-6ceecaff989a.png)

Here is the P-TAPT.log:

+++ SUMMARY +++
Mean UAR [%]: 39.21
Fold Std. UAR [%]: 0.00
Fold Median UAR [%]: 39.21
Run Std. UAR [%]: 0.62
Run Median UAR [%]: 38.84
Mean WAR [%]: 39.21
Fold Std. WAR [%]: 0.00
Fold Median WAR [%]: 39.21
Run Std. WAR [%]: 0.62
Run Median WAR [%]: 38.84
Mean macroF1 [%]: 38.16
Fold Std. macroF1 [%]: 0.00
Fold Median macroF1 [%]: 38.16
Run Std. macroF1 [%]: 1.45
Run Median macroF1 [%]: 37.73
Mean microF1 [%]: 39.95
Fold Std. microF1 [%]: 0.00
Fold Median microF1 [%]: 39.95
Run Std. microF1 [%]: 0.38
Run Median microF1 [%]: 39.93


And the confusion matrix:

[[358. 310. 207. 290.]
 [145. 512. 230. 278.]
 [175. 299. 370. 321.]
 [156. 231. 191. 587.]]



The result is quite poor. It would be great if you could help.

Gpwner commented 2 years ago

Here is the part_of_friends.json: part_of_friends.zip

b04901014 commented 2 years ago

Some guidelines for debugging.

b04901014 commented 2 years ago

Also, just some experience with working on the MELD audio.

Gpwner commented 2 years ago

V-FT is even worse. Am I missing anything? I ran this Python command:

python run_downstream_custom_multiple_fold.py --precision 16 --datadir Dataset/combineData16KHz/ --labeldir PART_OF_FRIENDS/labels/ --saving_path VFT/downstreammul --outputfile VFT.log  --max_epochs=30

Here is my result:

+++ SUMMARY +++
Mean UAR [%]: 25.00
Fold Std. UAR [%]: 0.00
Fold Median UAR [%]: 25.00
Run Std. UAR [%]: 0.00
Run Median UAR [%]: 25.00
Mean WAR [%]: 25.00
Fold Std. WAR [%]: 0.00
Fold Median WAR [%]: 25.00
Run Std. WAR [%]: 0.00
Run Median WAR [%]: 25.00
Mean macroF1 [%]: 10.00
Fold Std. macroF1 [%]: 0.00
Fold Median macroF1 [%]: 10.00
Run Std. macroF1 [%]: 0.00
Run Median macroF1 [%]: 10.00
Mean microF1 [%]: 10.00
Fold Std. microF1 [%]: 0.00
Fold Median microF1 [%]: 10.00
Run Std. microF1 [%]: 0.00
Run Median microF1 [%]: 10.00

The confusion matrix:

[[233.   0.   0.   0.]
 [233.   0.   0.   0.]
 [233.   0.   0.   0.]
 [233.   0.   0.   0.]]

The training log is attached: V_FT.zip

By the way, when it comes to "the audio needs to be normalized", what is the recommended way to do it? Thanks.

b04901014 commented 2 years ago

You can observe from the training loss that it is not decreasing for V-FT, so the training is not really happening.

Something like:

wav = (wav - wav.mean()) / (wav.std() + 1e-12)

will normalize the audio signal.

Some datasets need this for the loss to go down, depending on the recording environment.

b04901014 commented 2 years ago

You may add it in the __getitem__ of the downstream dataloader. But if you run TAPT/P-TAPT, you'll also have to add it in the pretrain dataloader. Or you can simply preprocess the audio.
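If you take the preprocessing route, a minimal sketch (assuming the 16 kHz mono wavs produced above, overwritten in place; the directory is just a placeholder for whatever you pass to --datadir) could be:

    # Hypothetical offline preprocessing: rewrite each wav with zero mean and unit variance.
    import glob

    import numpy as np
    import soundfile as sf

    for path in glob.glob('Dataset/combineData16KHz/*.wav'):
        wav, sr = sf.read(path)
        wav = wav.astype(np.float32)
        wav = (wav - wav.mean()) / (wav.std() + 1e-12)
        # unit-variance audio exceeds the [-1, 1] range of 16-bit PCM, so store as float
        sf.write(path, wav, sr, subtype='FLOAT')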

Gpwner commented 2 years ago

So the mean and std should be calculated across all the training samples before returning a single sample from https://github.com/b04901014/FT-w2v2-ser/blob/main/downstream/Custom/dataloader.py#L73, like this:

# mean and std are computed over all the training samples
return (wav.astype(np.float32) - mean) / (std + 1e-12), label

right?

b04901014 commented 2 years ago

No. That is another way of doing normalization, used for spectral-based features.

For raw audio, we can do it within each sample: the statistics are calculated and the normalization is applied per sample. This normalizes the average volume (std) and the DC offset (mean), which shouldn't affect the emotion information.
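For illustration, a tiny sketch of the two schemes side by side (the function names and the train_mean / train_std statistics are hypothetical, only to make the contrast concrete):

    import numpy as np

    # per-utterance, for raw waveforms: statistics come from this clip alone
    def normalize_per_utterance(wav: np.ndarray) -> np.ndarray:
        return (wav - wav.mean()) / (wav.std() + 1e-12)

    # corpus-level, more common for spectral features: mean/std are precomputed once
    # over the whole training set and then applied to every sample
    def normalize_corpus_level(feat: np.ndarray, train_mean: float, train_std: float) -> np.ndarray:
        return (feat - train_mean) / (train_std + 1e-12)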

Gpwner commented 2 years ago

> No. That is another way of doing normalization, used for spectral-based features.
>
> For raw audio, we can do it within each sample: the statistics are calculated and the normalization is applied per sample. This normalizes the average volume (std) and the DC offset (mean), which shouldn't affect the emotion information.

So the __getitem__ method (https://github.com/b04901014/FT-w2v2-ser/blob/main/downstream/Custom/dataloader.py#L68) should look like this:

    def __getitem__(self, i):
        dataname = self.dataset[i]
        wav, _sr = sf.read(dataname)
        _label = self.label[self.datasetbase[i]]
        label = self.labeldict[_label]
        wav = wav.astype(np.float32)
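        # per-utterance normalization: zero mean, unit variance (statistics from this clip only)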
        wav = (wav - wav.mean()) / (wav.std() + 1e-12)
        return wav, label

But the result didn't get better. Here is the training log: V_FT.zip. By the way, I have changed the command to:

python run_downstream_custom.py --precision 16 --datadir Dataset/combineData16KHz/ --labelpath PART_OF_FRIENDS/labels/part_of_friends.json --output_path VFT/downstreammul  --max_epochs=30

b04901014 commented 2 years ago

You can still observe from the log that your loss is not decreasing. The learning rate needs to be lower, e.g. --lr 2e-5.

qq.log

Here is a log where I used --lr 2e-5 --batch_size 32 on your dataset for only 15 epochs. Again, you can see that my loss decreases within a few epochs, but yours does not, so you should tune the hyper-parameters accordingly.

Also, I would imagine it requires many more epochs to converge. You can see that 15 epochs only gets the loss to about 0.9; ideally we want the training loss to overfit down to about 0.2~0.3.

Gpwner commented 2 years ago

> You can still observe from the log that your loss is not decreasing. The learning rate needs to be lower, e.g. --lr 2e-5.
>
> qq.log
>
> Here is a log where I used --lr 2e-5 --batch_size 32 on your dataset for only 15 epochs. Again, you can see that my loss decreases within a few epochs, but yours does not, so you should tune the hyper-parameters accordingly.
>
> Also, I would imagine it requires many more epochs to converge. You can see that 15 epochs only gets the loss to about 0.9; ideally we want the training loss to overfit down to about 0.2~0.3.

Got it. So how can we decide whether a dataset should be normalized or not?

Gpwner commented 2 years ago

I think maybe we don't need the normalization. I just ran the code without normalization and with a batch size of 64:

python run_downstream_custom.py --precision 16 --datadir Dataset/combineData16KHz/ --labelpath PART_OF_FRIENDS/labels/part_of_friends.json --output_path VFT/downstreammul --max_epochs=30 --lr 2e-5

After 30 epochs the loss has decreased to 0.422. Here is the training log: V_FT_NO_NORMALIZITION.zip

b04901014 commented 2 years ago

Yeah, maybe it's just the learning rate that matters. Hyper-parameters should be tuned dataset-to-dataset.