
Fine-tuning Wav2vec 2.0 for SER

Official implementation for the paper Exploring Wav2vec 2.0 fine-tuning for improved speech emotion recognition. Submitted to ICASSP 2022.

Libraries and dependencies

Faiss can be skipped if you are not running the clustering scripts. Alternatively, check the Dockerfile at docker/Dockerfile for our exact setup. To train the first-phase wav2vec model of P-TAPT, you will need the pretrained wav2vec model checkpoint from Facebook AI Research, which can be obtained here.
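
If you set up the environment manually instead of through the Dockerfile, a quick sanity check can save a failed run later. The following is only a sketch: the package list is an assumption based on the training scripts (the Dockerfile is the authoritative reference), and ./wav2vec_large.pt is a hypothetical download location for the wav2vec checkpoint.

# sanity_check.py -- environment check before training (illustrative sketch, not part of the repo)
import importlib
from pathlib import Path

# faiss is only required for the clustering step of P-TAPT.
for name in ["torch", "pytorch_lightning", "faiss"]:
    try:
        importlib.import_module(name)
        print(f"[ok] {name}")
    except ImportError:
        print(f"[missing] {name}")

# Hypothetical location of the downloaded wav2vec checkpoint (later passed via --wav2vecpath).
ckpt = Path("./wav2vec_large.pt")
print("[ok] wav2vec checkpoint found" if ckpt.exists() else f"[missing] expected checkpoint at {ckpt}")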

Reproduce on IEMOCAP

Results averaged over 5 runs (only the fine-tuning stage is run 5 times), reported with standard deviation; see the results figure in the repository.

Prepare IEMOCAP

Obtain IEMOCAP from USC

cd Dataset/IEMOCAP &&
python make_16k.py IEMOCAP_DIR &&
python gen_meta_label.py IEMOCAP_DIR &&
python generate_labels_sessionwise.py &&
cd ../..

Run scripts

The OUTPUT_DIR should not already exist and should be different for each method. The run takes a long time, since each experiment is repeated NUM_EXPS times and the results are averaged. The aggregated statistics are written to OUTPUT_DIR/{METHOD}.log along with some model checkpoints. Training also requires a lot of VRAM; if compute is a concern, try lowering the batch size (but the results may then differ from those reported).

Run the training scripts on your own dataset

You will need a directory containing all the training wave files sampled at 16 kHz, and a JSON file that contains the emotion labels and the training/validation/testing splits in the following format (a sketch for generating such a file follows the example):

{
    "Train": {
        "audio_filename1": "angry",
        "audio_filename2": "sad",
        ...
    },
    "Val": {
        "audio_filename1": "neutral",
        "audio_filename2": "happy",
        ...
    },
    "Test": {
        "audio_filename1": "neutral",
        "audio_filename2": "angry",
        ...
    }
}
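
As a concrete illustration, here is a minimal sketch that writes such a JSON file from a directory of 16 kHz wav files. The file names, the "neutral" placeholder label, and the 80/10/10 split are assumptions for illustration; replace the label lookup with your own metadata. Only the overall structure shown above is what the scripts expect.

# make_labels.py -- sketch for building the label JSON described above (not part of the repo)
import json
import random
from pathlib import Path

audio_dir = Path("Audio_Dir")  # directory containing the 16 kHz wav files
# Hypothetical label lookup: map each file name to an emotion label from your own metadata.
labels = {p.name: "neutral" for p in sorted(audio_dir.glob("*.wav"))}

names = list(labels)
random.seed(0)
random.shuffle(names)
n_train, n_val = int(0.8 * len(names)), int(0.1 * len(names))
splits = {
    "Train": names[:n_train],
    "Val": names[n_train:n_train + n_val],
    "Test": names[n_train + n_val:],
}

with open("labels.json", "w") as f:
    json.dump({s: {n: labels[n] for n in ns} for s, ns in splits.items()}, f, indent=4)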

V-FT

python run_downstream_custom_multiple_fold.py --precision 16 \
                                              --num_exps NUM_EXP \
                                              --datadir Audio_Dir \
                                              --labeldir LABEL_DIR \
                                              --saving_path SAVING_CKPT_PATH \
                                              --outputfile OUTPUT_FILE

TAPT

Task adaptive training stage

python run_baseline_continueFT.py --saving_path SAVING_CKPT_PATH \
                                  --precision 16 \
                                  --datadir Audio_Dir \
                                  --labelpath LABEL_DIR \
                                  --training_step TAPT_STEPS \
                                  --warmup_step 100 \
                                  --save_top_k 1 \
                                  --lr 1e-4 \
                                  --batch_size 64 \
                                  --use_bucket_sampler

Run the fine-tuning stage

python run_downstream_custom_multiple_fold.py --precision 16 \
                                              --num_exps NUM_EXP \
                                              --datadir Audio_Dir \
                                              --labeldir LABEL_DIR \
                                              --saving_path SAVING_CKPT_PATH \
                                              --pretrained_path PRETRAINED_PATH \
                                              --outputfile OUTPUT_FILE

P-TAPT

Train the first-phase wav2vec

python run_pretrain.py --datadir Audio_Dir \
                       --labelpath Label_Path \
                       --labeling_method hard \
                       --saving_path Saving_Path \
                       --training_step 10000 \
                       --save_top_k 1 \
                       --wav2vecpath Wav2vecCKPT \
                       --precision 16

This will output a w2v-{epoch:02d}-{valid_loss:.2f}-{valid_acc:.2f}.ckpt file corresponding to the best performance on the validation split (randomly split off from the training set).
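
Because the checkpoint name encodes the epoch and validation metrics, the exact file name differs between runs. A small sketch like the following can resolve it before passing it to cluster.py as --model_path; picking the file with the highest validation accuracy encoded in its name is an assumption for illustration (with --save_top_k 1 there is normally only one candidate anyway).

# pick_ckpt.py -- sketch for locating the first-phase checkpoint (not part of the repo)
import re
from pathlib import Path

saving_path = Path("Saving_Path")  # the directory passed as --saving_path above
candidates = list(saving_path.glob("w2v-*.ckpt"))

def encoded_valid_acc(path: Path) -> float:
    # Names follow w2v-{epoch:02d}-{valid_loss:.2f}-{valid_acc:.2f}.ckpt; take the last float.
    return float(re.findall(r"(\d+\.\d+)", path.stem)[-1])

best = max(candidates, key=encoded_valid_acc)
print(best)  # use this path as --model_path for cluster.py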

Clustering

python cluster.py --model_path Model_Path \
                  --labelpath Label_Path \
                  --datadir Audio_Dir \
                  --outputdir Output_Dir \
                  --model_type wav2vec \
                  --sample_ratio 1.0 \
                  --num_clusters "64,512,4096"

Train the second phase wav2vec2

python run_second.py --datadir Audio_Dir \
                     --labelpath Label_Path \
                     --precision 16 \
                     --num_clusters "64,512,4096" \
                     --training_step 20000 \
                     --warmup_step 100 \
                     --saving_path Save_Path \
                     --save_top_k 1 \
                     --use_bucket_sampler \
                     --dynamic_batch

Label_Path is the label file produced by the first-round clustering. To specify your own custom clusters as in the first phase, use the num_clusters option in the same way as for the first-round clustering (the default is 8,64,512,4096). This will write a w2v2-{epoch:02d}-{valid_loss:.2f}-{valid_acc:.2f}.ckpt for the best model on validation (if you have a validation set), and a last.ckpt for the checkpoint of the last epoch.

Run the fine-tuning stage

python run_downstream_custom_multiple_fold.py --precision 16 \
                                              --num_exps NUM_EXP \
                                              --datadir Audio_Dir \
                                              --labeldir LABEL_DIR \
                                              --saving_path SAVING_CKPT_PATH \
                                              --pretrained_path PRETRAINED_PATH \
                                              --outputfile OUTPUT_FILE