Baseline system for DCASE 2023 task 6, subtask A

This repository contains the baseline system for the DCASE 2023 challenge task 6A on audio captioning.

The main model is composed of a convolutional encoder and a transformer decoder that autoregressively models captions conditioned on log-mel spectrograms. This year, the baseline reuses the audio encoder trained as part of the task 6B baseline on audio retrieval.
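
For orientation, the sketch below shows the general data flow described above: a convolutional encoder turns log-mel frames into embeddings, and a transformer decoder predicts caption tokens one step at a time conditioned on them. It is a minimal, self-contained illustration with assumed layer sizes and modules, not the baseline's actual implementation (which reuses the task 6B encoder and a BART decoder, detailed later in this readme).

import torch
import torch.nn as nn

class CaptioningSketch(nn.Module):
    """Illustrative only: shapes and layer choices are assumptions."""
    def __init__(self, n_mels=64, d_model=768, vocab_size=50265):
        super().__init__()
        # Stand-in for the convolutional audio encoder (task 6B weights in the baseline)
        self.audio_encoder = nn.Sequential(
            nn.Conv1d(n_mels, d_model, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(d_model, d_model, kernel_size=3, padding=1),
        )
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=12, dim_feedforward=3072, batch_first=True),
            num_layers=6,
        )
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, logmel, caption_tokens):
        # logmel: (batch, frames, n_mels); caption_tokens: (batch, seq_len)
        memory = self.audio_encoder(logmel.transpose(1, 2)).transpose(1, 2)
        tgt = self.token_emb(caption_tokens)
        seq_len = caption_tokens.size(1)
        # Causal mask: each position only attends to previous caption tokens
        causal = torch.triu(torch.full((seq_len, seq_len), float('-inf')), diagonal=1).to(logmel.device)
        return self.lm_head(self.decoder(tgt, memory, tgt_mask=causal))

logits = CaptioningSketch()(torch.randn(2, 100, 64), torch.randint(0, 50265, (2, 12)))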

For more information, please refer to the corresponding DCASE subtask page.


Table of contents

  1. Repository setup
  2. Clotho dataset
    1. Obtaining the data from Zenodo
    2. Data pre-processing
    3. Pre-processing parameters
  3. Running the baseline system
    1. Running an experiment
    2. Evaluation with pre-trained weights
  4. Details of experiment settings
    1. Adaptation settings
    2. Data settings
    3. Language model settings
    4. Training settings
    5. Workflow settings

Repository setup

The first step in running the baseline system is to clone this repository on your computer:

$ git clone git@github.com:felixgontier/dcase-2023-baseline.git

This operation will create a dcase-2023-baseline directory at the current location with the contents of this repository. The dcase-2023-baseline directory will be referred to as the root directory in the rest of this readme.

Next, a recent version of PyTorch is required to run the baseline.

Note: The baseline system is developed with Python 3.7, PyTorch 1.7.1 and CUDA 10.1. Please refer to the PyTorch setup guide for PyTorch/CUDA compatibility information.

Other required packages can be installed using pip by running the following commands in the root directory:

$ python3.7 -m venv env/ # Optionally create a virtual environment
$ source env/bin/activate # Activate the virtual environment, if one was created
$ pip install -r requirements_pip.txt

The audio encoder of the baseline system is initialized with trained weights from the retrieval subtask baseline. The corresponding checkpoint audio_encoder.pth is hosted on Zenodo.

  1. Download audio_encoder.pth from the Zenodo repository and place it in the root baseline directory.
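
If desired, the checkpoint can be sanity-checked from Python before training. This is an optional step, not part of the baseline workflow, and assumes the file sits in the root directory:

import torch

# Optional check: confirm the downloaded encoder checkpoint loads correctly.
checkpoint = torch.load('audio_encoder.pth', map_location='cpu')
print(type(checkpoint))
if isinstance(checkpoint, dict):
    print(list(checkpoint)[:5])  # first few keys, e.g. of a state dict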

Lastly, the caption-evaluation-tools repository is needed for evaluation.

  1. Download and extract the repository in the baseline root directory.
  2. Download the Stanford models by running:
$ cd coco_caption
$ ./get_stanford_models.sh

Note that the caption evaluation tools require that Java is installed and enabled.


Clotho dataset

Obtaining the data from Zenodo

The Clotho v2.1 dataset can be found on Zenodo: DOI

The test set (without captions) is available separately: DOI

After downloading all .7z archives and .csv caption files from both repositories, the audio files should be extracted into the data directory.

Specifically, the directory structure should be as follows from the baseline root directory:

data/
 | - clotho_v2/
 |   | - development/
 |   |   | - *.wav
 |   | - validation/
 |   |   | - *.wav
 |   | - evaluation/
 |   |   | - *.wav
 |   | - test/
 |   |   | - *.wav
 |   | - clotho_captions_development.csv
 |   | - clotho_captions_validation.csv
 |   | - clotho_captions_evaluation.csv

Data pre-processing

Pre-processing operations are implemented in clotho_dataset.py and audio_logmels.py. These are the same as for the task 6B baseline, except for the added handling of the Clotho testing subset.

Dataset preparation is done by running the following commands:

$ python clotho_dataset.py
$ python audio_logmels.py

These scripts output <split>_audio_logmels.hdf5 and <split>_text.csv files in the data subdirectory.
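
As a quick check that pre-processing succeeded, the generated files can be inspected from Python. The snippet below only lists whatever datasets and rows are present; it assumes the development split and the data/ paths used above, and makes no assumption about the internal HDF5 layout:

import csv
import h5py

# List the datasets (and their shapes) stored in the development log-mel file
with h5py.File('data/development_audio_logmels.hdf5', 'r') as f:
    f.visititems(lambda name, obj: print(name, getattr(obj, 'shape', '')))

# Print the first few rows of the matching caption file
with open('data/development_text.csv', newline='') as f:
    for i, row in enumerate(csv.reader(f)):
        print(row)
        if i >= 2:
            break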


Running the baseline system

Running an experiment

Experiment settings are defined in a YAML file located in the exp_settings directory. The dcb.yaml file contains the parameters used to produce the reported baseline results. Specific settings are detailed below.

To run an experiment according to a <exp_name>.yaml settings file, use the following command:

$ python main.py --exp <exp_name>

After training, model weights are saved to an outputs/<exp_name>_out/ directory.

Evaluation with pre-trained weights

  1. Download pre-trained weights from DOI
  2. In exp_settings/dcb.yaml, change the lm/eval_model setting to /path/to/dcase_baseline_pre_trained.bin, with the correct path to the downloaded file.
  3. Set the workflow/train and workflow/validate settings to false, and workflow/evaluate and/or workflow/infer to true.
  4. Run the evaluation and/or inference:
$ python main.py --exp dcb

Details of experiment settings

Experiment settings described in the exp_settings/dcb.yaml file are:

adapt:
  audio_emb_size: 2048
  nb_layers: 1
data:
  root_dir: data
  max_audio_len: 2048
  max_caption_tok_len: 64
lm:
  config: # Model parameters
    activation_dropout: 0.1
    activation_function: 'gelu'
    attention_dropout: 0.1
    classifier_dropout: 0.0
    d_model: 768
    decoder_attention_heads: 12
    decoder_ffn_dim: 3072
    decoder_layers: 6
    dropout: 0.1
    encoder_attention_heads: 12
    encoder_ffn_dim: 3072
    encoder_layers: 0
    vocab_size: 50265
  generation: # Generation parameters
    early_stopping: true
    no_repeat_ngram_size: 3
    num_beams: 4
    min_length: 5
    max_length: 100
    length_penalty: 1.0
    decoding: beam
  eval_model: best
  eval_checkpoint: null
  freeze:
    all: false
    attn: false
    dec: false
    dec_attn: false
    dec_mlp: false
    dec_self_attn: false
    enc: false
    enc_attn: false
    enc_mlp: false
    mlp: false
  tokenizer: facebook/bart-base
  pretrained: null
training:
  eval_steps: 1000
  force_cpu: false
  batch_size: 4
  gradient_accumulation_steps: 2
  num_workers: 8
  lr: 1.0e-05
  nb_epochs: 20
  save_steps: 1000
  seed: 0
workflow:
  train: true
  validate: true
  evaluate: true
  infer: false

Adaptation settings

The adaptation block defines a small adaptation network before the transformer encoder. Its aim is to adjust the dimension of audio features to that of the transformer (lm/config/d_model setting).
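
As a rough illustration, with the dcb.yaml values above (audio_emb_size: 2048, nb_layers: 1, d_model: 768) the adaptation network boils down to a single learned projection. The sketch below is an assumption about its shape, not the baseline's exact module:

import torch.nn as nn

# Hypothetical adapter: project 2048-dim audio embeddings to the 768-dim
# transformer width, using nb_layers linear layers (one layer in dcb.yaml).
def build_adapter(audio_emb_size=2048, d_model=768, nb_layers=1):
    layers, in_dim = [], audio_emb_size
    for _ in range(nb_layers):
        layers.append(nn.Linear(in_dim, d_model))
        in_dim = d_model
    return nn.Sequential(*layers)

adapter = build_adapter()  # Linear(2048 -> 768)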

Data settings

The data block contains settings related to the dataset.

Language model settings

The lm block contains settings related to both the encoder and decoder of the main transformer model, which is derived from BART.

The config sub-block details the model, as per the HuggingFace BART configuration. Provided settings replicate the bart-base model configuration.

Note: The vocab_size parameter depends on the pre-trained tokenizer defined by lm/tokenizer.
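
As an illustration, the config block corresponds closely to HuggingFace's BartConfig arguments. The snippet below builds an equivalent configuration and takes vocab_size from the lm/tokenizer model; it mirrors the values in dcb.yaml but is not necessarily how the baseline constructs its model:

from transformers import BartConfig, BartTokenizer

tokenizer = BartTokenizer.from_pretrained('facebook/bart-base')  # lm/tokenizer
config = BartConfig(
    activation_dropout=0.1,
    activation_function='gelu',
    attention_dropout=0.1,
    classifier_dropout=0.0,
    d_model=768,
    decoder_attention_heads=12,
    decoder_ffn_dim=3072,
    decoder_layers=6,
    dropout=0.1,
    encoder_attention_heads=12,
    encoder_ffn_dim=3072,
    encoder_layers=0,
    vocab_size=tokenizer.vocab_size,  # 50265 for facebook/bart-base
)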

The generation sub-block provides generation-specific settings (see the HuggingFace Generation documentation).
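
A hedged sketch of how these settings map onto generate() keyword arguments is shown below. It reuses the config and tokenizer from the previous snippet and feeds random features in place of real adapted audio embeddings, so the decoded text is meaningless:

import torch
from transformers import BartForConditionalGeneration

model = BartForConditionalGeneration(config)  # config from the snippet above
model.eval()

fake_audio = torch.randn(1, 20, 768)  # stand-in for adapted audio embeddings
with torch.no_grad():
    token_ids = model.generate(
        inputs_embeds=fake_audio,   # forwarded to the (0-layer) encoder
        num_beams=4,                # "decoding: beam" with 4 beams
        early_stopping=True,
        no_repeat_ngram_size=3,
        min_length=5,
        max_length=100,
        length_penalty=1.0,
    )
print(tokenizer.batch_decode(token_ids, skip_special_tokens=True))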

The freeze sub-block enables freezing different components of the transformer (attention, MLP, self-attention or cross-attention).
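
For example, a freeze flag presumably translates into setting requires_grad to False on the matching parameters. The helper below is a hypothetical illustration of the dec_attn case, not the baseline's actual selection logic:

import torch.nn as nn

# Hypothetical: freeze all decoder attention weights (dec_attn: true)
def freeze_decoder_attention(model: nn.Module):
    for name, param in model.named_parameters():
        if 'decoder' in name and 'attn' in name:
            param.requires_grad = False

# Usage, e.g. on the BART model built above:
# freeze_decoder_attention(model)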

Other parameters select the model used for evaluation (eval_model and eval_checkpoint), the pre-trained tokenizer (tokenizer), and optional pre-trained weights to initialize the language model from (pretrained).

Training settings

The training block describes parameters of the training process. For example, with batch_size: 4 and gradient_accumulation_steps: 2, gradients are accumulated over two batches, giving an effective batch size of 8.

Workflow settings

The workflow block sets which operations (training, validation, evaluation, inference) are carried out in the experiment.