This repository contains the baseline system for the DCASE 2023 challenge task 6A on audio captioning.
The main model is composed of a convolutional encoder and a transformer decoder that autoregressively models captions conditioned on log-mel spectrograms. This year, the baseline reuses the audio encoder trained as part of the task 6B baseline on audio retrieval.
For more information, please refer to the corresponding DCASE subtask page.
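For illustration only, the sketch below shows a toy version of this kind of setup in PyTorch: a convolutional encoder over log-mel frames and a transformer decoder that predicts caption tokens autoregressively. The class name, layer sizes, and mask construction are placeholders and do not reproduce the baseline's actual implementation.

```python
# Hypothetical sketch of the overall architecture (NOT the baseline code):
# a convolutional encoder turns a log-mel spectrogram into a sequence of
# embeddings, and a transformer decoder models caption tokens autoregressively,
# conditioned on those embeddings.
import torch
import torch.nn as nn

class ToyCaptioner(nn.Module):
    def __init__(self, n_mels=64, d_model=768, vocab_size=50265):
        super().__init__()
        # Convolutional encoder over the mel-frequency axis.
        self.encoder = nn.Sequential(
            nn.Conv1d(n_mels, d_model, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(d_model, d_model, kernel_size=3, padding=1),
        )
        self.embed = nn.Embedding(vocab_size, d_model)
        decoder_layer = nn.TransformerDecoderLayer(d_model, nhead=12, batch_first=True)
        self.decoder = nn.TransformerDecoder(decoder_layer, num_layers=6)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, log_mels, caption_tokens):
        # log_mels: (batch, time, n_mels); caption_tokens: (batch, seq_len)
        memory = self.encoder(log_mels.transpose(1, 2)).transpose(1, 2)
        tgt = self.embed(caption_tokens)
        seq_len = caption_tokens.size(1)
        # Causal mask so each position only attends to previous caption tokens.
        causal_mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
        out = self.decoder(tgt, memory, tgt_mask=causal_mask)
        return self.lm_head(out)  # next-token logits

model = ToyCaptioner()
logits = model(torch.randn(2, 100, 64), torch.randint(0, 50265, (2, 12)))
print(logits.shape)  # torch.Size([2, 12, 50265])
```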
The first step in running the baseline system is to clone this repository on your computer:
$ git clone git@github.com:felixgontier/dcase-2023-baseline.git
This operation will create a `dcase-2023-baseline` directory at the current location, with the contents of this repository. The `dcase-2023-baseline` directory will be referred to as the root directory in the rest of this readme.
Next, a recent version of PyTorch is required to run the baseline.
Note: The baseline system is developed with Python 3.7, PyTorch 1.7.1 and CUDA 10.1. Please refer to the PyTorch setup guide for PyTorch/CUDA compatibility information.
Other required packages can be installed using Pip by running the following command in the root directory:
$ python3.7 -m venv env/ # Optionally create a virtual environment
$ pip install -r requirements_pip.txt
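After installation, a quick check such as the one below (optional, not part of the baseline) confirms the PyTorch version and whether CUDA is visible:

```python
# Quick sanity check of the PyTorch installation and CUDA availability.
import torch

print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```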
The audio encoder of the baseline system is initialized with trained weights from the retrieval subtask baseline. The corresponding checkpoint `audio_encoder.pth` is hosted on Zenodo. Download `audio_encoder.pth` from the Zenodo repository and place it in the root baseline directory.

Lastly, the caption-evaluation-tools are needed for evaluation. To set them up, run:
$ cd coco_caption
$ ./get_stanford_models.sh
Note that the caption evaluation tools require that Java is installed and enabled.
The Clotho v2.1 dataset can be found on Zenodo:
The test set (without captions) is available separately:
After downloading all `.7z` archives and `.csv` caption files from both repositories, audio files should be extracted in the `data` directory.
Specifically, the directory structure should be as follows from the baseline root directory:
data/
| - clotho_v2/
| | - development/
| | | - *.wav
| | - validation/
| | | - *.wav
| | - evaluation/
| | | - *.wav
| | - test/
| | | - *.wav
| | - clotho_captions_development.csv
| | - clotho_captions_validation.csv
| | - clotho_captions_evaluation.csv
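Before running pre-processing, a short script such as the following (hypothetical, not part of the repository) can confirm that the layout above is in place:

```python
# Hypothetical helper to check the expected Clotho v2.1 layout under data/.
from pathlib import Path

root = Path("data/clotho_v2")
splits = ["development", "validation", "evaluation", "test"]
csvs = ["clotho_captions_development.csv",
        "clotho_captions_validation.csv",
        "clotho_captions_evaluation.csv"]

for split in splits:
    n_wav = len(list((root / split).glob("*.wav")))
    print(f"{split}: {n_wav} wav files")
for csv in csvs:
    print(f"{csv}: {'found' if (root / csv).is_file() else 'MISSING'}")
```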
Pre-processing operations are implemented in `clotho_dataset.py` and `audio_logmels.py`. These are the same as for the task 6B baseline, except for the added handling of the Clotho-testing subset.
Dataset preparation is done by running the following commands:
$ python clotho_dataset.py
$ python audio_logmels.py
The scripts output `<split>_audio_logmels.hdf5` and `<split>_text.csv` files in the `data` subdirectory.
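For a quick sanity check of these outputs, something along the following lines can be used. This is a sketch: it only lists top-level HDF5 keys and CSV columns, since their exact internal layout is not documented here, and it assumes the development split files sit directly under `data/`.

```python
# Hypothetical inspection of the pre-processing outputs; keys and columns are
# printed rather than assumed, since their exact layout is not documented here.
import h5py
import pandas as pd

with h5py.File("data/development_audio_logmels.hdf5", "r") as f:
    print("HDF5 keys:", list(f.keys()))

captions = pd.read_csv("data/development_text.csv")
print(captions.columns.tolist())
print(captions.head())
```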
Experiment settings are defined in a YAML file located in the `exp_settings` directory. The `dcb.yaml` file contains the parameters used to produce the reported baseline results.
Specific settings are detailed below.
To run an experiment according to an `<exp_name>.yaml` settings file, use the following command:
$ python main.py --exp <exp_name>
After training, model weights are saved to an `outputs/<exp_name>_out/` directory.
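A settings file can also be inspected programmatically, which is convenient for scripting experiments. A minimal sketch, assuming PyYAML is installed and the key layout shown in the listing further below:

```python
# Minimal sketch: load an experiment settings file and read a few values.
import yaml

with open("exp_settings/dcb.yaml") as f:
    settings = yaml.safe_load(f)

print("d_model:", settings["lm"]["config"]["d_model"])
print("decoding:", settings["lm"]["generation"]["decoding"])
print("workflow:", settings["workflow"])
```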
To run evaluation or inference with the provided pre-trained weights, in `exp_settings/dcb.yaml`, change the `lm/eval_model` setting to `/path/to/dcase_baseline_pre_trained.bin`, with the correct path to the downloaded file. Then set `workflow/train` and `workflow/validate` to `false`, and `workflow/evaluate` and/or `workflow/infer` to `true`. Finally, run:

$ python main.py --exp dcb
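Alternatively, these edits can be scripted rather than done by hand. A minimal sketch, assuming PyYAML is installed and the settings layout shown in the listing below; the `.bin` path is a placeholder:

```python
# Hypothetical helper: switch exp_settings/dcb.yaml to evaluation-only mode
# with the provided pre-trained weights. The .bin path below is a placeholder.
import yaml

path = "exp_settings/dcb.yaml"
with open(path) as f:
    settings = yaml.safe_load(f)

settings["lm"]["eval_model"] = "/path/to/dcase_baseline_pre_trained.bin"
settings["workflow"].update({"train": False, "validate": False,
                             "evaluate": True, "infer": False})

with open(path, "w") as f:
    yaml.safe_dump(settings, f, sort_keys=False)
```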
Experiment settings described in the `exp_settings/dcb.yaml` file are:
adapt:
audio_emb_size: 2048
nb_layers: 1
data:
root_dir: data
max_audio_len: 2048
max_caption_tok_len: 64
lm:
config: # Model parameters
activation_dropout: 0.1
activation_function: 'gelu'
attention_dropout: 0.1
classifier_dropout: 0.0
d_model: 768
decoder_attention_heads: 12
decoder_ffn_dim: 3072
decoder_layers: 6
dropout: 0.1
encoder_attention_heads: 12
encoder_ffn_dim: 3072
encoder_layers: 0
vocab_size: 50265
generation: # Generation parameters
early_stopping: true
no_repeat_ngram_size: 3
num_beams: 4
min_length: 5
max_length: 100
length_penalty: 1.0
decoding: beam
eval_model: best
eval_checkpoint: null
freeze:
all: false
attn: false
dec: false
dec_attn: false
dec_mlp: false
dec_self_attn: false
enc: false
enc_attn: false
enc_mlp: false
mlp: false
tokenizer: facebook/bart-base
pretrained: null
training:
eval_steps: 1000
force_cpu: false
batch_size: 4
gradient_accumulation_steps: 2
num_workers: 8
lr: 1.0e-05
nb_epochs: 20
save_steps: 1000
seed: 0
workflow:
train: true
validate: true
evaluate: true
infer: false
The `adapt` block defines a small adaptation network before the transformer encoder. Its aim is to adjust the dimension of audio features to that of the transformer (`lm/config/d_model` setting).

- `audio_emb_size` (int): Dimension of the audio features, i.e. the input dimension of the adaptation network. In the case of VGGish embeddings, this setting is set to 128.
- `nb_layers` (int): Number of layers of the network. If set to 0, the dimension of the audio features must be equal to that of the transformer. If greater than 0, the network contains `nb_layers` dense layers with output dimension `lm/config/d_model` and ReLU activations, except for the last layer, which has no activation function (see the sketch below).
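As a reference, a minimal sketch of such an adaptation network is given below; the function name and layer arrangement are an assumption based on the description above, not the baseline's actual code.

```python
# Hypothetical sketch of the adaptation network described above:
# nb_layers dense layers mapping audio_emb_size -> d_model, with ReLU
# activations on all but the last layer.
import torch
import torch.nn as nn

def build_adaptation_net(audio_emb_size=2048, d_model=768, nb_layers=1):
    if nb_layers == 0:
        # Without adaptation layers, feature and model dimensions must match.
        assert audio_emb_size == d_model
        return nn.Identity()
    layers = []
    in_dim = audio_emb_size
    for i in range(nb_layers):
        layers.append(nn.Linear(in_dim, d_model))
        if i < nb_layers - 1:  # no activation after the last layer
            layers.append(nn.ReLU())
        in_dim = d_model
    return nn.Sequential(*layers)

# Example: map a batch of 2048-dim audio embeddings to the 768-dim BART space.
adapt = build_adaptation_net()
x = torch.randn(4, 120, 2048)  # (batch, time steps, audio_emb_size)
print(adapt(x).shape)          # torch.Size([4, 120, 768])
```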
The `data` block contains settings related to the dataset.

- `root_dir` (str): Path to the data root directory.
- `max_audio_len` and `max_caption_tok_len` (int): The data loader pads each example audio and tokenized caption to a set length for batching. Provided values are adapted to the VGGish representation and BART tokenization of the baseline.
The `lm` block contains settings related to both the encoder and decoder of the main transformer model, which is derived from BART.
The `config` sub-block details the model, as per the HuggingFace BART configuration. Provided settings replicate the bart-base model configuration.

Note: The `vocab_size` parameter depends on the pre-trained tokenizer defined by `lm/tokenizer`.
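For reference, these values can be passed directly to a HuggingFace `BartConfig`; the snippet below is a sketch, and the baseline's own model construction may differ.

```python
# Sketch: build a BART configuration matching the lm/config block above.
from transformers import BartConfig, BartForConditionalGeneration

config = BartConfig(
    activation_dropout=0.1,
    activation_function="gelu",
    attention_dropout=0.1,
    classifier_dropout=0.0,
    d_model=768,
    decoder_attention_heads=12,
    decoder_ffn_dim=3072,
    decoder_layers=6,
    dropout=0.1,
    encoder_attention_heads=12,
    encoder_ffn_dim=3072,
    encoder_layers=0,
    vocab_size=50265,
)
model = BartForConditionalGeneration(config)
print(sum(p.numel() for p in model.parameters()) / 1e6, "M parameters")
```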
The `generation` sub-block provides generation-specific settings (see the HuggingFace Generation documentation):

- `decoding` (str): `beam` or `greedy` decoding are supported.
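These settings correspond to HuggingFace `generate()` arguments. The snippet below is purely illustrative: it uses plain text-to-text `facebook/bart-base`, whereas the baseline conditions generation on audio features.

```python
# Illustration of the generation parameters with plain text-to-text BART
# (the baseline passes audio-derived encoder states instead of text).
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

inputs = tokenizer("Birds are singing near a small stream.", return_tensors="pt")
outputs = model.generate(
    **inputs,
    num_beams=4,              # beam search, as in decoding: beam
    no_repeat_ngram_size=3,
    early_stopping=True,
    min_length=5,
    max_length=100,
    length_penalty=1.0,
)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```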
The `freeze` sub-block enables freezing different components of the transformer (attention, MLP, self-attention or cross-attention).
Other parameters are:
- `eval_model` (str): Model selection at evaluation/inference. `best` selects the best model according to validation loss during training, `checkpoint` uses a specific checkpoint set by `eval_checkpoint`. This setting can also be set to `/path/to/model.bin` for custom trained model weights, e.g. the provided pre-trained weights.
- `eval_checkpoint` (int): Model checkpoint to use at evaluation/inference. This is ignored unless `eval_model` is set to `checkpoint`.
- `tokenizer` (str): Name of the HuggingFace pre-trained tokenizer.
- `pretrained` (str, null): If not null, name of a HuggingFace pre-trained model (e.g. `facebook/bart-base`). Note that this will bypass all `config` sub-block settings.
The `training` block describes parameters of the training process.
- `eval_steps` (int): Frequency of model validation, in training steps.
- `save_steps` (int): Frequency of model weights saving, in training steps. If `lm/eval_model` is set to `best`, this should be a factor of `eval_steps`.
- `force_cpu` (bool): Force all computations on CPU, even when CUDA is available.
- `batch_size` (int): Batch size during model training and validation.
- `gradient_accumulation_steps` (int): Accumulates gradients over several steps, effectively increasing the batch size without additional memory cost. Gradient accumulation is disabled if this is set to 1 (see the sketch after this list).
- `num_workers` (int): Number of CPU workers for data loading.
- `lr` (float): Learning rate during training.
- `nb_epochs` (int): Number of training epochs.
- `seed` (int, null): Sets a specific torch random seed before experiments. Note that this does not ensure reproducibility when training on GPU.
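As a reference for the `gradient_accumulation_steps` behaviour, the sketch below shows a generic accumulation loop; it is not the baseline's training loop, and the model and data are placeholders.

```python
# Minimal sketch of gradient accumulation (hypothetical, not the baseline's
# training loop). With batch_size=4 and gradient_accumulation_steps=2, the
# optimizer sees an effective batch of 8.
import torch
import torch.nn as nn

model = nn.Linear(768, 10)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
loss_fn = nn.CrossEntropyLoss()
accumulation_steps = 2

for step in range(8):
    x = torch.randn(4, 768)                # one mini-batch of size 4
    y = torch.randint(0, 10, (4,))
    loss = loss_fn(model(x), y) / accumulation_steps  # scale so gradients average over the effective batch
    loss.backward()                        # gradients accumulate across calls
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()                   # one update per accumulated effective batch
        optimizer.zero_grad()
```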
The `workflow` block sets operations to be conducted in the experiment.
- `train` will perform optimization with data in the `</path/to/data>/development` directory, where `</path/to/data>` is the concatenation of the `data/root_dir` and `data/features_dir` settings.
- `validate` must be set to `true` during training if `lm/eval_model` is set to `best`. Validation is done on data in the `</path/to/data>/validation` directory.
- `evaluate` refers to evaluation with metrics, and outputs `metrics_coco_<decoding_method>.json` and `generated_captions_<decoding_method>.txt` files in the `output/<exp_name>_out` directory, where `<decoding_method>` is the `lm/generation/decoding` setting. Evaluation is done on data in the `</path/to/data>/evaluation` directory.
- `infer` refers to caption generation without computing metrics. Inference outputs a submission-ready file `test_output_captions_<decoding_method>.csv`. Inference is performed on data in the `</path/to/data>/test` directory.