Large-scale monolingual speech foundation models

Scripts for training large-scale monolingual speech foundation models with 158K hours of Finnish speech

Pre-trained and fine-tuned (4,600 hours) models

  - wav2vec 2.0 Base Pre-trained
  - wav2vec 2.0 Base Fine-tuned
  - wav2vec 2.0 Large Pre-trained
  - wav2vec 2.0 Large Fine-tuned
  - wav2vec 2.0 X-Large Pre-trained
  - wav2vec 2.0 X-Large Fine-tuned

More details on the models are available in the paper (TBA). The models are also available on the Hugging Face Hub.

Training logs

Developing a foundation model from scratch requires not only vast amounts of unlabeled speech data but also substantial computational resources. Moreover, extensive hyperparameter search is often not feasible for large-scale models. Therefore, we are glad to share our pre-training logs on Weights & Biases (W&B) to provide more insights for other researchers developing their own speech foundation models.

Data pre-processing

The raw, unlabeled TV and radio data are organized into 1-hour files, each stored at channel_name/year/month/day/channel_name_start_time-end_time.ts:

.
└── raw_tv_and_radio_data/
    ├── radio_channel_1/
    │   ├── 2009/
    │   │   ├── 01/
    │   │   │   ├── 01/
    │   │   │   │   ├── radio_channel_1_0000-0100.ts
    │   │   │   │   ├── radio_channel_1_0100-0200.ts
    │   │   │   │   └── ...
    │   │   │   ├── 02/
    │   │   │   │   └── ...
    │   │   │   └── ...
    │   │   ├── 02/
    │   │   │   └── .../
    │   │   │       └── ...
    │   │   └── ...
    │   └── 2010/
    │       └── .../
    │           └── .../
    │               └── ...
    ├── tv_channel_2/
    │   └── ...
    └── ...
  1. Convert the files to 16 kHz mono FLAC audio by running scripts/data_preprocessing/convert_to_flac.sh. The script preserves the original folder structure. (A minimal conversion sketch is shown after this list.)
  2. Run voice activity detection (VAD) to split the data into shorter utterances and to remove non-speech events such as music, noise, and silence, and pack the resulting segments into uncompressed (.tar) tarballs, with one archive per year per radio station or TV channel. The script scripts/data_preprocessing/segment_with_vad_and_tar.sh does this for one year of the radio_channel_1 data (see the VAD sketch after this list). The script also stores a Python dictionary out_file_to_nframes_dict with the number of frames of each audio segment, which is needed later to create the Fairseq manifest of the data. Note: Fairseq does not support compressed archives. Note: millions of small files hurt the performance of any filesystem, which is why quotas on Lustre filesystems are typically limited to a few million files. To avoid running out of quota, pack the short audio files into a .tar archive right after VAD-based segmentation of a small part of the raw data (one day, month, or year), and remove them immediately afterward. You can also consider storing the preprocessed audio files in the /tmp folder, which usually does not count against the quota.
  3. Prepare the Fairseq manifest of the data. scripts/data_preprocessing/prepare_fairseq_manifest.sh creates a .tsv file listing all radio_channel_1 audio samples stored in the corresponding .tar archives (see the manifest sketch after this list). To hold out a validation subset of valid_size_hours hours, run scripts/data_preprocessing/prepare_fairseq_manifest_valid_tsv.sh afterward.
  4. Binarize the Fairseq manifest by running scripts/data_preprocessing/binarize_manifest.sh. This step is recommended for large datasets to avoid running out of RAM during pre-training.
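
The repository scripts cover all of the steps above; the sketches below are only illustrative. First, a rough sketch of step 1, assuming ffmpeg is installed; convert_to_flac.sh may use different tools or options, and the file names are placeholders.

# Hypothetical sketch of step 1: convert one .ts broadcast file to 16 kHz mono FLAC by calling ffmpeg.
import subprocess

def ts_to_flac(ts_path, flac_path):
    subprocess.run(
        ["ffmpeg", "-i", ts_path,   # input MPEG-TS recording
         "-vn",                     # drop any video stream
         "-ac", "1",                # downmix to mono
         "-ar", "16000",            # resample to 16 kHz
         flac_path],                # output format inferred from the .flac extension
        check=True,
    )

ts_to_flac("radio_channel_1_0000-0100.ts", "radio_channel_1_0000-0100.flac")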
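
Second, a hedged sketch of step 2 using the webrtcvad package for frame-level VAD. The real segment_with_vad_and_tar.sh may rely on different VAD tooling and smarter merging of speech runs; only the overall flow (segment, tar, delete, record frame counts) reflects the description above.

# Hypothetical sketch, not the repository's segment_with_vad_and_tar.sh: frame-level VAD
# for a single 16 kHz mono FLAC file, writing each speech run as its own FLAC, adding it
# to an uncompressed tarball, and deleting the temporary file right away to keep the
# file count (and thus the Lustre quota usage) low.
import os
import tarfile

import numpy as np
import soundfile as sf
import webrtcvad

SR = 16000                     # input must already be 16 kHz mono (step 1)
FRAME_MS = 30                  # webrtcvad accepts 10, 20, or 30 ms frames
FRAME_LEN = SR * FRAME_MS // 1000

def segment_file(flac_path, tmp_dir, tar, aggressiveness=2):
    """Split one FLAC file into speech segments, add them to an open tarfile,
    and return a {segment_name: number_of_frames} dictionary."""
    vad = webrtcvad.Vad(aggressiveness)
    audio, sr = sf.read(flac_path, dtype="int16")
    assert sr == SR

    nframes, seg, seg_idx = {}, [], 0
    base = os.path.splitext(os.path.basename(flac_path))[0]
    for i in range(len(audio) // FRAME_LEN + 1):
        frame = audio[i * FRAME_LEN:(i + 1) * FRAME_LEN]
        if len(frame) == FRAME_LEN and vad.is_speech(frame.tobytes(), SR):
            seg.append(frame)                      # still inside a speech run
        elif seg:                                  # a speech run just ended: flush it
            segment = np.concatenate(seg)
            name = f"{base}_{seg_idx:05d}.flac"
            tmp_path = os.path.join(tmp_dir, name)
            sf.write(tmp_path, segment, SR)        # FLAC format inferred from the extension
            tar.add(tmp_path, arcname=name)
            os.remove(tmp_path)                    # avoid piling up millions of small files
            nframes[name] = len(segment)
            seg, seg_idx = [], seg_idx + 1
    return nframes

with tarfile.open("radio_channel_1_2009.tar", "w") as tar:
    out_file_to_nframes_dict = segment_file("radio_channel_1_0000-0100.flac", "/tmp", tar)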
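
Third, a sketch of the standard Fairseq wav2vec manifest layout produced in step 3: the first line is the data root, and every other line is a tab-separated relative audio path and its length in frames. How prepare_fairseq_manifest.sh addresses files inside the tarballs, and how out_file_to_nframes_dict is serialized, are assumptions here.

# Hypothetical sketch of step 3: write a Fairseq-style .tsv manifest from the per-segment
# frame counts collected during segmentation. The paths and the pickle serialization of
# out_file_to_nframes_dict are assumptions, not necessarily the repository's format.
import pickle

data_root = "/scratch/preprocessed/radio_channel_1"    # hypothetical location of the tarballs
tarball = "radio_channel_1_2009.tar"

with open("out_file_to_nframes_dict.pkl", "rb") as f:
    out_file_to_nframes_dict = pickle.load(f)           # {segment_name: number_of_frames}

with open("train.tsv", "w") as tsv:
    print(data_root, file=tsv)                          # first manifest line: the data root
    for name, nframes in out_file_to_nframes_dict.items():
        # one line per segment: relative path (here: inside the tarball) and its frame count
        print(f"{tarball}/{name}\t{nframes}", file=tsv)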

Pre-training the models

The scripts shared in this repository are adapted to the AMD hardware of the LUMI supercomputer. To train a wav2vec 2.0 Base model, run

sbatch scripts/pretraining/pretrain_wav2vec2_base.sh

Note: you can simulate training on 512 GPUs with k GPUs by adding the command-line parameters (before --config-dir) distributed_training.distributed_world_size=k +optimization.update_freq='[x]', where x = 512/k. For example, with 64 GPUs, use distributed_training.distributed_world_size=64 +optimization.update_freq='[8]'.

Fine-tuning the models with CTC

To fine-tune a wav2vec 2.0 Base model using Fairseq, run

sbatch scripts/finetuning/full-scale-asr/finetune_wav2vec2_base.sh

Fine-tuning the models with CTC using 🤗Transformers

To fine-tune a wav2vec 2.0 Base model using Hugging Face Transformers, run

sbatch scripts/finetuning/low-resource-asr/finetune_wav2vec2_base.sh
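
The script launches a full Hugging Face training setup; below is only a minimal, self-contained sketch of the core CTC fine-tuning step with the transformers API. The checkpoint, learning rate, and the synthetic one-utterance batch are placeholders, not the repository's configuration; in practice you would load the Finnish pre-trained checkpoint from the Hub and a real labelled dataset.

# Hypothetical sketch, not the repository's training script: one CTC fine-tuning step
# with Hugging Face Transformers on a synthetic example.
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

ckpt = "facebook/wav2vec2-base-960h"                     # placeholder checkpoint with a CTC head
processor = Wav2Vec2Processor.from_pretrained(ckpt)
model = Wav2Vec2ForCTC.from_pretrained(ckpt)
model.freeze_feature_encoder()                           # keep the convolutional feature encoder frozen
model.train()

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)

# One synthetic (audio, transcript) pair standing in for a real labelled batch.
waveform = torch.randn(16000)                            # 1 s of 16 kHz audio
inputs = processor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
labels = processor.tokenizer("HELLO WORLD", return_tensors="pt").input_ids

outputs = model(input_values=inputs.input_values, labels=labels)
outputs.loss.backward()                                  # CTC loss
optimizer.step()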

Computing Layer Utilization Rate (LUR)

To calculate the neuron attributions for all layers using Integrated Gradients (IG), run scripts/interpretation/ig.sh. After that, run the notebook scripts/interpretation/compute_LUR.ipynb to visualize the Layer Utilization Rates (LURs).
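
ig.sh drives the repository's own attribution code; the following is only a hedged sketch of how per-layer neuron attributions could be computed with Captum's Integrated Gradients for a wav2vec 2.0 CTC model. The checkpoint, the scalar forward score, and the time-averaging at the end are assumptions, not the paper's exact LUR recipe.

# Hypothetical sketch: per-layer neuron attributions with Integrated Gradients (Captum).
import torch
from captum.attr import LayerIntegratedGradients
from transformers import Wav2Vec2ForCTC

model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h").eval()  # placeholder checkpoint

def forward_score(input_values):
    # Reduce the CTC logits to one scalar per example so IG has a scalar output to attribute
    # (an assumption; the paper may attribute a different quantity).
    logits = model(input_values=input_values).logits      # (batch, frames, vocab)
    return logits.max(dim=-1).values.sum(dim=-1)

input_values = torch.randn(1, 16000)                      # 1 s of dummy 16 kHz audio
baseline = torch.zeros_like(input_values)                 # silence-like baseline

per_layer_neuron_scores = []
for layer in model.wav2vec2.encoder.layers:               # one IG pass per Transformer layer
    lig = LayerIntegratedGradients(forward_score, layer)
    attr = lig.attribute(input_values, baselines=baseline, n_steps=16)
    if isinstance(attr, tuple):                           # Captum returns a tuple for multi-output layers
        attr = attr[0]
    # attr matches the layer output shape (batch, frames, hidden);
    # average absolute attribution over batch and time -> one score per neuron
    per_layer_neuron_scores.append(attr.abs().mean(dim=(0, 1)))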

More details on the LUR are available in the paper (TBA).