Few-Shot Multi-Recording Speaker Identification Transformer Fine-Tuning and Application
Stable Release: pip install speakerbox
Development Head: pip install git+https://github.com/CouncilDataProject/speakerbox.git
For full package documentation please visit councildataproject.github.io/speakerbox.
Link: https://youtu.be/SK2oVqSKPTE
In the example video, we use the Speakerbox library to quickly annotate a dataset of audio clips from the show The West Wing and train a speaker identification model to identify three of the show's characters (President Bartlet, Charlie Young, and Leo McGarry).
Given a set of multi-speaker recordings:
example/
├── 0.wav
├── 1.wav
├── 2.wav
├── 3.wav
├── 4.wav
└── 5.wav
Where each recording contains speech from some or all of a set of known speakers.
You want to train a model to classify portions of audio as one of the N known speakers in future recordings not included in your original training set.
f(audio) -> [(start_time, end_time, speaker), (start_time, end_time, speaker), ...]
i.e. f(audio) -> [(2.4, 10.5, "A"), (10.8, 14.1, "D"), (14.8, 22.7, "B"), ...]
The speakerbox library contains methods both for generating datasets for annotation and for utilizing multiple audio annotation schemes to train such a model.
The following table shows model performance results as the dataset size increases:
| dataset_size | mean_accuracy | mean_precision | mean_recall | mean_training_duration_seconds |
|---|---|---|---|---|
| 15-minutes | 0.874 ± 0.029 | 0.881 ± 0.037 | 0.874 ± 0.029 | 101 ± 1 |
| 30-minutes | 0.929 ± 0.006 | 0.940 ± 0.007 | 0.929 ± 0.006 | 186 ± 3 |
| 60-minutes | 0.937 ± 0.020 | 0.940 ± 0.017 | 0.937 ± 0.020 | 453 ± 7 |
All results reported are the average of five model training and evaluation trials for each of the different dataset sizes. All models were fine-tuned using an NVIDIA GTX 1070 TI.
Note: this table can be reproduced in ~1 hour using an NVIDIA GTX 1070 TI by:
1. Installing the example data download dependency:
pip install speakerbox[example_data]
2. Running the following commands in Python:
from speakerbox.examples import (
download_preprocessed_example_data,
train_and_eval_all_example_models,
)
# Download and unpack the preprocessed example data
dataset = download_preprocessed_example_data()
# Train and eval models with different subsets of the data
results = train_and_eval_all_example_models(dataset)
We quickly generate an annotated dataset by first diarizing (clustering based on speaker audio features) portions of larger audio files and splitting each of the clusters into its own directory, which you can then manually clean up (by removing incorrectly clustered audio segments).
⚠️ To use the diarization portions of speakerbox you need a Hugging Face access token with access to the underlying pyannote diarization models. ⚠️
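As the comments in the examples below note, the token can either be passed explicitly as hf_token or picked up from the HUGGINGFACE_TOKEN environment variable. A minimal sketch of the environment-variable route:

import os

# Make the Hugging Face token available to later diarization calls without
# passing hf_token explicitly (equivalent to exporting it in your shell).
os.environ["HUGGINGFACE_TOKEN"] = "token-from-hugging-face"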
Diarize a single file:
from speakerbox import preprocess
# The token can also be provided via the `HUGGINGFACE_TOKEN` environment variable.
diarized_and_split_audio_dir = preprocess.diarize_and_split_audio(
"0.wav",
hf_token="token-from-hugging-face",
)
Diarize all files in a directory:
from speakerbox import preprocess
from pathlib import Path
from tqdm import tqdm
# Iterate over all 'wav' format files in a directory called 'data'
for audio_file in tqdm(list(Path("data").glob("*.wav"))):
    # The token can also be provided via the `HUGGINGFACE_TOKEN` environment variable.
diarized_and_split_audio_dir = preprocess.diarize_and_split_audio(
audio_file,
# Create a new directory to place all created sub-directories within
storage_dir=f"diarized-audio/{audio_file.stem}",
hf_token="token-from-hugging-face",
)
Diarization will produce a directory structure organized by unlabeled speakers with the audio clips that were clustered together.
For example, if "0.wav" had three speakers, the produced directory structure may look like the following tree:
0/
├── SPEAKER_00
│ ├── 567-12928.wav
│ ├── ...
│ └── 76192-82901.wav
├── SPEAKER_01
│ ├── 34123-38918.wav
│ ├── ...
│ └── 88212-89111.wav
└── SPEAKER_02
├── ...
└── 53998-62821.wav
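Before cleaning, it can help to see how many clips landed in each unlabeled cluster. A minimal sketch using pathlib, assuming the output directory for "0.wav" is named 0/ as in the tree above:

from pathlib import Path

# Count the audio clips assigned to each unlabeled speaker cluster
for speaker_dir in sorted(Path("0").glob("SPEAKER_*")):
    clip_count = len(list(speaker_dir.glob("*.wav")))
    print(f"{speaker_dir.name}: {clip_count} clips")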
We leave it to you as a user to then go through these directories, remove any audio clips that were incorrectly clustered together, and rename the sub-directories to their correct speaker labels. For example, labeled sub-directories may look like the following tree:
0/
├── A
│ ├── 567-12928.wav
│ ├── ...
│ └── 76192-82901.wav
├── B
│ ├── 34123-38918.wav
│ ├── ...
│ └── 88212-89111.wav
└── D
├── ...
└── 53998-62821.wav
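If you prefer to do the renaming step in code rather than in a file browser, a small sketch is shown here; the cluster-to-speaker mapping is purely illustrative and should come from listening to a few clips in each cluster first:

from pathlib import Path

# Illustrative mapping only; determine the real speaker for each cluster
# by listening to its clips before renaming anything.
cluster_to_speaker = {"SPEAKER_00": "A", "SPEAKER_01": "B", "SPEAKER_02": "D"}

recording_dir = Path("0")
for cluster_name, speaker_label in cluster_to_speaker.items():
    cluster_dir = recording_dir / cluster_name
    if cluster_dir.exists():
        cluster_dir.rename(recording_dir / speaker_label)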
Once you have annotated what you think are enough recordings, you can try preparing a dataset for training.
The following functions will prepare the audio for training:
from speakerbox import preprocess
dataset = preprocess.expand_labeled_diarized_audio_dir_to_dataset(
labeled_diarized_audio_dir=[
"0/", # The cleaned and checked audio clips for recording id 0
"1/", # ... recording id 1
"2/", # ... recording id 2
"3/", # ... recording id 3
"4/", # ... recording id 4
"5/", # ... recording id 5
]
)
dataset_dict, value_counts = preprocess.prepare_dataset(
dataset,
# good if you have large variation in number of data points for each label
equalize_data_within_splits=True,
# set seed to get a reproducible data split
seed=60,
)
# You can print the value_counts dataframe to see how many audio clips of each label
# (speaker) are present in each data subset.
value_counts
Once you have your dataset prepared and available, you can provide it directly to the training function to begin training a new model.
Training will store the model in a new directory called trained-speakerbox (parametrizable). The eval_model function will store a file called results.md with the accuracy, precision, and recall of the model, and additionally store a file called validation-confusion.png, which is a confusion matrix.
from speakerbox import train, eval_model
# dataset_dict comes from previous preparation step
train(dataset_dict)
eval_model(dataset_dict["valid"])
Once you have a trained model, you can use it against a new audio file:
from speakerbox import apply
annotation = apply(
"new-audio.wav",
"path-to-my-model-directory/",
)
The apply function returns a pyannote.core.Annotation.
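Because the return value is a standard pyannote.core.Annotation, you can read off (start_time, end_time, speaker) tuples like those shown in the introduction, for example:

# Iterate the returned Annotation and print (start_time, end_time, speaker)
for segment, _, speaker in annotation.itertracks(yield_label=True):
    print(segment.start, segment.end, speaker)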
See CONTRIBUTING.md for information related to developing the code.
@article{Brown2023,
doi = {10.21105/joss.05132},
url = {https://doi.org/10.21105/joss.05132},
year = {2023},
publisher = {The Open Journal},
volume = {8},
number = {83},
pages = {5132},
author = {Eva Maxfield Brown and To Huynh and Nicholas Weber},
title = {Speakerbox: Few-Shot Learning for Speaker Identification with Transformers},
journal = {Journal of Open Source Software}
}
MIT License