Implementation of "End-to-end speaker segmentation for overlap-aware resegmentation" with modifications for speaker change detection. Learn more in the presentation.
This code is based on pyannote/pyannote-audio. Some functions are identical to those in pyannote.audio, some are slightly modified, and some are heavily modified. Additionally, there is novel code to perform speaker change detection and to connect everything together.
This code can prepare data, train, and perform inference for two different tasks: speaker change detection and speaker segmentation. However, the outputs from both models/configurations can be processed into speaker change points.
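For instance, a frame-level change-probability curve (such as the speaker change detection model's output) can be reduced to discrete change points by thresholded peak picking. The function below is an illustrative sketch, not code from this repository; the frame duration and tuning parameters are assumptions.

```python
import numpy as np
from scipy.signal import find_peaks

def probs_to_change_points(probs, frame_duration=0.1, threshold=0.5, min_gap=0.25):
    """Turn a per-frame change-probability curve into change timestamps (seconds).

    Peaks must exceed `threshold` and be at least `min_gap` seconds apart.
    Illustrative only -- this repo's actual post-processing may differ.
    """
    min_distance = max(1, int(round(min_gap / frame_duration)))
    peak_frames, _ = find_peaks(probs, height=threshold, distance=min_distance)
    return peak_frames * frame_duration

# Two clear peaks at frames 2 and 6 -> change points near 0.2 s and 0.6 s.
curve = np.array([0.1, 0.2, 0.9, 0.2, 0.1, 0.3, 0.8, 0.2, 0.1])
print(probs_to_change_points(curve))
```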
Model Weights (including short_scd_bigdata.ckpt): Available from this Google Drive folder.
Training GIFs (more details in the presentation):

| Speaker Change Detection | Segmentation |
|---|---|
Speaker change detection identifies timestamps where the active speaker changes. If someone starts speaking, stops speaking, and starts speaking again (and no one else spoke in between), no speaker change occurs. If two people are speaking and one of them stops, or another person starts speaking, a speaker change occurs. See slide 6 of the presentation.
Segmentation splits a conversation into turns: it identifies when people are speaking. This is not voice activity detection, since if multiple people are talking the model outputs probabilities indicating multiple simultaneous speakers. Nor is it speaker diarization, because speakers are not tracked by identity across the entire length of an audio file.
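To make the relationship between the two tasks concrete, here is a minimal sketch (not code from this repository) of how frame-level segmentation output could be reduced to speaker change points: a change is flagged whenever the set of active speakers differs from the most recent non-silent set, so silence between two turns by the same speaker produces no change, matching the definition above.

```python
import numpy as np

def activity_to_change_points(activity, frame_duration=0.1, threshold=0.5):
    """Derive speaker change timestamps from a (num_frames, num_speakers)
    matrix of per-speaker speech probabilities.

    A change is recorded when the set of active speakers differs from the
    most recent *non-silent* set, so A -> silence -> A is not a change.
    Illustrative sketch only; the repo's actual conversion may differ.
    """
    change_points = []
    previous = None  # active-speaker set at the last non-silent frame
    for t, frame in enumerate(activity > threshold):
        current = frozenset(np.flatnonzero(frame))
        if not current:
            continue  # silence: defer judgement until someone speaks
        if previous is not None and current != previous:
            change_points.append(t * frame_duration)
        previous = current
    return change_points

# Speaker A talks, pauses, resumes (no change), then B joins (change),
# then A stops while B continues (change).
activity = np.array([
    [1, 0], [1, 0], [1, 0],   # A speaking
    [0, 0], [0, 0],           # silence
    [1, 0],                   # A resumes: no change
    [1, 1],                   # B joins: change near 0.6 s
    [0, 1],                   # A stops: change near 0.7 s
], dtype=float)
print(activity_to_change_points(activity))
```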
The code is mostly organized according to PyTorch Lightning's structure. Package management is handled by Poetry.
The dataset used is the AMI Meeting Corpus. It was downloaded and prepared using the scripts available in the pyannote/AMI-diarization-setup GitHub repository.
1. Clone the repository:

   ```bash
   git clone --recurse-submodules https://github.com/HHousen/speaker-change-detection/ && cd speaker-change-detection
   ```

2. Install the dependencies with `poetry install`, then activate the environment with `poetry shell`.

3. Download the AMI corpus:

   ```bash
   cd AMI-diarization-setup/pyannote && sh download_ami.sh
   ```
Training: Run `python train.py` (more details). Set `DO_SCD` in `train.py` to `True` to perform speaker change detection or to `False` to perform segmentation.

Inference: Run `python process_file.py`, replacing `short_scd_bigdata.ckpt` with the path to your model checkpoint and `test_audio_similar.wav` with the path to your audio file. Set `DO_SCD` to the same value used for training.

- `train.py`: Execute to train a model on the dataset. Loads the data using a `SegmentationAndSCDData` datamodule and instantiates a `SSCDModel` model. Logs to Weights & Biases and trains on a GPU using the PyTorch Lightning `Trainer`.
- `model.py`: Defines the `SSCDModel` model architecture, training loop, optimizers, loss function, etc.
- `sincnet.py`: An implementation of the SincNet model, which is used in `SSCDModel`, from this GitHub repo.
- `data.py`: Defines the `SegmentationAndSCDData` datamodule, which processes the data into the format accepted by the model. Uses `pyannote.database` to load and do some initial processing of the data.
- `inference.py`: Contains the functions necessary to perform inference on a complete audio file. Can be used easily on a file by running `process_file.py`.
- `process_file.py`: Processes an audio file end-to-end using the `Inference` object defined in `inference.py`.
- `process_file.ipynb`: Similar to `process_file.py`, but as a Jupyter notebook to take advantage of `pyannote.core`'s plotting functions.

Note: `database.yml` tells `pyannote.database` where the data is located.
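For reference, a `pyannote.database` configuration file generally maps database names to audio file paths and defines protocols pointing at annotation files. The fragment below only illustrates that general shape; the paths and protocol name are assumptions, so use the `database.yml` provided with this repository and its AMI-diarization-setup submodule rather than writing your own.

```yaml
# Illustrative shape of a pyannote.database config -- paths are assumptions,
# not the actual contents of this repo's database.yml.
Databases:
  AMI: path/to/amicorpus/{uri}.wav        # where to find each audio file

Protocols:
  AMI:
    SpeakerDiarization:
      only_words:
        train:
          uri: path/to/train_uris.txt           # list of file IDs
          annotation: path/to/train/{uri}.rttm  # reference speaker labels
```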
This idea and code are primarily based on the paper "End-to-end speaker segmentation for overlap-aware resegmentation" by Hervé Bredin & Antoine Laurent.
Also, SincNet is a key component of the model architecture: "Speaker Recognition from Raw Waveform with SincNet" by Mirco Ravanelli and Yoshua Bengio.