PyTorch implementation of RPNSD. The code is largely based on faster-rcnn.pytorch, a Faster R-CNN implementation by jwyang.
git clone
variable in
, the current default is ~/anaconda3/bin.
conda install pytorch==0.4.0 cuda91 torchvision pillow"<7" -c pytorch
pip install -r requirements.txt
cd tools
make KALDI=<path/to/a/compiled/kaldi/directory>
# Select the backend used by from "local", "sge", "slurm", or "ssh"
The purpose of this step includes
Train on the Mixer6 + SRE + SWBD dataset. The default setting uses a single GPU and takes about 4 days.
A pretrained model is available at pretrain-model.
Adapt the model on in-domain data. Since we use 5-fold cross-validation, each fold trains on 400 utterances from the CALLHOME dataset and tests on the remaining 100.
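The 400/100 split described above can be illustrated with a minimal sketch. It is not the repository's actual fold assignment: the function name `five_fold_splits`, the seeded shuffle, and the placeholder utterance IDs are all assumptions made for illustration.

```python
import random


def five_fold_splits(utt_ids, num_folds=5, seed=0):
    """Yield (train_ids, test_ids) pairs for k-fold cross-validation.

    Hypothetical helper: the shuffle and the seed are assumptions,
    not the fold assignment used by the RPNSD scripts.
    """
    ids = sorted(utt_ids)
    random.Random(seed).shuffle(ids)
    for k in range(num_folds):
        # Every num_folds-th utterance, starting at offset k, is held out.
        test = ids[k::num_folds]
        held_out = set(test)
        train = [u for u in ids if u not in held_out]
        yield train, test


# Placeholder IDs standing in for 500 CALLHOME utterances.
utts = ["utt%03d" % i for i in range(500)]
folds = list(five_fold_splits(utts))
for k, (train, test) in enumerate(folds):
    print("fold %d: %d train / %d test" % (k, len(train), len(test)))
```

Each utterance lands in exactly one test set, so after five adaptation runs every utterance has been scored by a model that did not see it during adaptation.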
Inference stage.
An example from the CALLHOME dataset. The first stream is the ground-truth label, the second is the output of the x-vector system, and the third is the output of RPNSD.
Title={Speaker Diarization with Region Proposal Network},
Author={Huang, Zili and Watanabe, Shinji and Fujita, Yusuke and Garcia, Paola and Shao, Yiwen and Povey, Daniel and Khudanpur, Sanjeev},
Booktitle={Accepted to ICASSP 2020},
Author = {Jianwei Yang and Jiasen Lu and Dhruv Batra and Devi Parikh},
Title = {A Faster Pytorch Implementation of Faster R-CNN},
Journal = {},
Year = {2017}