The official PyTorch implementation of "Frame-wise streaming end-to-end speaker diarization with non-autoregressive self-attention-based attractors", accepted at ICASSP 2024.
Paper :star_struck: | Issues :sweat_smile: | Lab :hear_no_evil: | Contact :kissing_heart:
This work proposes a frame-wise online/streaming end-to-end neural diarization (FS-EEND) method that operates in a frame-in-frame-out fashion. To detect a flexible number of speakers frame by frame and to extract/update their corresponding attractors, we leverage a causal speaker embedding encoder and an online non-autoregressive self-attention-based attractor decoder. A look-ahead mechanism allows leveraging a few future frames to effectively detect new speakers in real time and adaptively update speaker attractors.
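The look-ahead mechanism can be sketched as a self-attention mask in which each frame attends to all past/current frames plus at most `lookahead` future frames. The following is a minimal illustration; `streaming_attn_mask` is a hypothetical helper, not the repo's actual implementation:

```python
import numpy as np

def streaming_attn_mask(num_frames: int, lookahead: int) -> np.ndarray:
    """Boolean mask for causal self-attention with look-ahead.

    mask[t, s] is True when frame t may attend to frame s, i.e. to all
    past/current frames (s <= t) plus up to `lookahead` future frames.
    """
    idx = np.arange(num_frames)
    return idx[None, :] <= idx[:, None] + lookahead

# With lookahead=1, frame 0 sees frames 0-1 but not frame 2.
mask = streaming_attn_mask(4, lookahead=1)
```

In practice such a mask is passed to the encoder's attention layers so the model stays streamable with a fixed algorithmic latency of `lookahead` frames.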
```shell
git clone https://github.com/Audio-WestlakeU/FS-EEND.git
```
Prepare Kaldi-style data by referring to here. Modify conf/xxx.yaml according to your own paths.
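For orientation, a typical Kaldi-style data directory looks like the following (an illustrative layout only; the exact set of files this recipe expects may differ):

```
data/simu/train/
├── wav.scp    # <recording-id> <path-to-wav>
├── segments   # <utterance-id> <recording-id> <start-time> <end-time>
├── utt2spk    # <utterance-id> <speaker-id>
├── spk2utt    # <speaker-id> <utterance-id> ...
└── rttm       # reference diarization annotations
```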
Start training on simulated data by

```shell
python train_dia.py --configs conf/spk_onl_tfm_enc_dec_nonautoreg.yaml --gpus YOUR_DEVICE_ID
```
Fine-tune on CALLHOME data by

```shell
python train_dia_fintn_ch.py --configs conf/spk_onl_tfm_enc_dec_nonautoreg_callhome.yaml --gpus YOUR_DEVICE_ID
```
Run inference by

```shell
python train_diaxxx.py --configs conf/xxx_infer.yaml --gpus YOUR_DEVICE_ID --test_from_folder YOUR_CKPT_SAVE_DIR
```
```shell
# generate speech activity probabilities (diarization results)
cd visualize
python gen_h5_output.py
# compute metrics
python metrics.py --configs conf/xxx_infer.yaml
```
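The metrics step computes the diarization error rate (DER) from the saved frame-wise activity probabilities. As a rough illustration of frame-level DER, here is a hypothetical `frame_der` helper (not the repo's code; it assumes the hypothesis speakers are already permuted to best match the reference):

```python
def frame_der(ref, hyp):
    """Frame-level DER: (miss + false alarm + confusion) / total reference speech.

    ref, hyp: sequences of per-frame 0/1 speaker-activity tuples, with the
    hypothesis speaker order already aligned to the reference.
    """
    err = total = 0
    for r, h in zip(ref, hyp):
        n_ref, n_hyp = sum(r), sum(h)
        n_correct = sum(1 for a, b in zip(r, h) if a == b == 1)
        err += max(n_ref, n_hyp) - n_correct  # miss/FA/confusion combined
        total += n_ref
    return err / total if total else 0.0

# One confused frame out of three reference speech frames -> DER = 1/3.
der = frame_der([(1, 0), (1, 0), (0, 1)], [(1, 0), (0, 1), (0, 1)])
```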
# Performance
Please note that we use Switchboard Cellular (Part 1 and 2) and the 2005-2008 NIST Speaker Recognition Evaluation (SRE) datasets to generate the simulated data (4054 speakers in total).
| Dataset | DER (%) | ckpt |
| :--------: | :--: | :--: |
| Simu1spk | 0.6 | [simu_avg_41_50epo.ckpt](https://drive.google.com/file/d/1JYr1zOxsHwQxIk9W4vwxzUfJFtaTQ02q/view?usp=sharing) |
| Simu2spk | 4.3 | same as above |
| Simu3spk | 9.8 | same as above |
| Simu4spk | 14.7 | same as above |
| CH2spk | 10.0 | [ch_avg_91_100epo.ckpt](https://drive.google.com/file/d/1i1Ow9IfPSwBRyRazY8-VX3z4ngDvSwx6/view?usp=sharing) |
| CH3spk | 15.3 | same as above |
| CH4spk | 21.8 | same as above |
The ckpts are the average of model parameters for the last 10 epochs.
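Checkpoint averaging amounts to a key-wise mean over the saved state dicts. A minimal sketch with a hypothetical `average_state_dicts` helper, shown with plain floats for brevity (in practice the values are torch tensors, to which the same key-wise mean applies):

```python
def average_state_dicts(state_dicts):
    """Average parameters key-by-key across several checkpoints."""
    keys = state_dicts[0].keys()
    return {k: sum(sd[k] for sd in state_dicts) / len(state_dicts) for k in keys}

# Averaging two toy "checkpoints".
avg = average_state_dicts([{"w": 1.0, "b": 0.0}, {"w": 3.0, "b": 2.0}])
```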
If you want to check the performance of a ckpt on CALLHOME:

```shell
python train_dia_fintn_ch.py --configs conf/spk_onl_tfm_enc_dec_nonautoreg_callhome_infer.yaml --gpus YOUR_DEVICE_ID --test_from_folder YOUR_CKPT_SAVE_DIR
```
Note that in train_dia_fintn_ch.py you need to change

```python
ckpts = [x for x in all_files if (".ckpt" in x) and ("epoch" in x) and int(x.split("=")[1].split("-")[0]) >= configs["log"]["start_epoch"] and int(x.split("=")[1].split("-")[0]) <= configs["log"]["end_epoch"]]
state_dict = torch.load(test_folder + "/" + c, map_location="cpu")["state_dict"]
```

to

```python
ckpts = [x for x in all_files if ".ckpt" in x]
state_dict = torch.load(test_folder + "/" + c, map_location="cpu")
```
# Reference code
- [EEND](https://github.com/hitachi-speech/EEND)
- [EEND_PyTorch](https://github.com/Xflick/EEND_PyTorch)
# Citation
If you want to cite this paper:
```bibtex
@misc{liang2023framewise,
      title={Frame-wise streaming end-to-end speaker diarization with non-autoregressive self-attention-based attractors},
      author={Di Liang and Nian Shao and Xiaofei Li},
      year={2023},
      eprint={2309.13916},
      archivePrefix={arXiv},
      primaryClass={eess.AS}
}
```