Source code for the paper "Who is Real Bob? Adversarial Attacks on Speaker Recognition Systems".
Demonstration Website: FAKEBOB Website (including a One-Minute Video Preview)
Our paper has been accepted by the 42nd IEEE Symposium on Security and Privacy (IEEE S&P, Oakland), 2021.
Paper link Who is real Bob? Adversarial Attacks on Speaker Recognition Systems.
Oakland 2021 Presentation Slide Session #5-GuangkeChen-WhoisRealBob
Oakland 2021 Talk Video: Presentation Video
Cite our paper as follows:
```bibtex
@INPROCEEDINGS{chen2019real,
  author    = {G. Chen and S. Chen and L. Fan and X. Du and Z. Zhao and F. Song and Y. Liu},
  booktitle = {2021 IEEE Symposium on Security and Privacy (SP)},
  title     = {Who is Real Bob? Adversarial Attacks on Speaker Recognition Systems},
  year      = {2021},
  issn      = {2375-1207},
  pages     = {55-72},
  doi       = {10.1109/SP40001.2021.00004},
}
```
For example, suppose `adver_thresh=0` and the initial loss of a benign voice is 0.01. This voice is not robust, so adding random noise is enough to make it adversarial. Hence, all `samples_per_draw` noisy samples are adversarial voices. If the outermost maximum operation is retained, the losses of all these noisy samples are zero, so the estimated gradient is zero and the sample is never updated throughout the iteration procedure.
That is why we could only achieve 99% ASR, not 100%, in our paper. Using the updated loss function, we can achieve 100% ASR.
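The effect described above can be reproduced with a toy NES-style gradient estimate. Everything below (the scoring function, threshold, and parameter values) is a made-up illustration, not the repository's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def score(x):
    # Toy stand-in for a speaker score; higher means "more target-like".
    return float(x.sum())

threshold = 0.0

def clipped_loss(x):
    # Loss with the outermost max retained: clipped at 0 once adversarial.
    return max(threshold - score(x), 0.0)

def plain_loss(x):
    # Updated loss: the outer max is dropped, so it stays informative.
    return threshold - score(x)

def nes_gradient(loss, x, sigma=0.001, samples_per_draw=50):
    # NES-style gradient estimate with antithetic Gaussian sampling.
    g = np.zeros_like(x)
    for _ in range(samples_per_draw // 2):
        u = rng.standard_normal(x.shape)
        g += loss(x + sigma * u) * u
        g -= loss(x - sigma * u) * u
    return g / (sigma * samples_per_draw)

x = np.full(4, 0.1)  # already adversarial: score(x) = 0.4 > threshold
print(np.allclose(nes_gradient(clipped_loss, x), 0))  # True: update stalls
print(np.allclose(nes_gradient(plain_loss, x), 0))    # False: still moves
```

With the clipped loss, every noisy sample has loss exactly 0, so the estimate vanishes and the voice is never updated; the unclipped loss keeps driving the score upward.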
You can either use the docker environment (recommended) or follow the manual installation.
To set up the environment, run `docker/setup.sh`. This will build the docker environment and download all required files.
After the setup is complete, you should be able to enter the docker environment by running `docker/run.sh`.
Run `docker/run.sh` and then, inside the docker environment, `python3 test.py`. Alternatively, you can run `docker/run.sh python3 test.py` to run the command in the docker environment. The `test.py` script will show you the baseline performance.
Run `docker/run.sh` and then `./attackMain.sh`, or run `docker/run.sh ./attackMain.sh`. You can find details in step nine of the manual installation.
Reproduce our experiment step by step:
Install Python 3.
Download and install Kaldi Toolkit.
After successfully building the Kaldi toolkit, run the `egs/voxceleb/v1` recipe with `./run.sh`. You may need to resolve a permission-denied problem with `chmod 777 ./run.sh`.
After the `egs/voxceleb/v1` recipe completes, copy (or make soft links for) the following files or directories to the `pre-models` directory of this repository.
Download our dataset. It comes from LibriSpeech; we selected part of it and converted the audios from .flac to .wav format.
Set the system variables that Kaldi relies on. We have made this as simple as possible: just copy and paste all the commands in `path_cmd.sh` into your `~/.bashrc`. Do not forget to modify `KALDI_ROOT` and `FAKEBOB_PATH` in `path_cmd.sh`.
`vim ~/.bashrc`
`source ~/.bashrc`
Build a unique speaker model for each enrolled speaker in the enrollment-set: `build_spk_models.py`.
Test the baseline performance of the ivector-PLDA-based and GMM-UBM-based OSI, CSI, and SV systems: `test.py`.
Generate adversarial voices for the speaker recognition systems (launch our attack FAKEBOB).
The default setting is `archi=iv`, `task=CSI`, `attack_type=targeted`. You can tune `epsilon`, `n_jobs`, and `samples_per_draw` to decrease the running time, but note that a too-small `samples_per_draw` may lead to slower convergence, and a too-large `n_jobs` may increase the time instead.
Since step 3 consumes quite a long time, you can instead download `pre-models.tgz` (493 MB; its MD5 should be a61fd0f2a470eb6f9bdba11f5b71a123). After downloading, just untar it to the location of FAKEBOB. Then you can skip both step 3 and step 4.
If you would like to use your own dataset or attack other speaker recognition systems (e.g., DNN-based systems), here are some tips.
Use your own dataset
Each audio file should be in `.wav` format and named in the form `ID-XXX.wav` (e.g., `1580-141084-0048.wav`), where `ID` denotes the unique speaker ID of the speaker.
You can use the `ffmpeg` tool to convert the format of your own audios.
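For instance, the speaker ID can be recovered from the `ID-XXX.wav` naming convention as in the minimal sketch below; the helper name is illustrative, and the repository's own parsing may differ:

```python
from pathlib import Path

def speaker_id(wav_path):
    # The hyphen-separated prefix of the file name is the speaker ID.
    return Path(wav_path).name.split("-")[0]

print(speaker_id("1580-141084-0048.wav"))  # 1580
```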
If your audios are not sampled at 16 kHz or not quantized with 16 bits, you need to modify `fs` and `bits_per_sample` in the relevant .py files or specify them when calling the corresponding functions. Here is the specification of the four sub-directories in `data/`; it should help you prepare your own dataset more easily.
- enrollment-set: voice data used for enrollment. One audio corresponds to one enrolled speaker.
- z-norm-set: voice data from imposters, used to obtain the z-norm mean and z-norm std for score normalization.
- test-set: each sub-folder contains the voice data of one enrolled speaker in enrollment-set. The name of each sub-folder should be the unique speaker ID of the corresponding speaker.
- illegal-set: each sub-folder contains the voice data of one imposter (imposters are not in enrollment-set). The name of each sub-folder should be the unique speaker ID of the corresponding speaker.

Attack other speaker recognition systems
To generate adversarial voices on other speaker recognition systems using our attack FAKEBOB, what you need to do is quite simple: just wrap your system by providing two interface functions, `score` and `make_decisions`. Please refer to `gmm_ubm_OSI.py`, `gmm_ubm_CSI.py`, `gmm_ubm_SV.py`, `ivector_PLDA_OSI.py`, `ivector_PLDA_CSI.py`, and `ivector_PLDA_SV.py` for details about the input and output arguments.
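As an illustration, such a wrapper might look like the sketch below. Everything here (the class name, the threshold handling, the toy mean-based scoring, and the exact return shapes) is an assumption for illustration only; the six .py files listed above define the actual argument conventions FAKEBOB expects.

```python
import numpy as np

class MySVSystem:
    """Hypothetical wrapper around a custom speaker verification system."""

    def __init__(self, threshold):
        self.threshold = threshold  # acceptance threshold of the system

    def score(self, audios):
        # Return one scalar score per input voice. Replace this toy
        # mean-based score with your model's real scoring routine
        # (e.g., a DNN embedding similarity).
        return [float(np.mean(a)) for a in audios]

    def make_decisions(self, audios):
        # Accept (1) when the score exceeds the threshold, reject (0) otherwise.
        scores = self.score(audios)
        return [int(s > self.threshold) for s in scores], scores

system = MySVSystem(threshold=0.5)
decisions, scores = system.make_decisions(
    [np.array([0.9, 0.8]), np.array([0.1, 0.2])]
)
print(decisions)  # [1, 0]
```

The key design point is that the attack only ever talks to your system through these two methods, so the internals (GMM-UBM, ivector-PLDA, or a DNN) are irrelevant to FAKEBOB.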