This is merely an experiment. The an4 dataset used here is not suitable for training a speaker verification model; it was chosen only because it trains quickly.
NeMo's examples and tutorials do not explicitly illustrate speaker verification, so this repo is an experiment in speaker verification with NeMo.
This step was done in a previous experiment, and the manifest files were moved directly into the data folder of this repo. The following describes how the manifests were obtained.
Run download_and_convert_an4.py to download the an4 dataset and convert .sph to .wav.
Generate manifest files:
find {data_dir}/an4/wav/an4_clstk -iname "*.wav" > data/an4/wav/an4_clstk/train_all.scp
Preview the first few lines:
head -n 3 {data_dir}/an4/wav/an4_clstk/train_all.scp
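The find command above can also be expressed in Python, which may be convenient when the list is built from a notebook. This is only a sketch: write_scp is a hypothetical helper name, not part of NeMo, and the sorting is added here for reproducibility (find itself does not sort).

```python
from pathlib import Path


def write_scp(wav_root, scp_path):
    """Write one .wav path per line, mirroring `find ... -iname "*.wav"`.

    Hypothetical helper for illustration; not part of NeMo.
    """
    root = Path(wav_root)
    # Case-insensitive match on the extension, like find's -iname.
    wav_paths = sorted(
        str(p) for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() == ".wav"
    )
    Path(scp_path).write_text("".join(f"{p}\n" for p in wav_paths))
    return wav_paths


# Example call (paths follow the layout used above):
# write_scp("data/an4/wav/an4_clstk", "data/an4/wav/an4_clstk/train_all.scp")
```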
Convert the .scp file to a manifest. Set the --split flag to split the data into training and development sets:
python {path-to/scp_to_manifest.py} --scp {paths-to/train_all.scp} --id -2 --out {path-to-opt/all_manifest.json} --split
scp_to_manifest.py src: https://github.com/NVIDIA/NeMo/blob/main/scripts/scp_to_manifest.py
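To make the conversion step less opaque, here is a rough sketch of what turning an .scp list into a JSON-lines manifest can look like. It is not the NeMo script: the function names are hypothetical, the field names (audio_filepath, offset, duration, label) are assumed from NeMo's usual manifest convention, and a plain seeded random split is swapped in for whatever split strategy the real script uses. The id_index argument plays the role of --id -2, taking the speaker label from the parent directory of each file.

```python
import json
import random
import wave
from pathlib import Path


def wav_duration(path):
    """Duration in seconds of a PCM .wav file, read from its header."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def scp_to_entries(scp_path, id_index=-2):
    """Build one manifest entry per path listed in the .scp file."""
    entries = []
    for line in Path(scp_path).read_text().splitlines():
        line = line.strip()
        if not line:
            continue
        # With id_index=-2 the speaker label is the file's parent directory.
        label = Path(line).parts[id_index]
        entries.append({
            "audio_filepath": line,
            "offset": 0,
            "duration": wav_duration(line),
            "label": label,  # assumed field name
        })
    return entries


def split_entries(entries, dev_fraction=0.1, seed=0):
    """Simple seeded random split (an assumption, not the script's method)."""
    shuffled = entries[:]
    random.Random(seed).shuffle(shuffled)
    n_dev = int(len(shuffled) * dev_fraction)
    return shuffled[n_dev:], shuffled[:n_dev]


def write_manifest(entries, out_path):
    """Write entries as JSON lines, one object per line."""
    with open(out_path, "w") as f:
        for entry in entries:
            f.write(json.dumps(entry) + "\n")
```

A random split like this does not guarantee that every speaker appears in both sets, so for real training a split stratified by speaker would be the safer choice.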
configuration file src: https://github.com/NVIDIA/NeMo/blob/main/examples/speaker_recognition/conf/SpeakerNet_verification_3x2x512.yaml
Run scripts/train_spk_ver_model.py to train a speaker verification model.
First, generate embeddings_manifest.json for testing. For this experiment, I created this manifest manually.
Run get_embs.py.
Since the purpose of this experiment is to verify my own voice, the test recordings are of me and are not uploaded to this repo. (The test audio was recorded with Praat.)
Calculate the cosine similarity of the two speaker embeddings to see how confident the model is that the two recordings come from the same speaker.
A first experiment score: 0.9686643297704525
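The cosine-similarity step can be sketched as follows. This is a minimal illustration in plain Python, assuming the two embeddings are available as equal-length sequences of floats; cosine_similarity is a name chosen here, and in practice the same computation would more likely be done with numpy or torch on the arrays produced by the embedding step.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors, in [-1, 1]."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

A score close to 1 means the embeddings point in nearly the same direction. The score is not a probability, so deciding "same speaker" still requires a threshold, typically tuned on held-out trial pairs.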