Implementation of the method described in the Speech Resynthesis from Discrete Disentangled Self-Supervised Representations.
Abstract: We propose using self-supervised discrete representations for the task of speech resynthesis. To generate disentangled representation, we separately extract low-bitrate representations for speech content, prosodic information, and speaker identity. This allows to synthesize speech in a controllable manner. We analyze various state-of-the-art, self-supervised representation learning methods and shed light on the advantages of each method while considering reconstruction quality and disentanglement properties. Specifically, we evaluate the F0 reconstruction, speaker identification performance (for both resynthesis and voice conversion), recordings' intelligibility, and overall quality using subjective human evaluation. Lastly, we demonstrate how these representations can be used for an ultra-lightweight speech codec. Using the obtained representations, we can get to a rate of 365 bits per second while providing better speech quality than the baseline methods.
Requirements:
git clone https://github.com/facebookresearch/speech-resynthesis.git
cd speech-resynthesis
pip install -r requirements.txt
data/LJSpeech-1.1
folder.bash
python ./scripts/preprocess.py \
--srcdir data/LJSpeech-1.1/wavs \
--outdir data/LJSpeech-1.1/wavs_16khz \
--pad
data/VCTK-Corpus
folder.python ./scripts/preprocess.py \
--srcdir data/VCTK-Corpus/wav48_silence_trimmed \
--outdir data/VCTK-Corpus/wav16_silence_trimmed_padded \
--pad --postfix mic2.flac
To train F0 quantizer model, use the following command:
python -m torch.distributed.launch --nproc_per_node 8 train_f0_vq.py \
--checkpoint_path checkpoints/lj_f0_vq \
--config configs/LJSpeech/f0_vqvae.json
Set <NUM_GPUS>
to the number of availalbe GPUs on your machine.
To train a resynthesis model, use the following command:
python -m torch.distributed.launch --nproc_per_node <NUM_GPUS> train.py \
--checkpoint_path checkpoints/lj_vqvae \
--config configs/LJSpeech/vqvae256_lut.json
Currently, we support the following training schemes:
Dataset | SSL Method | Dictionary Size | Config Path |
---|---|---|---|
LJSpeech | HuBERT | 100 | configs/LJSpeech/hubert100_lut.json |
LJSpeech | CPC | 100 | configs/LJSpeech/cpc100_lut.json |
LJSpeech | VQVAE | 256 | configs/LJSpeech/vqvae256_lut.json |
VCTK | HuBERT | 100 | configs/VCTK/hubert100_lut.json |
VCTK | CPC | 100 | configs/VCTK/cpc100_lut.json |
VCTK | VQVAE | 256 | configs/VCTK/vqvae256_lut.json |
To generate, simply run:
python inference.py \
--checkpoint_file checkpoints/vctk_cpc100 \
-n 10 \
--output_dir generations
To synthesize multiple speakers:
python inference.py \
--checkpoint_file checkpoints/vctk_cpc100 \
-n 10 \
--vc \
--input_code_file datasets/VCTK/cpc100/test.txt \
--output_dir generations_multispkr
You can also generate with codes from a different dataset:
python inference.py \
--checkpoint_file checkpoints/lj_cpc100 \
-n 10 \
--input_code_file datasets/VCTK/cpc100/test.txt \
--output_dir generations_vctk_to_lj
To quantize new datasets with CPC or HuBERT follow the instructions described in the GSLM code.
To parse CPC output:
python scripts/parse_cpc_codes.py \
--manifest cpc_output_file \
--wav-root wav_root_dir \
--outdir parsed_cpc
To parse HuBERT output:
python parse_hubert_codes.py \
--codes hubert_output_file \
--manifest hubert_tsv_file \
--outdir parsed_hubert
First, you will need to download LibriLight dataset and move it to data/LibriLight
.
For VQVAE, train a vqvae model using the following command:
python -m torch.distributed.launch --nproc_per_node <NUM_GPUS> train.py \
--checkpoint_path checkpoints/ll_vq \
--config configs/LibriLight/vqvae256.json
To extract VQVAE codes:
python infer_vqvae_codes.py \
--input_dir folder_with_wavs_to_code \
--output_dir vqvae_output_folder \
--checkpoint_file checkpoints/ll_vq
To parse VQVAE output:
python parse_vqvae_codes.py \
--manifest vqvae_output_file \
--outdir parsed_vqvae
You may find out more about the license here.
@inproceedings{polyak21_interspeech,
author={Adam Polyak and Yossi Adi and Jade Copet and
Eugene Kharitonov and Kushal Lakhotia and
Wei-Ning Hsu and Abdelrahman Mohamed and Emmanuel Dupoux},
title={{Speech Resynthesis from Discrete Disentangled Self-Supervised Representations}},
year=2021,
booktitle={Proc. Interspeech 2021},
}
This implementation uses code from the following repos: HiFi-GAN and Jukebox, as described in our code.