A lighter-weight (perhaps!) Text-to-Speech system for Chinese/Mandarin synthesis, inspired by Tacotron & FastSpeech2 & RefineGAN.
It is also my shitty graduation design project, just a toy, so lower your expectations :)
Since TransTacoS is implemented in TensorFlow while RetuneGAN is in PyTorch, you could separate them by creating virtual envs, but they are unlikely to conflict, so you can try putting everything together (a quick sanity check is shown after this list):

- `tensorflow-gpu==1.14.0 tensorboard==1.14.0`, following https://tensorflow.google.cn/install/pip
- `torch==1.8.0+cu1xx torchaudio==0.8.0`, following https://pytorch.org/, where `cu1xx` is your CUDA version
- `pip install -r requirements.txt` for the rest of the dependencies
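If you do put everything in one env, a quick check like this (my own snippet, not part of the repo) confirms that both frameworks load and see the GPU:

```python
# Quick environment sanity check (not part of the repo).
import tensorflow as tf   # expected 1.14.0
import torch              # expected 1.8.0+cu1xx

print('tensorflow:', tf.__version__, 'GPU:', tf.test.is_gpu_available())
print('torch:', torch.__version__, 'CUDA:', torch.cuda.is_available())
```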
To train:

- prepare your dataset following the skeleton in `transtacos/dataset/__skel__.py`
- refer to the `Makefile`:
  - `cd transtacos && make preprocess` to prepare acoustic features (linear/mel/f0/c0/zcr); a rough sketch of these features is shown after this list
  - `cd transtacos && make train` to train TransTacoS
  - `cd retunegan && make finetune` to train RetuneGAN using the preprocessed linear spectrograms (rather than from raw wave)
To deploy, refer to the `Makefile`:

- `cd transtacos && make server` to start the TransTacoS headless HTTP server (default at port 5105)
- `cd retunegan && make server` to start the RetuneGAN headless HTTP server (default at port 5104)
- `python app.py` to start the WebUI app (default at port 5103)

Then open http://localhost:5103, now have a try! (A rough client sketch follows below.)
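If you want to skip the WebUI and hit the headless servers directly, something along these lines should work; note that the `/synth` route and the `txt` field are guesses for illustration only, so check `app.py` and the server code for the real endpoints and payloads:

```python
# Hypothetical client for the headless servers; route and field names are guesses.
import requests

r = requests.post('http://localhost:5105/synth', json={'txt': '你好，世界'})  # TransTacoS front-end
r.raise_for_status()
# ...then feed the returned acoustic features to the RetuneGAN server on port 5104,
# or simply use the WebUI at http://localhost:5103, which wires both together.
```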
Frankly speaking, TransTacoS didn't improve anything profoundly over Tacotron; I just found that a shallower network leads to lower mel_loss, so maybe a simple embed+decoder is already enough :(
Things I also tried in TransTacoS:

- predicting f0/sp/ap features so that we can use the WORLD vocoder: sp is OK, because it resembles mel very much, but ap requires to be carefully normalized, and accurate f0 is even harder to predict
- istft synthesis with additionally predicted f0 and dyn, so that the vocoder might benefit, but predicting f0 and dyn from only mel seems not reasonable

RetuneGAN additionally takes as inputs:

- a reference wav generated using Griffin-Lim
- a u/v mask from hand-tuned zcr/c0 thresholds (for Split-G only); a rough sketch of such a mask follows below
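As a minimal sketch of what such a hand-tuned u/v decision could look like; the threshold values and the use of frame RMS as c0 are assumptions, not the numbers actually used in retunegan:

```python
# Minimal sketch of a hand-tuned u/v (voiced/unvoiced) mask from zcr and c0.
# Threshold values and using frame RMS as "c0" are assumptions.
import numpy as np
import librosa

def uv_mask(y, n_fft=1024, hop=256, zcr_thresh=0.2, c0_thresh=0.01):
    zcr = librosa.feature.zero_crossing_rate(y, frame_length=n_fft, hop_length=hop)[0]
    c0  = librosa.feature.rms(y=y, frame_length=n_fft, hop_length=hop)[0]
    # voiced frames: enough energy and few zero crossings
    voiced = (c0 > c0_thresh) & (zcr < zcr_thresh)
    return voiced.astype(np.float32)   # 1.0 = voiced, 0.0 = unvoiced, per frame
```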
Generators:

- UNet-G (encoder-decoder generator): modified from RefineGAN; we use the output of Griffin-Lim as the reference wav, rather than an F0/C0-guided hand-crafted speech template
- Split-G (split u/v generator): self-designed, inspired by Multi-Band MelGAN, but I found the generated quality is holy shit :(
Building blocks:

- ResStack: borrowed from MelGAN
- ResBlock: modified from HiFiGAN

Discriminators:

- MSD (multi scale discriminator): borrowed from MelGAN; I think it's good for plosive consonants
- MPD (multi period discriminator): borrowed from HiFiGAN; I take it as a stack-up of multiple MSDs
- MTD (multi stft discriminator): modified from UnivNet; it has two work modes depending on its input (MPSD seems better indeed ...), as sketched after this list:
  - MPSD (multi parameter spectrogram discriminator): like in UnivNet, but we let it judge both the phase part and the magnitude part; its input is [(mag_real, phase_real), (mag_fake, phase_fake)], thus it distinguishes real/fake stft data
  - PHD (phase discriminator): self-designed, cares more about phase, since l_mstft has already regulated magnitude; its input is [(mag_real, phase_real), (mag_real, phase_fake)], thus it ONLY distinguishes real/fake phase
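To make the two modes concrete, here is a minimal sketch of how the input pairs could be assembled; the stft parameters and function names are illustrative, not the actual retunegan code:

```python
# Sketch of the two MTD input modes; parameters and structure are illustrative.
import torch

def mag_phase(wav, n_fft=1024, hop=256):
    spec = torch.stft(wav, n_fft, hop_length=hop,
                      window=torch.hann_window(n_fft), return_complex=True)
    return spec.abs(), spec.angle()

def mtd_inputs(real_wav, fake_wav, mode='MPSD'):
    mag_r, pha_r = mag_phase(real_wav)
    mag_f, pha_f = mag_phase(fake_wav)
    if mode == 'MPSD':   # judge real/fake stft data (magnitude + phase)
        return [(mag_r, pha_r), (mag_f, pha_f)]
    else:                # 'PHD': real magnitude on both sides, only phase differs
        return [(mag_r, pha_r), (mag_r, pha_f)]
```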
Losses:

- l_adv (adversarial loss): modified from HiFiGAN, but relativized
- l_fm (feature map loss): borrowed from MelGAN
- l_mstft (multi stft loss): modified from Parallel WaveGAN, but we calculate mel_loss rather than linear_loss (a rough sketch follows after this list)
- l_env (envelope loss): borrowed from RefineGAN
- l_dyn (dynamic loss): self-designed, inspired by l_env
- l_sm (strip mirror loss): self-designed, but might hurt audio quality :(

Oh my dude, it's really a big stitched-together monster :(
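For reference, a rough sketch of a multi-resolution stft loss computed on mel (rather than linear) spectrograms; the resolutions, n_mels and the log-L1 form are assumptions, not the exact formulation used in retunegan:

```python
# Rough sketch of a multi-resolution mel-spectrogram loss; parameters are assumptions.
import torch
import torchaudio

RESOLUTIONS = [(512, 128), (1024, 256), (2048, 512)]   # (n_fft, hop_length)

def multi_mel_loss(real_wav, fake_wav, sr=22050, n_mels=80):
    loss = 0.0
    for n_fft, hop in RESOLUTIONS:
        mel = torchaudio.transforms.MelSpectrogram(
            sample_rate=sr, n_fft=n_fft, hop_length=hop, n_mels=n_mels)
        m_r = torch.log(mel(real_wav) + 1e-5)
        m_f = torch.log(mel(fake_wav) + 1e-5)
        loss = loss + torch.nn.functional.l1_loss(m_f, m_r)
    return loss / len(RESOLUTIONS)
```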
Some observations:

- the generator works in an encode-merge-decode manner (UNet architecture)
- the adversarial losses help most on plosive consonants (b/p/g/k/d/t), while mstft loss contributes little to consonants
- hop_length in stft is much larger than stride in the discriminators, thus mstft loss is usually more coarse than adversarial loss in the time domain

Codes referred to:

Ideas plagiarized from:

Code release kept under the MIT license, greatest thanks to all the authors!! :)
by Armit 2022/02/15 2022/05/25