Hi, thanks for your interest in our work.
Hi, thanks very much for your reply, and sorry for the late response; I was trying to figure it out myself. I downloaded the checkpoint and tried to skip the pretraining and training stages, but I found that impossible, because I need the pretraining stage to generate "tts_pretrain_1/dump" and "tts_pretrain_1/data". I tried to pretrain, but after 5 days it still hasn't finished. Would it be possible for you to supply these two directories as well? Thanks again!
Hi, sorry for the delay.
You can generate tts_pretrain_1/dump and tts_pretrain_1/data by running only the preprocessing, without training, as follows:
$ ./run.sh --stage 1 --stop-stage 4
Hello,
- I have uploaded the pretrained model checkpoints to Hugging Face: Byte-based model and IPA-based model. These models are trained with the settings
I am trying to follow the instructions on Hugging Face, as I only want to use the pretrained model for inference.
git checkout 11a7d61312439111d4996d55935ede718d494262
fails with fatal: reference is not a tree: 11a7d61312439111d4996d55935ede718d494262, so I did git clone https://github.com/Takaaki-Saeki/zm-text-tts instead.
cd egs2/masmultts/tts_phn_css10_adap_residual_freeze
This path does not exist. Do you mean egs2/masmultts/tts1?
./run.sh --skip_data_prep false --skip_train true --download_model saefro991/tts_ipa_css10_7lang_textpretrain_residual_freeze
This fails with the following error:
2023-10-12T00:45:11 (data.sh:85:main) stage 0: local/data_prep.py
Processing CSS10 ...
Traceback (most recent call last):
File "/home/perry/PycharmProjects/zm-text-tts/egs2/masmultts/tts1/local/data_prep.py", line 334, in <module>
main()
File "/home/perry/PycharmProjects/zm-text-tts/egs2/masmultts/tts1/local/data_prep.py", line 316, in main
DataProcessor(
File "/home/perry/PycharmProjects/zm-text-tts/egs2/masmultts/tts1/local/data_prep.py", line 112, in __init__
with open(tsv_path_norm, "r") as fr:
FileNotFoundError: [Errno 2] No such file or directory: 'MasMulTTS/css10.tsv'
@Takaaki-Saeki @CindyTing
To simply run inference with the pretrained model on a different dataset, these are my steps:
git clone https://github.com/Takaaki-Saeki/zm-text-tts
# Install espnet
cd zm-text-tts/tools
./setup_anaconda.sh ${CONDA_PREFIX} zm-text-tts 3.10
conda activate zm-text-tts
make TH_VERSION=1.13.1 CUDA_VERSION=11.7
cd ..
pip install -e .
# Download the pretrained model
cd egs2
git clone https://huggingface.co/saefro991/tts_ipa_css10_7lang_textpretrain_residual_freeze
# Move the model so that the dump and exp folders are together with conf, scripts, utils etc in standard espnet format
mv tts_ipa_css10_7lang_textpretrain_residual_freeze/* tts1
cd tts1
If your language has IPA symbols that are not in exp/tts_train_raw_phn_none/config.yaml, you have to find substitutes. In my case, dealing with Mandarin data, I had to create a custom pinyin-to-IPA converter. Also, the provided config.yaml has extra keys: comment out lang_family_encoding and num_lang_family, otherwise you cannot load the pretrained model.
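A quick way to see which of your symbols the pretrained model cannot handle is to compare them against the token list stored in the training config. This is a minimal sketch, assuming the ESPnet2 config.yaml keeps the vocabulary under a token_list key; the phones variable is just a placeholder example:

import yaml

# Load the pretrained training config; ESPnet2 stores the model vocabulary
# under "token_list" in config.yaml (assumption: standard ESPnet2 layout).
with open("exp/tts_train_raw_phn_none/config.yaml") as f:
    cfg = yaml.safe_load(f)
known_tokens = set(cfg["token_list"])

# Placeholder input: whitespace-separated IPA phones for one utterance.
phones = "p i t a <sos/eos>".split()

# Any symbol reported here needs a substitute before inference will work.
print("Unknown symbols:", [p for p in phones if p not in known_tokens])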
Copy or symlink the masmultts/tts1/exp directory to your own dataset/tts1. Install the vocoder packages:
git clone https://github.com/kan-bayashi/ParallelWaveGAN.git
cd ParallelWaveGAN
pip install -e .
I wanted to batch-run on GPU, so I used the notebook code below to test. Simply define a list of IPA symbols to input and call save_wav(phones, lid=lid).
import os
os.environ['CUDA_LAUNCH_BLOCKING'] = "1"
from pathlib import Path
import soundfile as sf
import torch
from espnet2.text.token_id_converter import TokenIDConverter
from parallel_wavegan.utils import load_model as load_vocoder
from espnet2.bin.tts_inference import Text2Speech
PWD = %pwd
AISHELL_DIR = Path(PWD)
device = 'cuda'
#%%
corpus_dir=AISHELL_DIR
cwd = os.getcwd()
os.chdir(corpus_dir)
# Load the pretrained Text2Speech model (config + latest checkpoint) from the exp directory
pretrained_dir = corpus_dir / "exp/tts_train_raw_phn_none"
pretrained_model_file = pretrained_dir / "latest.pth"
pretrained_tts = Text2Speech.from_pretrained(
    train_config=pretrained_dir / 'config.yaml',
    model_file=pretrained_model_file,
    device=device,
)
pretrained_model = pretrained_tts.model
os.chdir(cwd)
#%%
# Load the HiFi-GAN vocoder checkpoint and prepare it for inference on GPU
vocoder_ckpt = '/home/perry/PycharmProjects/vocoders/hifigan16k_libritts_css10_vctk/checkpoint-2000000steps.pkl'
vocoder = load_vocoder(vocoder_ckpt)
vocoder.remove_weight_norm()
vocoder = vocoder.eval().to(device)
#%%
id_converter = TokenIDConverter(pretrained_tts.train_args.token_list)
save_dir = Path('/home/perry/PycharmProjects/present/prosody/outputs/zm-text-tts/aishell3')
#%%
# Disable GST, since no reference speech is supplied at inference time
pretrained_model.tts.use_gst = False
#%%
from kaldiio import ReadHelper
def read_xvectors(tts_dir, filename):
    # Read speaker x-vectors from a Kaldi ark file into a dict keyed by speaker id
    filetype = filename.split('.')[-1]
    pwd = os.getcwd()
    os.chdir(tts_dir)
    with ReadHelper(f'{filetype}:{filename}') as reader:
        xvector_dict = {i: xvector for i, xvector in reader}
    os.chdir(pwd)
    return xvector_dict
#%%
spk_xvectors = read_xvectors(str(pretrained_dir.parent), "dump/xvector/test/spk_xvector.ark")
#%%
# Language IDs mapped to the CSS10 speaker names used for the x-vector lookup
lid2spk = {
    2: 'css10_de', 3: 'css10_el', 8: 'css10_fi', 9: 'css10_fr', 11: 'css10_hu', 14: 'css10_nl', 17: 'css10_ru',
}
#%%
def save_wav(phones, filename='', lid=None, **kwargs):
    with torch.no_grad():
        phone_ids = torch.IntTensor(id_converter.tokens2ids(phones)).to(device)
        if lid is None:
            # No language ID: bypass the language-conditioned encoder
            pretrained_model.tts.use_encoder_w_lid = False
            output_dic = pretrained_model.tts.inference(text=phone_ids, **kwargs)
        else:
            pretrained_model.tts.use_encoder_w_lid = True
            lids = torch.tensor([lid]).to(device)
            if lid in lid2spk:
                # Condition on the x-vector of the corresponding CSS10 speaker
                spk = lid2spk[lid]
                spembs = torch.tensor(spk_xvectors[spk]).to(device).squeeze()
                pretrained_model.tts.spk_embed_dim = len(spembs)
                output_dic = pretrained_model.tts.inference(text=phone_ids, lids=lids, spembs=spembs, **kwargs)
            else:
                # No x-vector available for this language ID: disable the speaker embedding
                pretrained_model.tts.spk_embed_dim = None
                output_dic = pretrained_model.tts.inference(text=phone_ids, lids=lids, **kwargs)
        mel = output_dic['feat_gen']
        wav = vocoder.inference(mel, normalize_before=False).view(-1)
    os.makedirs(save_dir, exist_ok=True)
    if not filename:
        filename = f'{"".join(phones)}{lid}.wav'
        filename = filename.replace('<sos/eos>', '_')
    # save as PCM 16 bit wav file
    sf.write(
        save_dir / filename,
        wav.detach().cpu().numpy(),
        16000,
        "PCM_16",
    )
#%%
phones = 'p i t a <sos/eos>'.split()
save_wav(phones, lid=None)
for lid in lid2spk.keys():
    save_wav(phones, lid=lid)
Unfortunately, the results are bad no matter which language ID or x-vector I use (or none at all). Here is a simple 'p i t a' in all seven CSS10 languages: pita.zip
It seems like Transformer-TTS suffers from the usual problem of repeated and skipped phonemes.
Note: if you want to do anything else (for example, extracting x-vectors from your own dataset), you have to prepare the dataset according to the ESPnet recipes (in my case it was AISHELL-3). I copied egs2/aishell3 from ESPnet and modified data_prep.sh and tts.sh to only run on the test set.
./run.sh --stage 1 --stop-stage 1
Here, data/*_phn/text needs to be converted to IPA that the pretrained model can accept.
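Converting such a Kaldi-style text file is a token-by-token substitution. Here is a minimal sketch; the phone2ipa mapping and the file paths are placeholders you would replace with your own phone set:

# Hypothetical sketch: rewrite a Kaldi-style text file so that every phone is
# an IPA symbol the pretrained model knows. phone2ipa is a placeholder mapping.
phone2ipa = {"ni3": "n i", "hao3": "x a u"}  # example entries, not a real table

with open("data/test_phn/text") as fin, open("data/test_phn/text.ipa", "w") as fout:
    for line in fin:
        utt_id, *phones = line.strip().split()
        ipa = " ".join(phone2ipa.get(p, p) for p in phones)
        fout.write(f"{utt_id} {ipa}\n")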
Then you should be able to run decode_hifigan/run.sh on your dataset (changing the options according to your config).
Thank you so much for your great work, and sorry for the late reply. That said, the synthetic speech should be much better than this for the seven CSS10 languages. I noticed that you have set use_gst to False:
pretrained_model.tts.use_gst = False
However, in our training config we use GST, as shown below: https://github.com/Takaaki-Saeki/zm-text-tts/blob/master/egs2/masmultts/tts1/conf/tuning/train_gst%2Bxvector_transformer.yaml
Therefore, for inference with the pretrained model to work well, you might need to check that the inference setup matches the training config exactly.
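One quick sanity check along these lines is to print the GST and speaker-embedding settings from the training config before patching anything on the model. A minimal sketch, assuming the dumped config.yaml keeps the model arguments under a tts_conf key (the exact key names may differ in this recipe):

import yaml

# Inspect the settings the checkpoint was actually trained with
# (path and key names are assumptions based on the standard ESPnet2 layout).
with open("exp/tts_train_raw_phn_none/config.yaml") as f:
    train_cfg = yaml.safe_load(f)

tts_conf = train_cfg.get("tts_conf", {})
for key in ("use_gst", "spk_embed_dim"):
    print(key, "=", tts_conf.get(key))
# If use_gst is True here, forcing model.tts.use_gst = False at inference time
# creates a train/inference mismatch and can badly degrade the output.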
Thanks for your excellent work and for open-sourcing it! I have a few questions, as follows: