Takaaki-Saeki / zm-text-tts

[IJCAI'23] Learning to Speak from Text for Low-Resource TTS
Apache License 2.0

Can you provide the pretrained tts model checkpoint? #1

Closed. CindyTing closed this issue 1 year ago.

CindyTing commented 1 year ago

Thanks for your excellent work and for open-sourcing it! I have a few questions:

  1. Since it may take days or weeks to pre-train the model again, can you provide the pre-trained TTS model checkpoint? That would make it much easier for us to reproduce your results. Thank you very much in advance!
  2. You are using x-vectors to extract the speaker embedding for every language. Since the x-vector extractor is pre-trained only on English, this may hurt performance. Do you have any advice on improving this part, e.g., a recommendation for a multilingual speaker-embedding extractor?
Takaaki-Saeki commented 1 year ago

Hi, thanks for your interest in our work.

  1. I have uploaded the pretrained model checkpoints to Hugging Face: Byte-based model and IPA-based model. These models are trained with the settings described in Sections 3.2 and 3.3 of the paper, so you can use the seven languages as the seen languages and Spanish as the unseen language. Sorry that the byte-based Hugging Face repo shows odd security warnings on the checkpoint, but I believe it is fine.
  2. Yes, we can expect to improve the performance by using x-vectors trained on multilingual speech corpora. In addition, introducing adversarial learning to disentangle language features and speaker characteristics would be an interesting approach.
CindyTing commented 1 year ago

Hi, thanks very much for your reply, and sorry for the late response; I was trying to figure it out myself. I downloaded the checkpoint and tried to skip the pre-training and training stages, but that turned out to be impossible, because the pre-training stage is needed to generate the "tts_pretrain_1/dump" and "tts_pretrain_1/data" directories. I tried to pre-train, but after 5 days it still hadn't finished. Is it possible for you to supply these two as well? Thanks again!

Takaaki-Saeki commented 1 year ago

Hi, sorry for the delay. You can generate tts_pretrain_1/dump and tts_pretrain_1/data by running only the preprocessing without training as:

$ ./run.sh --stage 1 --stop-stage 4
iamanigeeit commented 1 year ago

Hello,

  1. I have uploaded the pretrained model checkpoints to Hugging Face: Byte-based model and IPA-based model. These models are trained with the settings …

I am trying to follow the instructions on Hugging Face, as I only want to use the pretrained model for inference.

git checkout 11a7d61312439111d4996d55935ede718d494262

fails with fatal: reference is not a tree: 11a7d61312439111d4996d55935ede718d494262

I did git clone https://github.com/Takaaki-Saeki/zm-text-tts instead

cd egs2/masmultts/tts_phn_css10_adap_residual_freeze

This path does not exist. Do you mean egs2/masmultts/tts1?

./run.sh --skip_data_prep false --skip_train true --download_model saefro991/tts_ipa_css10_7lang_textpretrain_residual_freeze

This causes an error:

2023-10-12T00:45:11 (data.sh:85:main) stage 0: local/data_prep.py
Processing CSS10 ...
Traceback (most recent call last):
  File "/home/perry/PycharmProjects/zm-text-tts/egs2/masmultts/tts1/local/data_prep.py", line 334, in <module>
    main()
  File "/home/perry/PycharmProjects/zm-text-tts/egs2/masmultts/tts1/local/data_prep.py", line 316, in main
    DataProcessor(
  File "/home/perry/PycharmProjects/zm-text-tts/egs2/masmultts/tts1/local/data_prep.py", line 112, in __init__
    with open(tsv_path_norm, "r") as fr:
FileNotFoundError: [Errno 2] No such file or directory: 'MasMulTTS/css10.tsv'
iamanigeeit commented 1 year ago

@Takaaki-Saeki @CindyTing

To simply run inference with the pretrained model on a different dataset, these are my steps:

git clone https://github.com/Takaaki-Saeki/zm-text-tts
# Install espnet
cd zm-text-tts/tools
./setup_anaconda.sh ${CONDA_PREFIX} zm-text-tts 3.10
conda activate zm-text-tts
make TH_VERSION=1.13.1 CUDA_VERSION=11.7
cd ..
pip install -e .
# Download the pretrained model
cd egs2
git clone https://huggingface.co/saefro991/tts_ipa_css10_7lang_textpretrain_residual_freeze
# Move the model so that the dump and exp folders are together with conf, scripts, utils etc in standard espnet format
mv tts_ipa_css10_7lang_textpretrain_residual_freeze/* tts1
cd tts1

If your language has IPA symbols that are not in exp/tts_train_raw_phn_none/config.yaml, you have to find substitutes (see the sketch below for checking which symbols are missing). In my case, dealing with Mandarin data, I had to create a custom pinyin-to-IPA converter. Also, the provided config.yaml has extra keys: comment out lang_family_encoding and num_lang_family, otherwise you cannot load the pretrained model.
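
To find which symbols need substitutes, you can compare your phone inventory against the model's token list. This is only a minimal sketch, assuming the usual ESPnet-2 layout where token_list is stored in the dumped config.yaml; my_phones and the substitutes table are hypothetical placeholders for your own data.

import yaml

# Hypothetical phones produced by your own (e.g. pinyin-to-IPA) converter
my_phones = ['p', 'i', 'tʰ', 'a']

# Assumes the ESPnet-2 convention of dumping token_list into the training config
with open('exp/tts_train_raw_phn_none/config.yaml') as f:
    token_list = set(yaml.safe_load(f)['token_list'])

missing = [p for p in my_phones if p not in token_list]
print('Symbols needing substitutes:', missing)

# Hand-picked replacements using the closest tokens the model does know (illustrative only)
substitutes = {'tʰ': 't'}
my_phones = [substitutes.get(p, p) for p in my_phones]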

Copy or symlink masmultts/tts1/exp to your own dataset/tts1. Install the vocoder packages:

git clone https://github.com/kan-bayashi/ParallelWaveGAN.git
cd ParallelWaveGAN
pip install -e .

I wanted to batch-run on GPU, so I used the notebook code below to test. Simply define a list of IPA symbols to input and call save_wav(phones, lid=lid).

import os
os.environ['CUDA_LAUNCH_BLOCKING'] = "1"
from pathlib import Path

import soundfile as sf
import torch
from espnet2.text.token_id_converter import TokenIDConverter
from parallel_wavegan.utils import load_model as load_vocoder
from espnet2.bin.tts_inference import Text2Speech

PWD = %pwd
AISHELL_DIR = Path(PWD)
device = 'cuda'
#%%
corpus_dir = AISHELL_DIR
cwd = os.getcwd()
os.chdir(corpus_dir)
pretrained_dir = corpus_dir / "exp/tts_train_raw_phn_none"
pretrained_model_file = pretrained_dir / "latest.pth"
pretrained_tts = Text2Speech.from_pretrained(
    train_config=pretrained_dir / 'config.yaml',
    model_file=pretrained_model_file,
    device=device
)
pretrained_model = pretrained_tts.model
os.chdir(cwd)
#%%
vocoder_ckpt = '/home/perry/PycharmProjects/vocoders/hifigan16k_libritts_css10_vctk/checkpoint-2000000steps.pkl'
vocoder = load_vocoder(vocoder_ckpt)
vocoder.remove_weight_norm()
vocoder = vocoder.eval().to(device)
#%%
id_converter = TokenIDConverter(pretrained_tts.train_args.token_list)
save_dir = Path('/home/perry/PycharmProjects/present/prosody/outputs/zm-text-tts/aishell3')
#%%
# Disable GST at inference time (no reference audio is used here)
pretrained_model.tts.use_gst = False
#%%
from kaldiio import ReadHelper
# Load speaker x-vectors from a Kaldi ark file; paths are relative to tts_dir
def read_xvectors(tts_dir, filename):
    filetype = filename.split('.')[-1]
    pwd = os.getcwd()
    os.chdir(tts_dir)
    with ReadHelper(f'{filetype}:{filename}') as reader:
        xvector_dict = {i: xvector for i, xvector in reader}
    os.chdir(pwd)
    return xvector_dict
#%%
spk_xvectors = read_xvectors(str(pretrained_dir.parent), "dump/xvector/test/spk_xvector.ark")
#%%
# Language IDs of the seven seen CSS10 languages -> speaker keys in the x-vector dict
lid2spk = {
    2: 'css10_de', 3: 'css10_el', 8: 'css10_fi', 9: 'css10_fr', 11: 'css10_hu', 14: 'css10_nl', 17: 'css10_ru',
}
#%%
# Synthesize a wav from a list of IPA tokens, optionally conditioned on language ID and speaker x-vector
def save_wav(phones, filename='', lid=None, **kwargs):
    with torch.no_grad():
        phone_ids = torch.IntTensor(id_converter.tokens2ids(phones)).to(device)
        if lid is None:
            pretrained_model.tts.use_encoder_w_lid = False
            output_dic = pretrained_model.tts.inference(text=phone_ids, **kwargs)
        else:
            pretrained_model.tts.use_encoder_w_lid = True
            lids = torch.tensor([lid]).to(device)
            if lid in lid2spk:
                spk = lid2spk[lid]
                spembs = torch.tensor(spk_xvectors[spk]).to(device).squeeze()
                pretrained_model.tts.spk_embed_dim = len(spembs)
                output_dic = pretrained_model.tts.inference(text=phone_ids, lids=lids, spembs=spembs, **kwargs)
            else:
                pretrained_model.tts.spk_embed_dim = None
                output_dic = pretrained_model.tts.inference(text=phone_ids, lids=lids, **kwargs)
        mel = output_dic['feat_gen']
        wav = vocoder.inference(mel, normalize_before=False).view(-1)
    os.makedirs(save_dir, exist_ok=True)
    if not filename:
        filename = f'{"".join(phones)}{lid}.wav'
        filename = filename.replace('<sos/eos>', '_')
    # save as PCM 16 bit wav file
    sf.write(
        save_dir / filename,
        wav.detach().cpu().numpy(),
        16000,
        "PCM_16",
    )
#%%
phones = 'p i t a <sos/eos>'.split()
save_wav(phones, lid=None)
for lid in lid2spk.keys():
    save_wav(phones, lid=lid)

Unfortunately the results are bad no matter which language ID or x-vector I use (or None). Here's a simple 'p i t a' in all seven CSS10 languages: pita.zip

It seems Transformer-TTS suffers from the phoneme repetition and skipping problem.

Note: if you want to do anything else (for example, extracting x-vectors from your own dataset), you have to prepare the datasets according to the ESPnet recipes (in my case, AISHELL-3). I copied egs2/aishell3 from espnet and modified data_prep.sh and tts.sh to only run on the test set.

./run.sh --stage 1 --stop-stage 1

Here, data/*_phn/text needs to be converted to IPA symbols that the pretrained model can accept (a rough sketch follows below).
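
One straightforward way to do the conversion is to rewrite the Kaldi-style text file, keeping each utterance ID and replacing the tokens with IPA. A minimal sketch, assuming a convert_to_ipa() function you supply yourself and the hypothetical path data/test_phn/text:

from pathlib import Path

def convert_to_ipa(tokens):
    # Placeholder: plug in your own grapheme/pinyin-to-IPA conversion here
    return tokens

text_path = Path('data/test_phn/text')  # hypothetical path for your test set
lines_out = []
for line in text_path.read_text(encoding='utf-8').splitlines():
    utt_id, *tokens = line.split()  # Kaldi format: "<utt_id> tok1 tok2 ..."
    lines_out.append(' '.join([utt_id] + convert_to_ipa(tokens)))
text_path.write_text('\n'.join(lines_out) + '\n', encoding='utf-8')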

Then you should be able to run decode_hifigan/run.sh on your dataset (changing the options according to your config).

Takaaki-Saeki commented 1 year ago

Thank you so much for your great work, and sorry for the late reply. That said, the synthetic speech should be much better than this for the seven CSS10 languages. I noticed that you have set use_gst to False:

pretrained_model.tts.use_gst = False

However, our training config uses GST, as shown here: https://github.com/Takaaki-Saeki/zm-text-tts/blob/master/egs2/masmultts/tts1/conf/tuning/train_gst%2Bxvector_transformer.yaml

Therefore, for inference with the pretrained model to work well, you might need to check that the model configuration matches the training config exactly (a quick check is sketched below).
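
One way to catch such mismatches before synthesizing is to compare any flags toggled at inference time against the tts_conf section of the dumped training config. This is only a rough sketch, assuming the ESPnet-2 config layout used in the notebook above and that tts_conf stores these keys; extend the key list to whatever you override.

import yaml

# Assumes the dumped training config contains a tts_conf section with these keys
with open('exp/tts_train_raw_phn_none/config.yaml') as f:
    tts_conf = yaml.safe_load(f)['tts_conf']

# Flags overridden in the notebook above; add any others you change
for key in ('use_gst', 'spk_embed_dim'):
    trained = tts_conf.get(key)
    current = getattr(pretrained_model.tts, key, None)
    if trained != current:
        print(f'Mismatch for {key}: trained with {trained}, running with {current}')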