ZDisket / TensorVox

Desktop application for neural speech synthesis written in C++
MIT License

Using the models in Python #2

Closed · copperdong closed this issue 3 years ago

copperdong commented 3 years ago

Hello, thank you for your code. I tried to test the models from Python, but the generated wav file is wrong. Can you help me check my code? Thank you!

import soundfile as sf
import time
import tensorflow as tf
import matplotlib.pyplot as plt
from tensorflow_tts.inference import AutoProcessor

processor = AutoProcessor.from_pretrained(pretrained_path="./test/files/ljspeech_mapper.json")
input_text = "There’s a way to measure the acute emotional intelligence that has never gone out of style."
input_ids = processor.text_to_sequence(input_text)

# Load the exported SavedModels shipped with the TensorVox demo
fastspeech2 = tf.saved_model.load(r"temp\Win64DemoWithModel\LJ\melgen")
mb_melgan = tf.saved_model.load(r"temp\Win64DemoWithModel\LJ\vocoder")
print("fs2")

start = time.time()
mel_before, mel_after, duration_outputs, _, _ = fastspeech2.inference(
    input_ids=tf.expand_dims(tf.convert_to_tensor(input_ids, dtype=tf.int32), 0),
    speaker_ids=tf.convert_to_tensor([0], dtype=tf.int32),
    speed_ratios=tf.convert_to_tensor([1.0], dtype=tf.float32),
    f0_ratios=tf.convert_to_tensor([1.0], dtype=tf.float32),
    energy_ratios=tf.convert_to_tensor([1.0], dtype=tf.float32)
)
print(mel_before.shape) # (1, 345, 80)
print("fs:", time.time() - start)

# Vocode both mel outputs; MB-MelGAN returns audio shaped (batch, samples, 1)
audios = mb_melgan.inference(mel_before)
audio_after = mb_melgan.inference(mel_after)
print("total:", time.time() - start)  # fs2 inference plus vocoding
print(audios.shape)
sf.write('./mel_before2.wav', audios[0, :, 0], 22050, "PCM_16")
sf.write('./mel_after2.wav', audio_after[0, :, 0], 22050, "PCM_16")
plt.plot(audios[0, :, 0])

plt.show()
ZDisket commented 3 years ago

@copperdong For Python usage, open an issue in the TensorFlowTTS repo: https://github.com/TensorSpeech/TensorFlowTTS

Also, my FS2 model uses phonemes, so using a text-only processor is wrong. If you're using my fork of TensorFlowTTS, this is how you do it:

text = "There’s a way to measure the acute emotional intelligence that has never gone out of style"
proc = LJSpeechProcessor(None, "english_cleaners")
ids, arpatxt = proc.processtxtph(text)

Feed the ids into the model like this:

input_ids=tf.expand_dims(tf.convert_to_tensor(ids, dtype=tf.int32), 0)
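
Putting both pieces together, a minimal end-to-end sketch could look like the following. The LJSpeechProcessor import path is an assumption and may differ in the fork; the model paths are taken from the script above.

import tensorflow as tf
import soundfile as sf
# Import path assumed; it may be different in ZDisket's TensorFlowTTS fork
from tensorflow_tts.processor import LJSpeechProcessor

# Phonemize the text instead of feeding raw character ids
proc = LJSpeechProcessor(None, "english_cleaners")
ids, arpatxt = proc.processtxtph(
    "There’s a way to measure the acute emotional intelligence "
    "that has never gone out of style"
)

# Same exported SavedModels as in the script above
fastspeech2 = tf.saved_model.load(r"temp\Win64DemoWithModel\LJ\melgen")
mb_melgan = tf.saved_model.load(r"temp\Win64DemoWithModel\LJ\vocoder")

_, mel_after, _, _, _ = fastspeech2.inference(
    input_ids=tf.expand_dims(tf.convert_to_tensor(ids, dtype=tf.int32), 0),
    speaker_ids=tf.convert_to_tensor([0], dtype=tf.int32),
    speed_ratios=tf.convert_to_tensor([1.0], dtype=tf.float32),
    f0_ratios=tf.convert_to_tensor([1.0], dtype=tf.float32),
    energy_ratios=tf.convert_to_tensor([1.0], dtype=tf.float32),
)

audio = mb_melgan.inference(mel_after)  # (1, samples, 1)
sf.write("output.wav", audio[0, :, 0], 22050, "PCM_16")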

copperdong commented 3 years ago

Thank you! I found that memory keeps growing after every inference, so I made the following change in FastSpeech2.cpp:

bool FastSpeech2::Initialize(const std::string & SavedModelFolder) {
    try {
        // Serialized tf.ConfigProto limiting the session to one inter-op
        // and one intra-op thread (see the Python snippet below)
        std::vector<uint8_t> config = {0x10, 0x1, 0x28, 0x1};
        FastSpeech = new Model(SavedModelFolder, config);
    }
    catch (...) {
        FastSpeech = nullptr;
        return false;
    }
    return true;
}

The corresponding Python code that produces those config bytes is:

import tensorflow as tf

# Serialize a ConfigProto limiting both thread pools to 1; the bytes
# printed here are the ones hard-coded in FastSpeech2.cpp above
config = tf.compat.v1.ConfigProto(inter_op_parallelism_threads=1, intra_op_parallelism_threads=1)
serialized = config.SerializeToString()
print(list(map(hex, serialized)))  # ['0x10', '0x1', '0x28', '0x1']
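
As a sanity check, the bytes can also be parsed back into a ConfigProto to confirm what they encode (illustrative only; ParseFromString is the standard protobuf message API):

import tensorflow as tf

# Parse the hard-coded bytes back and confirm the two thread limits
cfg = tf.compat.v1.ConfigProto()
cfg.ParseFromString(bytes([0x10, 0x1, 0x28, 0x1]))
print(cfg.intra_op_parallelism_threads)  # 1 (field 2, tag 0x10)
print(cfg.inter_op_parallelism_threads)  # 1 (field 5, tag 0x28)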
ZDisket commented 3 years ago

@copperdong That's interesting. I'll check it out.