TensorSpeech / TensorFlowTTS

:stuck_out_tongue_closed_eyes: TensorFlowTTS: Real-Time State-of-the-art Speech Synthesis for Tensorflow 2 (supported including English, French, Korean, Chinese, German and Easy to adapt for other languages)
https://tensorspeech.github.io/TensorFlowTTS/
Apache License 2.0
3.84k stars 815 forks

Inference on MB MelGAN sounds great until testing on iOS #795

Closed jakerb closed 8 months ago

jakerb commented 1 year ago

Hi all, I've been training a custom voice model named alice. I'm new to Python, never mind TensorFlowTTS, so please critique anything below! I want to get it right.

The problem I'm facing is that the model I'm training sounds OK in training, but on the device it doesn't sound anything like my model.

1. Training MB Melgan

I'm using the train_multiband_melgan.py script provided in examples, running the following command. Training has reached 90K steps so far:

CUDA_VISIBLE_DEVICES=0 python ./examples/multiband_melgan/train_multiband_melgan.py \
  --train-dir /workspace/content/tfvoice/dump_alice/train/ \
  --dev-dir /workspace/content/tfvoice/dump_alice/valid/ \
  --outdir /workspace/content/tfvoice/out/ \
  --config /workspace/content/tfvoice/dump_alice/out/config.yml \
  --use-norm 1 \
  --resume /workspace/content/tfvoice/out/checkpoints/ckpt-90000

2. Exporting MB Melgan to TFLite

When I decode the model using decode_mb_melgan.py, the generated audio clips sound OK (with some odd artefacts, which I assume will improve as I continue training). I then export the model to TFLite using the following:

import tensorflow as tf

import yaml
import numpy as np
import matplotlib.pyplot as plt

import IPython.display as ipd

from tensorflow_tts.inference import TFAutoModel
from tensorflow_tts.inference import AutoConfig
from tensorflow_tts.inference import AutoProcessor

# 2.load mb_melgan .h5 model
mb_melgan_config = AutoConfig.from_pretrained('/workspace/content/tfvoice/out/checkpoints/config.yml')
mb_melgan = TFAutoModel.from_pretrained(
    config=mb_melgan_config,
    pretrained_path="/workspace/content/tfvoice/out/checkpoints/generator-90000.h5",
    name="mb_melgan"
)
# 3.h5 convert to tflite
def convert_to_tflite(model, name="model", quantize=True):
    # Trace the model's TFLite inference entry point into a concrete function.
    concrete_function = model.inference_tflite.get_concrete_function()
    converter = tf.lite.TFLiteConverter.from_concrete_functions(
        [concrete_function]
    )
    converter.optimizations = []
    converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS,
                                           tf.lite.OpsSet.SELECT_TF_OPS]
    if not quantize:
        converter.target_spec.supported_types = [tf.float16]
    tflite_model = converter.convert()
    # Append '_quan' to the file name when quantize is set.
    suffix = '_quan' if quantize else ''
    saved_path = '/workspace/content/tfvoice/export/v2/' + name + suffix + '.tflite'

    # Save the TF Lite model.
    with open(saved_path, 'wb') as f:
        f.write(tflite_model)
    print('Model: %s size is %f MBs.' % (name, len(tflite_model) / 1024 / 1024.0))
    print(saved_path)
    return saved_path

# 4. do h5 convert to tflite    
mb_melgan_tflite_path = convert_to_tflite(mb_melgan, "mb_melgan", quantize=False)
mb_melgan_tflite_quant_path = convert_to_tflite(mb_melgan, "mb_melgan", quantize=True)
print("done!")

This produces two files: mb_melgan.tflite and mb_melgan_quan.tflite.
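
A desktop check of the exported file may help narrow down where the sound changes. The sketch below is an outline under stated assumptions rather than part of the TensorFlowTTS examples: `run_tflite_vocoder` and `max_abs_diff` are hypothetical helper names, the mel input is assumed to have shape [1, frames, 80], and NumPy and TensorFlow are imported inside the function. It runs one mel-spectrogram through the .tflite file with tf.lite.Interpreter so the result can be compared with the output of the .h5 generator:

```python
def max_abs_diff(reference, candidate):
    """Largest absolute sample difference over the overlapping length."""
    return max(abs(r - c) for r, c in zip(reference, candidate))


def run_tflite_vocoder(model_path, mel):
    """Run one mel-spectrogram through a TFLite vocoder and return its output.

    `mel` is assumed to be array-like with shape [1, frames, 80].
    """
    import numpy as np
    import tensorflow as tf

    mel = np.asarray(mel, dtype=np.float32)
    interpreter = tf.lite.Interpreter(model_path=model_path)
    input_index = interpreter.get_input_details()[0]["index"]
    # The time axis is dynamic, so resize the input before allocating tensors.
    interpreter.resize_tensor_input(input_index, list(mel.shape))
    interpreter.allocate_tensors()
    interpreter.set_tensor(input_index, mel)
    interpreter.invoke()
    output_index = interpreter.get_output_details()[0]["index"]
    return interpreter.get_tensor(output_index)
```

If the flattened TFLite waveform and the .h5 waveform for the same mel are close according to `max_abs_diff`, the export is probably not the step at fault, and the remaining difference is more likely to come from the app side.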

3. Testing MB Melgan in iOS

Using the iOS "FastSpeech2 and MB MelGAN models on iOS" demo app, I build the Xcode project (after downloading the LJSpeech models from the model URL provided in the README).

I replace the mb_melgan.tflite file with my own from step 2 of this issue and run that on the device. The playback on the device sounds just like the LJSpeech model and nothing like the model I've trained.

I am confident I'm doing something wrong in one or more of these steps (or all of them), so any help you can offer would be greatly appreciated. I've been working on this for the past few months and feel like I'm hitting a brick wall.

w11wo commented 1 year ago

Hi @jakerb,

I think you will have to change the list of converter.target_spec.supported_ops to only include tf.lite.OpsSet.SELECT_TF_OPS, i.e. remove tf.lite.OpsSet.TFLITE_BUILTINS, when exporting MB-MelGAN.

I forgot where I read this, but applying TFLITE_BUILTINS will slightly decrease the precision of the model, which leads to poor synthesized audio quality. The loss is still acceptable when applied to FastSpeech2, though.
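
As a rough sketch of that change (not a tested export script), the op-set selection could be pulled into a small helper. The names `select_op_names` and `convert_mb_melgan` are made up for illustration, `model` is assumed to expose `inference_tflite` as in the export script above, and TensorFlow is imported inside the function so the helper can be inspected without it:

```python
def select_op_names(include_builtins=False):
    """Return names of tf.lite.OpsSet members to enable (hypothetical helper)."""
    names = ["SELECT_TF_OPS"]
    if include_builtins:
        # The original export script enabled both op sets.
        names.insert(0, "TFLITE_BUILTINS")
    return names


def convert_mb_melgan(model, include_builtins=False):
    """Convert a model exposing `inference_tflite` to TFLite bytes (sketch)."""
    import tensorflow as tf  # imported lazily; requires TensorFlow 2.x

    concrete_function = model.inference_tflite.get_concrete_function()
    converter = tf.lite.TFLiteConverter.from_concrete_functions(
        [concrete_function]
    )
    # Map the chosen names onto the real tf.lite.OpsSet enum members.
    converter.target_spec.supported_ops = [
        getattr(tf.lite.OpsSet, name)
        for name in select_op_names(include_builtins)
    ]
    return converter.convert()
```

Calling `convert_mb_melgan(mb_melgan)` would return the model bytes to write to mb_melgan.tflite. Whether this alone resolves the on-device audio is something to confirm by listening.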

Hope this fixes your issue!

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
