PaddlePaddle / Paddle

PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (the PaddlePaddle core framework: high-performance single-machine and distributed training and cross-platform deployment for deep learning and machine learning)
http://www.paddlepaddle.org/
Apache License 2.0

How to load pre-trained model and run inference? #23509

Closed (qo4on closed 4 years ago)

qo4on commented 4 years ago

Hi! Parakeet looks very promising, but I can't find a working example. I'm trying to run inference with the WaveFlow model. How can I load this model and feed a mel spectrogram to it?

sandyhouse commented 4 years ago

Please refer to the following doc for how to deploy inference model: https://www.paddlepaddle.org.cn/documentation/docs/en/advanced_guide/inference_deployment/index_en.html

qo4on commented 4 years ago

Thanks for your answer. Unfortunately I still can't find the code that loads the model. There is a C++ example, but I need the same for Python. For example, with TensorFlow I have a checkpoint:

ls {checkpoint_dir}
checkpoint           cp.ckpt.data-00001-of-00002
cp.ckpt.data-00000-of-00002  cp.ckpt.index

I can create a model, load weights and run inference:

model = create_model()
model.load_weights(checkpoint_path)
predictions = model(test_input, training=False)

Using PaddlePaddle, I have a checkpoint:

step-2000000.pdparams
waveflow_ljspeech.yaml

What is the workflow for .pdparams and .yaml files?

sandyhouse commented 4 years ago

I see. You can use load_inference_model to load your model. For more details, please refer to https://www.paddlepaddle.org.cn/documentation/docs/en/api/io/load_inference_model.html
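
For illustration, here is a minimal Python sketch of that API (the directory path and input shape are hypothetical; it assumes the directory was produced by fluid.io.save_inference_model and therefore contains a __model__ file):

import numpy as np
import paddle.fluid as fluid

place = fluid.CPUPlace()
exe = fluid.Executor(place)

# Directory written earlier by fluid.io.save_inference_model (hypothetical path).
path = "./inference_model"
[inference_program, feed_target_names, fetch_targets] = (
    fluid.io.load_inference_model(dirname=path, executor=exe))

# Feed a dummy input matching the model's expected input shape (hypothetical shape).
data = np.random.rand(1, 80, 100).astype("float32")
results = exe.run(inference_program,
                  feed={feed_target_names[0]: data},
                  fetch_list=fetch_targets)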

qo4on commented 4 years ago

The code from this link creates three files:

(screenshot of the three generated files)

And when loading my checkpoint it looks for __model__ and throws an error because I don't have it:

path = "/downloads/waveflow_res128_ljspeech_ckpt_1.0"
[inference_program, feed_target_names, fetch_targets] = (
    fluid.io.load_inference_model(dirname=path, executor=exe))

FileNotFoundError: [Errno 2] No such file or directory: '/downloads/waveflow_res128_ljspeech_ckpt_1.0/__model__'

How can I create a model from these two files?

step-2000000.pdparams
waveflow_ljspeech.yaml

sandyhouse commented 4 years ago

How did you get the pre-trained model? If you saved the model with one of the APIs save_persistables, save_params, or save_vars, then you can load it using the corresponding load_persistables, load_params, or load_vars.
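
A minimal sketch of how those save/load APIs pair up in a static-graph program (the directory name here is hypothetical):

import paddle.fluid as fluid

place = fluid.CPUPlace()
exe = fluid.Executor(place)
exe.run(fluid.default_startup_program())

# Save the trainable parameters of the default main program (hypothetical dir).
fluid.io.save_params(executor=exe, dirname="./my_params",
                     main_program=fluid.default_main_program())

# Later, load them back into the same program definition.
fluid.io.load_params(executor=exe, dirname="./my_params",
                     main_program=fluid.default_main_program())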

qo4on commented 4 years ago

It is PaddlePaddle's official pretrained model. They didn't say how it was saved.

sandyhouse commented 4 years ago

Please use load_parameters. For example:

(screenshot of example code)

qo4on commented 4 years ago

I have iteration=0:

config = {
 'batch_size': 8,
 'fft_size': 1024,
 'fft_window_shift': 256,
 'fft_window_size': 1024,
 'kernel_h': 3,
 'kernel_w': 3,
 'learning_rate': 0.0002,
 'max_iterations': 3000000,
 'mel_bands': 80,
 'mel_fmax': 8000.0,
 'mel_fmin': 0.0,
 'n_channels': 64,
 'n_flows': 8,
 'n_group': 16,
 'n_layers': 8,
 'root': './data/LJSpeech-1.1',
 'sample_rate': 22050,
 'save_every': 10000,
 'seed': 1234,
 'segment_length': 16000,
 'sigma': 1.0,
 'test_every': 2000,
 'use_fp16': False,
 'use_gpu': True,
 'valid_size': 16}

class Config():
  def __init__(self, **entries):
    self.__dict__.update(entries)

config = Config(**config)

from parakeet.models.waveflow import WaveFlowModule
from parakeet.utils import io

model = WaveFlowModule(config)
iteration = io.load_parameters(model, checkpoint_dir="/downloads/waveflow_res128_ljspeech_ckpt_1.0")
iteration
0

If I try checkpoint_path I have an error:

iteration = io.load_parameters(model, checkpoint_path="/downloads/waveflow_res128_ljspeech_ckpt_1.0/step-2000000.pdparams")

ValueError: invalid literal for int() with base 10: '2000000.pdparams'

sandyhouse commented 4 years ago

You only need to provide the base name of the parameter file, i.e. step-2000000; the extension (.pdparams or .pdopt) is not needed.
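
That is, for the checkpoint above the call would look like this:

iteration = io.load_parameters(
    model,
    checkpoint_path="/downloads/waveflow_res128_ljspeech_ckpt_1.0/step-2000000")  # no .pdparams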

sandyhouse commented 4 years ago

Closing this issue. If you have any further questions, please reopen it.

qo4on commented 4 years ago

I still can't make it work. Suppose I have a mel spectrogram (the code below can be copy-pasted into Colab):

!wget -qqq https://soundfrancisco.com/wp-content/uploads/2019/09/icarfactory.mp3 > /dev/null
audio_pth = "/content/icarfactory.mp3"
import IPython, librosa, librosa.display
import numpy as np
import matplotlib.pyplot as plt
y, sr = librosa.load(audio_pth)
# trim silent edges
audio, _ = librosa.effects.trim(y)
librosa.display.waveplot(audio, sr=sr)
IPython.display.Audio(audio_pth)
n_mels = 80
n_fft = 1024
hop_length = 256
S = librosa.feature.melspectrogram(audio, sr=sr, n_fft=n_fft, 
                                   hop_length=hop_length, 
                                   n_mels=n_mels)
S_DB = librosa.power_to_db(S, ref=np.max)
librosa.display.specshow(S_DB, sr=sr, hop_length=hop_length, 
                         x_axis='time', y_axis='mel');
plt.colorbar(format='%+2.0f dB');
S.shape

(80, 679)

Then I load the model:

config = {
 'batch_size': 8,
 'checkpoint': None,
 'checkpoint_dir': '/content/downloads/waveflow_res128_ljspeech_ckpt_1.0',
 'fft_size': 1024,
 'fft_window_shift': 256,
 'fft_window_size': 1024,
 'iteration': None,
 'kernel_h': 3,
 'kernel_w': 3,
 'learning_rate': 0.0002,
 'max_iterations': 3000000,
 'mel_bands': 80,
 'mel_fmax': 8000.0,
 'mel_fmin': 0.0,
 'n_channels': 64,
 'n_flows': 8,
 'n_group': 16,
 'n_layers': 8,
 'name': '',
 'output': './syn_audios',
 'sample': 0,
 'sample_rate': 22050,
 'save_every': 10000,
 'seed': 1234,
 'segment_length': 16000,
 'sigma': 1.0,
 'test_every': 2000,
 'use_fp16': True,
 'use_gpu': True,
 'valid_size': 16}

class Config():
  def __init__(self, **entries):
    self.__dict__.update(entries)

config = Config(**config)

from parakeet.models.waveflow import WaveFlowModule
from parakeet.utils import io

model = WaveFlowModule(config)
iteration = io.load_parameters(model, checkpoint_dir="/downloads/waveflow_res128_ljspeech_ckpt_1.0")
iteration

/usr/local/lib/python3.6/dist-packages/paddle/fluid/data_feeder.py:93: UserWarning: The data type of 'input' in assign only support float16 in GPU now. (When the type of input in assign is Variable.)
  (input_name, op_name, extra_message))
/usr/local/lib/python3.6/dist-packages/paddle/fluid/data_feeder.py:93: UserWarning: The data type of 'x' in cast only support float16 in GPU now. 
  (input_name, op_name, extra_message))
0

I'm not sure, but does iteration=0 mean that the pretrained model was not loaded? Then I add a batch dimension to spectrogram S and synthesize:

model.synthesize(np.expand_dims(S, axis=0))

TypeError                                 Traceback (most recent call last)
<ipython-input-134-f6905ea98f56> in <module>()
----> 1 model.synthesize(np.expand_dims(S, axis=0))

3 frames
/content/Parakeet/parakeet/models/waveflow/waveflow_modules.py in synthesize(self, mel, sigma)
    406         """
    407         if self.dtype == "float16":
--> 408             mel = fluid.layers.cast(mel, self.dtype)
    409         mel = self.conditioner.infer(mel)
    410         # From [bs, mel_bands, time] to [bs, mel_bands, n_group, time/n_group]

/usr/local/lib/python3.6/dist-packages/paddle/fluid/layers/tensor.py in cast(x, dtype)
    196         x, 'x', Variable,
    197         ['bool', 'float16', 'float32', 'float64', 'int32', 'int64', 'uint8'],
--> 198         'cast')
    199     out = helper.create_variable_for_type_inference(dtype=dtype)
    200     helper.append_op(

/usr/local/lib/python3.6/dist-packages/paddle/fluid/data_feeder.py in check_type_and_dtype(input, input_name, expected_type, expected_dtype, op_name, extra_message)
     72                          op_name,
     73                          extra_message=''):
---> 74     check_type(input, input_name, expected_type, op_name, extra_message)
     75     check_dtype(input.dtype, input_name, expected_dtype, op_name, extra_message)
     76 

/usr/local/lib/python3.6/dist-packages/paddle/fluid/data_feeder.py in check_type(input, input_name, expected_type, op_name, extra_message)
     80         raise TypeError(
     81             "The type of '%s' in %s must be %s, but received %s. %s" %
---> 82             (input_name, op_name, expected_type, type(input), extra_message))
     83 
     84 

TypeError: The type of 'x' in cast must be <class 'paddle.fluid.framework.Variable'>, but received <class 'numpy.ndarray'>. 

Do you know how I can fix this error?

sandyhouse commented 4 years ago

Please note that io.load_params has no return value. The 'x' argument (i.e. mel) passed to cast must be a Paddle Variable instead of a numpy array. You can use paddle.fluid.dygraph.to_variable to convert a numpy array to a Paddle Variable.
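
For example, a sketch under the setup above (S is the (80, 679) numpy spectrogram from the earlier snippet; synthesize handles the float16 cast internally when the model uses fp16):

import numpy as np
import paddle.fluid.dygraph as dg

# Wrap the numpy mel spectrogram in a Paddle Variable before synthesis.
mel_var = dg.to_variable(np.expand_dims(S, axis=0).astype("float32"))
audio = model.synthesize(mel_var)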

qo4on commented 4 years ago

Thank you. I managed to run it with no errors, but the output file contains only noise. Do you know what's wrong? This code is based on the official example:

from parakeet.models.waveflow import waveflow_modules
from parakeet.utils import io
import paddle.fluid.dygraph as dg
from paddle import fluid
from scipy.io.wavfile import write
from ruamel import yaml
import random

class WaveFlow(waveflow_modules.WaveFlowModule):
  def __init__(self, config):
    super().__init__(config)

  @dg.no_grad
  def infer(self, mel):
    self.eval()
    print(mel.shape)
    audio = self.synthesize(mel)

    # Denormalize audio from [-1, 1] to [-32768, 32768] int16 range.
    audio = audio.numpy().astype("float32") * 32768.0
    audio = audio.astype('int16')
    filename = 'test.wav'
    print(audio.shape, 'audio.shape')
    write(filename, config.sample_rate, audio[0])

class Config():
  def __init__(self, **entries):
    self.__dict__.update(entries)

pth = "/content/Parakeet/examples/waveflow/configs/waveflow_ljspeech.yaml"
with open(pth) as f:
  config = yaml.load(f, Loader=yaml.Loader)

config['checkpoint'] = None
config['checkpoint_dir'] = "/content/downloads/waveflow_res128_ljspeech_ckpt_1.0"
config['iteration'] = None
config['name'] = ''
config['output'] = './syn_audios'
config['sample'] = 0
config['use_fp16'] = True
config['use_gpu'] = True

config = Config(**config)
print(config.__dict__)

place = fluid.CUDAPlace(0) if config.use_gpu else fluid.CPUPlace()
with dg.guard(place):
  # Fix random seed.
  seed = config.seed
  random.seed(seed)
  np.random.seed(seed)
  fluid.default_startup_program().random_seed = seed
  fluid.default_main_program().random_seed = seed

  model = WaveFlow(config)

  # Dry run once to create and initialize all necessary parameters.
  dtype = "float16" if config.use_fp16 else "float32"
  audio = dg.to_variable(np.random.randn(1, 16000).astype(dtype))
  mel = dg.to_variable(
    np.random.randn(1, config.mel_bands, 63).astype(dtype))
  model(audio, mel)

  iteration = io.load_parameters(model, checkpoint_dir=config.checkpoint_dir)
  print(S.shape, np.min(S), np.max(S))
  model.infer(dg.to_variable(np.expand_dims(S, axis=0)))

{'valid_size': 16, 'segment_length': 16000, 'sample_rate': 22050, 'fft_window_shift': 256, 'fft_window_size': 1024, 'fft_size': 1024, 'mel_bands': 80, 'mel_fmin': 0.0, 'mel_fmax': 8000.0, 'seed': 1234, 'learning_rate': 0.0002, 'batch_size': 8, 'test_every': 2000, 'save_every': 10000, 'max_iterations': 3000000, 'sigma': 1.0, 'n_flows': 8, 'n_group': 16, 'n_layers': 8, 'n_channels': 64, 'kernel_h': 3, 'kernel_w': 3, 'checkpoint': None, 'checkpoint_dir': '/content/downloads/waveflow_res128_ljspeech_ckpt_1.0', 'iteration': None, 'name': '', 'output': './syn_audios', 'sample': 0, 'use_fp16': True, 'use_gpu': True}
/usr/local/lib/python3.6/dist-packages/paddle/fluid/data_feeder.py:93: UserWarning: The data type of 'input' in assign only support float16 in GPU now. (When the type of input in assign is Variable.)
  (input_name, op_name, extra_message))
/usr/local/lib/python3.6/dist-packages/paddle/fluid/data_feeder.py:93: UserWarning: The data type of 'x' in cast only support float16 in GPU now. 
  (input_name, op_name, extra_message))
/usr/local/lib/python3.6/dist-packages/paddle/fluid/data_feeder.py:93: UserWarning: The data type of 'input' in squeeze only support float16 in GPU now. 
  (input_name, op_name, extra_message))
(80, 679) 5.492911396874237e-11 366.29962746001155
[1, 80, 679]
(1, 173552) audio.shape

qo4on commented 4 years ago

I found that there is a get_mel function in the code examples which does some normalization. I applied it here, but the output audio file is still noise. Please help.

!export CUDA_VISIBLE_DEVICES=0
from parakeet.models.waveflow import waveflow_modules
from parakeet.utils import io
import paddle.fluid.dygraph as dg
from paddle import fluid
from scipy.io.wavfile import write
from ruamel import yaml
import random, librosa

class WaveFlow(waveflow_modules.WaveFlowModule):
  def __init__(self, config):
    super().__init__(config)

  @dg.no_grad
  def infer(self, mel):
    self.eval()
    print(mel.shape, 'mel.shape')
    audio = self.synthesize(mel)

    # Denormalize audio from [-1, 1] to [-32768, 32768] int16 range.
    audio = audio.numpy().astype("float32") * 32768.0
    audio = audio.astype('int16')
    filename = 'test.wav'
    print(audio.shape, 'audio.shape')
    write(filename, config.sample_rate, audio[0])

class Config():
  def __init__(self, **entries):
    self.__dict__.update(entries)

def get_mel(audio):
  spectrogram = librosa.core.stft(
      audio,
      n_fft=config.fft_size,
      hop_length=config.fft_window_shift,
      win_length=config.fft_window_size)
  spectrogram_magnitude = np.abs(spectrogram)

  # mel_filter_bank shape: [n_mels, 1 + n_fft/2]
  mel_filter_bank = librosa.filters.mel(sr=config.sample_rate,
                                        n_fft=config.fft_size,
                                        n_mels=config.mel_bands,
                                        fmin=config.mel_fmin,
                                        fmax=config.mel_fmax)
  # mel shape: [n_mels, num_frames]
  mel = np.dot(mel_filter_bank, spectrogram_magnitude)

  # Normalize mel.
  clip_val = 1e-5
  ref_constant = 1
  mel = np.log(np.clip(mel, a_min=clip_val, a_max=None) * ref_constant)

  return mel

pth = "/content/Parakeet/examples/waveflow/configs/waveflow_ljspeech.yaml"
with open(pth) as f:
  config = yaml.load(f, Loader=yaml.Loader)

config['checkpoint'] = None
config['checkpoint_dir'] = "/content/downloads/waveflow_res128_ljspeech_ckpt_1.0"
config['iteration'] = None
config['name'] = ''
config['output'] = './syn_audios'
config['sample'] = 0
config['use_fp16'] = True
config['use_gpu'] = True

config = Config(**config)
print(config.__dict__)

place = fluid.CUDAPlace(0) if config.use_gpu else fluid.CPUPlace()
with dg.guard(place):
  # Fix random seed.
  seed = config.seed
  random.seed(seed)
  np.random.seed(seed)
  fluid.default_startup_program().random_seed = seed
  fluid.default_main_program().random_seed = seed

  model = WaveFlow(config)

  # Dry run once to create and initialize all necessary parameters.
  dtype = "float16" if config.use_fp16 else "float32"
  audio = dg.to_variable(np.random.randn(1, 16000).astype(dtype))
  mel = dg.to_variable(
    np.random.randn(1, config.mel_bands, 63).astype(dtype))
  model(audio, mel)

  iteration = io.load_parameters(model, checkpoint_dir=config.checkpoint_dir)

  audio, sr = librosa.load(audio_pth)
  mel = dg.to_variable(np.expand_dims(get_mel(audio), axis=0))
  model.infer(mel)

{'valid_size': 16, 'segment_length': 16000, 'sample_rate': 22050, 'fft_window_shift': 256, 'fft_window_size': 1024, 'fft_size': 1024, 'mel_bands': 80, 'mel_fmin': 0.0, 'mel_fmax': 8000.0, 'seed': 1234, 'learning_rate': 0.0002, 'batch_size': 8, 'test_every': 2000, 'save_every': 10000, 'max_iterations': 3000000, 'sigma': 1.0, 'n_flows': 8, 'n_group': 16, 'n_layers': 8, 'n_channels': 64, 'kernel_h': 3, 'kernel_w': 3, 'checkpoint': None, 'checkpoint_dir': '/content/downloads/waveflow_res128_ljspeech_ckpt_1.0', 'iteration': None, 'name': '', 'output': './syn_audios', 'sample': 0, 'use_fp16': True, 'use_gpu': True}
/usr/local/lib/python3.6/dist-packages/paddle/fluid/data_feeder.py:93: UserWarning: The data type of 'input' in assign only support float16 in GPU now. (When the type of input in assign is Variable.)
  (input_name, op_name, extra_message))
/usr/local/lib/python3.6/dist-packages/paddle/fluid/data_feeder.py:93: UserWarning: The data type of 'x' in cast only support float16 in GPU now. 
  (input_name, op_name, extra_message))
/usr/local/lib/python3.6/dist-packages/paddle/fluid/data_feeder.py:93: UserWarning: The data type of 'input' in squeeze only support float16 in GPU now. 
  (input_name, op_name, extra_message))
[1, 80, 684] mel.shape
/usr/local/lib/python3.6/dist-packages/paddle/fluid/data_feeder.py:93: UserWarning: The data type of 'x' in cast only support float16 in GPU now. 
  (input_name, op_name, extra_message))
/usr/local/lib/python3.6/dist-packages/paddle/fluid/data_feeder.py:93: UserWarning: The data type of 'input' in squeeze only support float16 in GPU now. 
  (input_name, op_name, extra_message))
(1, 174832) audio.shape

qo4on commented 4 years ago

Do you have any idea why this is happening?

qo4on commented 4 years ago

I replaced my test audio file with audio from the LJSpeech dataset. It doesn't change anything; I'm still getting noise. To reproduce, run this code in Colab:

!wget -qqq https://paddlespeech.bj.bcebos.com/Parakeet/waveflow_res128_ljspeech_ckpt_1.0.zip > /dev/null
!mkdir -p /content/downloads
!unzip -qqq /content/waveflow_res128_ljspeech_ckpt_1.0.zip -d /content/downloads > /dev/null
!rm -rf /content/sample_data /content/waveflow_res128_ljspeech_ckpt_1.0.zip
!sudo apt-get update -y -qqq --fix-missing && apt-get install -y -qqq libsndfile1 > /dev/null
!pip install -U -qqq imgaug scipy albumentations paddlepaddle-gpu > /dev/null
!git clone -qqq https://github.com/PaddlePaddle/Parakeet > /dev/null
%cd /content/Parakeet
!pip install -qqq -e . > /dev/null

import os, pathlib, nltk, sys
import numpy as np
import matplotlib.pyplot as plt
nltk.download("punkt")
nltk.download("cmudict")

pth_1 = "/content/Parakeet"
if pth_1 not in sys.path: sys.path.insert(0, pth_1)

pth_2 = "/content/Parakeet/examples/waveflow"
if pth_2 not in sys.path: sys.path.insert(0, pth_2)

%cd /content/downloads

!curl -Ls https://dl.dropboxusercontent.com/s/jj78665lrhiod97/LJ001-0001.wav -o LJ001-0001.wav
audio_pth = "/content/downloads/LJ001-0001.wav"

!export CUDA_VISIBLE_DEVICES=0
from parakeet.models.waveflow import waveflow_modules
from parakeet.modules import weight_norm
from parakeet.utils import io
import paddle.fluid.dygraph as dg
from paddle import fluid
from scipy.io.wavfile import read
from scipy.io.wavfile import write
from ruamel import yaml
import random, librosa

class Config():
  def __init__(self, **entries):
    self.__dict__.update(entries)

class WaveFlow():
  def __init__(self,
               config,
               parallel=False,
               rank=0,
               nranks=1,
               tb_logger=None):
    self.config = config
    self.checkpoint_dir = config.checkpoint_dir
    self.parallel = parallel
    self.rank = rank
    self.nranks = nranks
    self.tb_logger = tb_logger
    self.dtype = "float16" if config.use_fp16 else "float32"

  def build(self):
    config = self.config

    waveflow = waveflow_modules.WaveFlowModule(config)

    # Dry run once to create and initialize all necessary parameters.
    audio = dg.to_variable(np.random.randn(1, 16000).astype(self.dtype))
    mel = dg.to_variable(
        np.random.randn(1, config.mel_bands, 63).astype(self.dtype))
    waveflow(audio, mel)

    iteration = io.load_parameters(waveflow, checkpoint_dir=self.checkpoint_dir)

    for layer in waveflow.sublayers():
      if isinstance(layer, weight_norm.WeightNormWrapper):
        layer.remove_weight_norm()    
    self.waveflow = waveflow

    return iteration

  @dg.no_grad
  def infer(self, mel):
    self.waveflow.eval()
    config = self.config
    print(mel.shape, 'mel.shape')
    audio = self.waveflow.synthesize(mel, sigma=self.config.sigma)
    audio = audio[0]

    # Denormalize audio from [-1, 1] to [-32768, 32768] int16 range.
    audio = audio.numpy().astype("float32") * 32768.0
    audio = audio.astype('int16')
    filename = 'test.wav'
    print(audio.shape, 'audio.shape')
    write(filename, config.sample_rate, audio)

def get_mel(audio):
  spectrogram = librosa.core.stft(
      audio,
      n_fft=config.fft_size,
      hop_length=config.fft_window_shift,
      win_length=config.fft_window_size)
  spectrogram_magnitude = np.abs(spectrogram)

  # mel_filter_bank shape: [n_mels, 1 + n_fft/2]
  mel_filter_bank = librosa.filters.mel(sr=config.sample_rate,
                                        n_fft=config.fft_size,
                                        n_mels=config.mel_bands,
                                        fmin=config.mel_fmin,
                                        fmax=config.mel_fmax)
  # mel shape: [n_mels, num_frames]
  mel = np.dot(mel_filter_bank, spectrogram_magnitude)

  # Normalize mel.
  clip_val = 1e-5
  ref_constant = 1
  mel = np.log(np.clip(mel, a_min=clip_val, a_max=None) * ref_constant)

  return mel

def get_config(pth="/content/Parakeet/examples/waveflow/configs/waveflow_ljspeech.yaml"):
  with open(pth) as f:
    config = yaml.load(f, Loader=yaml.Loader)

  config['checkpoint'] = None
  config['checkpoint_dir'] = "/content/downloads/waveflow_res128_ljspeech_ckpt_1.0"
  config['iteration'] = None
  config['name'] = ''
  config['output'] = './syn_audios'
  config['sample'] = 0
  config['use_fp16'] = True
  config['use_gpu'] = True
  return Config(**config)

config = get_config()
print(config.__dict__)

place = fluid.CUDAPlace(0) if config.use_gpu else fluid.CPUPlace()
with dg.guard(place):
  # Fix random seed.
  seed = config.seed
  random.seed(seed)
  np.random.seed(seed)
  fluid.default_startup_program().random_seed = seed
  fluid.default_main_program().random_seed = seed

  # Build model.
  model = WaveFlow(config)
  iteration = model.build()
  print(iteration, "iteration")
  # Obtain the current iteration.
  if config.checkpoint is None:
    if config.iteration is None:
      print("_load_latest_checkpoint")
      iteration = io._load_latest_checkpoint(config.checkpoint_dir)
    else:
      iteration = config.iteration
  else:
    iteration = int(config.checkpoint.split('/')[-1].split('-')[-1])
  print(config.checkpoint_dir, iteration)
  loaded_sr, audio = read(audio_pth)
  mel = dg.to_variable(np.expand_dims(get_mel(np.asarray(audio, dtype=np.float32)), axis=0))
  model.infer(mel)

{'valid_size': 16, 'segment_length': 16000, 'sample_rate': 22050, 'fft_window_shift': 256, 'fft_window_size': 1024, 'fft_size': 1024, 'mel_bands': 80, 'mel_fmin': 0.0, 'mel_fmax': 8000.0, 'seed': 1234, 'learning_rate': 0.0002, 'batch_size': 8, 'test_every': 2000, 'save_every': 10000, 'max_iterations': 3000000, 'sigma': 1.0, 'n_flows': 8, 'n_group': 16, 'n_layers': 8, 'n_channels': 64, 'kernel_h': 3, 'kernel_w': 3, 'checkpoint': None, 'checkpoint_dir': '/content/downloads/waveflow_res128_ljspeech_ckpt_1.0', 'iteration': None, 'name': '', 'output': './syn_audios', 'sample': 0, 'use_fp16': True, 'use_gpu': True}
/usr/local/lib/python3.6/dist-packages/paddle/fluid/data_feeder.py:93: UserWarning: The data type of 'input' in assign only support float16 in GPU now. (When the type of input in assign is Variable.)
  (input_name, op_name, extra_message))
/usr/local/lib/python3.6/dist-packages/paddle/fluid/data_feeder.py:93: UserWarning: The data type of 'x' in cast only support float16 in GPU now. 
  (input_name, op_name, extra_message))
/usr/local/lib/python3.6/dist-packages/paddle/fluid/data_feeder.py:93: UserWarning: The data type of 'input' in squeeze only support float16 in GPU now. 
  (input_name, op_name, extra_message))
0 iteration
_load_latest_checkpoint
/content/downloads/waveflow_res128_ljspeech_ckpt_1.0 0
[1, 80, 832] mel.shape
(212720,) audio.shape

qo4on commented 4 years ago

I found a bug in your official code example. In synthesis.py -> synthesize, replace:

iteration = io.load_latest_checkpoint(checkpoint_dir)

with

iteration = io._load_latest_checkpoint(checkpoint_dir)

I also tried your official command line. I changed checkpoint_dir in synthesis.py -> synthesize to my downloaded checkpoint path:

checkpoint_dir = "/content/downloads/waveflow_res128_ljspeech_ckpt_1.0"

and ran the official command:

!export CUDA_VISIBLE_DEVICES=0
!python -u synthesis.py \
    --config=./configs/waveflow_ljspeech.yaml \
    --root=./data/LJSpeech-1.1 \
    --name=ModelName --use_gpu=true \
    --output=./syn_audios \
    --sample=0 \
    --sigma=1.0

{'batch_size': 8,
 'checkpoint': None,
 'config': './configs/waveflow_ljspeech.yaml',
 'fft_size': 1024,
 'fft_window_shift': 256,
 'fft_window_size': 1024,
 'iteration': None,
 'kernel_h': 3,
 'kernel_w': 3,
 'learning_rate': 0.0002,
 'max_iterations': 3000000,
 'mel_bands': 80,
 'mel_fmax': 8000.0,
 'mel_fmin': 0.0,
 'model': 'waveflow',
 'n_channels': 64,
 'n_flows': 8,
 'n_group': 16,
 'n_layers': 8,
 'name': 'ModelName',
 'output': './syn_audios',
 'root': './data/LJSpeech-1.1',
 'sample': 0,
 'sample_rate': 22050,
 'save_every': 10000,
 'seed': 1234,
 'segment_length': 16000,
 'sigma': 1.0,
 'test_every': 2000,
 'use_fp16': True,
 'use_gpu': True,
 'valid_size': 16}
Random Seed:  1234
W0411 09:45:54.968588  1820 device_context.cc:237] Please NOTE: device: 0, CUDA Capability: 75, Driver API Version: 10.1, Runtime API Version: 10.0
W0411 09:45:54.972355  1820 device_context.cc:245] device: 0, cuDNN Version: 7.6.
/usr/local/lib/python3.6/dist-packages/paddle/fluid/data_feeder.py:93: UserWarning: The data type of 'input' in assign only support float16 in GPU now. (When the type of input in assign is Variable.)
  (input_name, op_name, extra_message))
/usr/local/lib/python3.6/dist-packages/paddle/fluid/data_feeder.py:93: UserWarning: The data type of 'x' in cast only support float16 in GPU now. 
  (input_name, op_name, extra_message))
/usr/local/lib/python3.6/dist-packages/paddle/fluid/data_feeder.py:93: UserWarning: The data type of 'input' in squeeze only support float16 in GPU now. 
  (input_name, op_name, extra_message))
Rank 0: checkpoint loaded.
Synthesize sample 0, save as ./syn_audios/ModelName/iter-0/valid_0.wav
audio time 9.6472, synthesis time 1.2663

The result was an audio file with just noise.

I also tried your smaller model: https://paddlespeech.bj.bcebos.com/Parakeet/waveflow_res64_ljspeech_ckpt_1.0.zip The result was the same: an audio file of noise.

qo4on commented 4 years ago

I hope that @kuke can comment on a possible solution when he has time.

kuke commented 4 years ago

@qo4on Sorry for the delayed reply.

The bug you mentioned has already been fixed; please update the code. Then try to run the synthesis this way:

python -u synthesis.py \
    --root=LJSpeech-1.1 \
    --name=${ModelName} --use_gpu=true \
    --output=./syn_audios \
    --sigma=1.0 \
    --use_fp16=true \
    --config=waveflow_res128_ljspeech_ckpt_1.0/waveflow_ljspeech.yaml \
    --checkpoint=waveflow_res128_ljspeech_ckpt_1.0/step-2000000
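
(Note that --checkpoint takes the checkpoint base name step-2000000 without the .pdparams extension, as mentioned earlier in this thread.)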

There are some guidelines in the README, but I'm not sure they are clear enough.

qo4on commented 4 years ago

@kuke Thank you very much!

GuangfuWang commented 2 years ago

Please refer to the following doc for how to deploy inference model: https://www.paddlepaddle.org.cn/documentation/docs/en/advanced_guide/inference_deployment/index_en.html

So are there any simple C++/Python examples showing how to load a Paddle detection model and use it for inference? I checked the inference section, but only a model compression intro is provided...