Rayhane-mamah / Tacotron-2

DeepMind's Tacotron-2 Tensorflow implementation
MIT License

Training on custom data went well, but when I try to synthesise voice for new text this error occurs #498

Open Stanley80 opened 4 years ago

Stanley80 commented 4 years ago

I trained Tacotron for days on my Mac OS X CPU, but synthesis returns the following error.

My system is Python 3.7, TF 1.13.1, Keras 2.2.2, librosa 0.6.2.

Using TensorFlow backend.
loaded model at logs-Tacotron/taco_pretrained/tacotron_model.ckpt-1800
Hyperparameters:
  GL_on_GPU: False
  NN_init: True
  NN_scaler: 0.3
  allow_clipping_in_normalization: True
  attention_dim: 128
  attention_filters: 32
  attention_kernel: (31,)
  attention_win_size: 7
  batch_norm_position: after
  cbhg_conv_channels: 128
  cbhg_highway_units: 128
  cbhg_highwaynet_layers: 4
  cbhg_kernels: 8
  cbhg_pool_size: 2
  cbhg_projection: 256
  cbhg_projection_kernel_size: 3
  cbhg_rnn_units: 128
  cdf_loss: False
  cin_channels: 80
  cleaners: basic_cleaners
  clip_for_wavenet: True
  clip_mels_length: True
  clip_outputs: True
  cross_entropy_pos_weight: 1
  cumulative_weights: True
  decoder_layers: 2
  decoder_lstm_units: 1024
  embedding_dim: 512
  enc_conv_channels: 512
  enc_conv_kernel_size: (5,)
  enc_conv_num_layers: 3
  encoder_lstm_units: 256
  fmax: 6600
  fmin: 55
  frame_shift_ms: None
  freq_axis_kernel_size: 3
  gate_channels: 256
  gin_channels: -1
  griffin_lim_iters: 60
  hop_size: 551
  input_type: raw
  kernel_size: 3
  layers: 20
  leaky_alpha: 0.4
  legacy: True
  log_scale_min: -32.23619130191664
  log_scale_min_gauss: -16.11809565095832
  lower_bound_decay: 0.1
  magnitude_power: 2.0
  mask_decoder: False
  mask_encoder: True
  max_abs_value: 4.0
  max_iters: 20000
  max_mel_frames: 900
  max_time_sec: None
  max_time_steps: 11000
  min_level_db: -100
  n_fft: 1100
  n_speakers: 5
  normalize_for_wavenet: False
  num_freq: 551
  num_mels: 80
  out_channels: 2
  outputs_per_step: 2
  postnet_channels: 512
  postnet_kernel_size: (5,)
  postnet_num_layers: 5
  power: 1.5
  predict_linear: True
  preemphasis: 0.97
  preemphasize: True
  prenet_layers: [256, 256]
  quantize_channels: 65536
  ref_level_db: 20
  rescale: True
  rescaling_max: 0.999
  residual_channels: 128
  residual_legacy: True
  sample_rate: 44100
  signal_normalization: True
  silence_threshold: 2
  skip_out_channels: 128
  smoothing: True
  speakers: ['speaker0', 'speaker1', 'speaker2', 'speaker3', 'speaker4']
  speakers_path: None
  split_on_cpu: True
  stacks: 2
  stop_at_any: True
  symmetric_mels: True
  synthesis_constraint: False
  synthesis_constraint_type: window
  tacotron_adam_beta1: 0.9
  tacotron_adam_beta2: 0.999
  tacotron_adam_epsilon: 1e-06
  tacotron_batch_size: 32
  tacotron_clip_gradients: True
  tacotron_data_random_state: 1234
  tacotron_decay_learning_rate: True
  tacotron_decay_rate: 0.5
  tacotron_decay_steps: 18000
  tacotron_dropout_rate: 0.5
  tacotron_final_learning_rate: 0.0001
  tacotron_fine_tuning: False
  tacotron_initial_learning_rate: 0.001
  tacotron_natural_eval: False
  tacotron_num_gpus: 1
  tacotron_random_seed: 5339
  tacotron_reg_weight: 1e-06
  tacotron_scale_regularization: False
  tacotron_start_decay: 40000
  tacotron_swap_with_cpu: False
  tacotron_synthesis_batch_size: 1
  tacotron_teacher_forcing_decay_alpha: None
  tacotron_teacher_forcing_decay_steps: 40000
  tacotron_teacher_forcing_final_ratio: 0.0
  tacotron_teacher_forcing_init_ratio: 1.0
  tacotron_teacher_forcing_mode: constant
  tacotron_teacher_forcing_ratio: 1.0
  tacotron_teacher_forcing_start_decay: 10000
  tacotron_test_batches: None
  tacotron_test_size: 0.05
  tacotron_zoneout_rate: 0.1
  train_with_GTA: True
  trim_fft_size: 2048
  trim_hop_size: 512
  trim_silence: True
  trim_top_db: 40
  upsample_activation: Relu
  upsample_scales: [11, 25]
  upsample_type: SubPixel
  use_bias: True
  use_lws: False
  use_speaker_embedding: True
  wavenet_adam_beta1: 0.9
  wavenet_adam_beta2: 0.999
  wavenet_adam_epsilon: 1e-06
  wavenet_batch_size: 8
  wavenet_clip_gradients: True
  wavenet_data_random_state: 1234
  wavenet_debug_mels: ['training_data/mels/mel-LJ001-0008.npy']
  wavenet_debug_wavs: ['training_data/audio/audio-LJ001-0008.npy']
  wavenet_decay_rate: 0.5
  wavenet_decay_steps: 200000
  wavenet_dropout: 0.05
  wavenet_ema_decay: 0.9999
  wavenet_gradient_max_norm: 100.0
  wavenet_gradient_max_value: 5.0
  wavenet_init_scale: 1.0
  wavenet_learning_rate: 0.001
  wavenet_lr_schedule: exponential
  wavenet_natural_eval: False
  wavenet_num_gpus: 1
  wavenet_pad_sides: 1
  wavenet_random_seed: 5339
  wavenet_swap_with_cpu: False
  wavenet_synth_debug: False
  wavenet_synthesis_batch_size: 20
  wavenet_test_batches: 1
  wavenet_test_size: None
  wavenet_warmup: 4000.0
  wavenet_weight_normalization: False
  win_size: 1100
Constructing model: Tacotron

Initialized Tacotron model. Dimensions (? = dynamic shape):
  Train mode: False
  Eval mode: False
  GTA mode: False
  Synthesis mode: True
  Input: (?, ?)
  device: 0
  embedding: (?, ?, 512)
  enc conv out: (?, ?, 512)
  encoder out: (?, ?, 512)
  decoder out: (?, ?, 80)
  residual out: (?, ?, 512)
  projected residual out: (?, ?, 80)
  mel out: (?, ?, 80)
  linear out: (?, ?, 551)
  out: (?, ?)
  Tacotron Parameters 29.023 Million.
Loading checkpoint: logs-Tacotron/taco_pretrained/tacotron_model.ckpt-1800
WARNING:tensorflow:From /Users/davidecangelosi/Desktop/workspace/venv3_01/lib/python3.7/site-packages/tensorflow/python/training/saver.py:1266: checkpoint_exists (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file APIs to check for files with this prefix.
Starting Synthesis
0%| | 0/13 [00:00 main()
  File "synthesize.py", line 90, in main
    _ = tacotron_synthesize(args, hparams, taco_checkpoint, sentences)
  File "/Users/davidecangelosi/Desktop/workspace/venv3_01/Tacotron-2/tacotron/synthesize.py", line 136, in tacotron_synthesize
    return run_eval(args, checkpoint_path, output_dir, hparams, sentences)
  File "/Users/davidecangelosi/Desktop/workspace/venv3_01/Tacotron-2/tacotron/synthesize.py", line 69, in run_eval
    mel_filenames, speaker_ids = synth.synthesize(texts, basenames, eval_dir, log_dir, None)
  File "/Users/davidecangelosi/Desktop/workspace/venv3_01/Tacotron-2/tacotron/synthesizer.py", line 219, in synthesize
    audio.save_wav(wav, os.path.join(log_dir, 'wavs/wav-{}-mel.wav'.format(basenames[i])), sr=hparams.sample_rate)
  File "/Users/davidecangelosi/Desktop/workspace/venv3_01/Tacotron-2/datasets/audio.py", line 13, in save_wav
    wav *= 32767 / max(0.01, np.max(np.abs(wav)))
  File "/Users/davidecangelosi/Desktop/workspace/venv3_01/lib/python3.7/site-packages/numpy/core/fromnumeric.py", line 2320, in amax
    out=out, **kwargs)
  File "/Users/davidecangelosi/Desktop/workspace/venv3_01/lib/python3.7/site-packages/numpy/core/_methods.py", line 26, in _amax
    return umr_maximum(a, axis, None, out, keepdims)
ValueError: zero-size array to reduction operation maximum which has no identity

I am quite frustrated after weeks of attempts to solve this issue. Can someone help me? Thank you very much.

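
For reference, the last frames of the traceback can be reproduced in isolation: `np.max` raises this `ValueError` whenever it is given an empty array, which suggests that `wav` had zero samples by the time it reached `save_wav`. The sketch below is only an illustration. `scale_to_int16_range` is a hypothetical helper, not code from the repository, and it only makes the failure easier to read; it does not fix whatever produced the empty waveform.

```python
import numpy as np


def scale_to_int16_range(wav, floor=0.01):
    """Hypothetical helper: peak-scale a waveform, rejecting empty input.

    Mirrors the scaling expression shown in the traceback, but checks the
    array size first so the error names the real problem.
    """
    wav = np.asarray(wav, dtype=np.float32)
    if wav.size == 0:
        # np.max on an empty array is what raises the error in the log above.
        raise ValueError("synthesized waveform is empty; check the input "
                         "text and the model output before saving")
    return wav * (32767 / max(floor, np.max(np.abs(wav))))


if __name__ == "__main__":
    # Reproduce the original failure with an empty array.
    try:
        np.max(np.abs(np.array([])))
    except ValueError as err:
        print("NumPy says:", err)

    # The guarded version fails with a more descriptive message.
    try:
        scale_to_int16_range([])
    except ValueError as err:
        print("Guard says:", err)
```

Printing `wav.shape` just before the `save_wav` call in `synthesizer.py` would be one way to confirm that the array is empty for a particular sentence.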
Stanley80 commented 4 years ago

I read that \ufeff is the BOM or "Byte Order Mark".
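
If the sentences file was saved with a BOM, the first line passed to the synthesizer begins with the invisible character `\ufeff`. Whether that is what leads to the empty output here is an assumption; the traceback does not confirm it. A minimal sketch of reading a text file so that the BOM is dropped, using Python's built-in `utf-8-sig` codec (`load_sentences`, `strip_bom`, and the file name are hypothetical and not part of the repository):

```python
BOM = "\ufeff"


def strip_bom(text):
    """Remove a single leading U+FEFF from an already-decoded string."""
    return text[len(BOM):] if text.startswith(BOM) else text


def load_sentences(path):
    """Read non-empty lines from a UTF-8 text file, ignoring a leading BOM.

    The 'utf-8-sig' codec decodes UTF-8 and skips the BOM if one is present,
    so files with and without a BOM are read identically.
    """
    with open(path, encoding="utf-8-sig") as handle:
        return [line.strip() for line in handle if line.strip()]


if __name__ == "__main__":
    # Hypothetical file name, used only for this demonstration.
    demo_path = "sentences_demo.txt"
    with open(demo_path, "w", encoding="utf-8-sig") as handle:
        handle.write("Hello world.\nSecond sentence.\n")

    print(load_sentences(demo_path))
    print(repr(strip_bom("\ufeffHello world.")))
```

Saving the text file as plain UTF-8 without a BOM in the editor avoids the issue without any code change.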