TensorSpeech / TensorFlowTTS

:stuck_out_tongue_closed_eyes: TensorFlowTTS: Real-Time State-of-the-art Speech Synthesis for Tensorflow 2 (supported including English, French, Korean, Chinese, German and Easy to adapt for other languages)
https://tensorspeech.github.io/TensorFlowTTS/
Apache License 2.0

Inference failed, ValueError: Shapes (149, 512) and (199, 512) are incompatible #382

Closed ErfolgreichCharismatisch closed 3 years ago

ErfolgreichCharismatisch commented 3 years ago

I am using batch size 2 in tacotron2.v1.yaml with a custom dataset, myDataset (22050 Hz, mono, 10 h total). I preprocessed and normalized it, then started training with

python examples/tacotron2/train_tacotron2.py --train-dir ./dump_myDataset/train/ --dev-dir ./dump_myDataset/valid/ --outdir ./examples/tacotron2/exp/train.tacotron2.v1/ --config ./examples/tacotron2/conf/tacotron2.v1.yaml --use-norm 1 --mixed_precision 0 --resume ""

and waited for a few checkpoints.

I am trying to use the standard python script for inference with my checkpoint model-2073.h5 and the pretrained melgan model. I am using the ljspeech_mapper.json from the dump folder of my model.

I get

Traceback (most recent call last):
  File "standardpythonscript.py", line 15, in <module>
    pretrained_path="./examples/tacotron2/exp/train.tacotron2.v1/checkpoints/model-2073.h5"
  File "E:\tts\inference\auto_model.py", line 69, in from_pretrained
    model.load_weights(pretrained_path)
  File "E:\Anaconda\envs\myEnv\lib\site-packages\tensorflow\python\keras\engine\training.py", line 2211, in load_weights
    hdf5_format.load_weights_from_hdf5_group(f, self.layers)
  File "E:\Anaconda\envs\myEnv\lib\site-packages\tensorflow\python\keras\saving\hdf5_format.py", line 708, in load_weights_from_hdf5_group
    K.batch_set_value(weight_value_tuples)
  File "E:\Anaconda\envs\myEnv\lib\site-packages\tensorflow\python\util\dispatch.py", line 201, in wrapper
    return target(*args, **kwargs)
  File "E:\Anaconda\envs\myEnv\lib\site-packages\tensorflow\python\keras\backend.py", line 3576, in batch_set_value
    x.assign(np.asarray(value, dtype=dtype(x)))
  File "E:\Anaconda\envs\myEnv\lib\site-packages\tensorflow\python\ops\resource_variable_ops.py", line 858, in assign
    self._shape.assert_is_compatible_with(value_tensor.shape)
  File "E:\Anaconda\envs\myEnv\lib\site-packages\tensorflow\python\framework\tensor_shape.py", line 1134, in assert_is_compatible_with
    raise ValueError("Shapes %s and %s are incompatible" % (self, other))
ValueError: Shapes (149, 512) and (199, 512) are incompatible

import numpy as np
import soundfile as sf
import yaml

import tensorflow as tf

from tensorflow_tts.inference import AutoConfig
from tensorflow_tts.inference import TFAutoModel
from tensorflow_tts.inference import AutoProcessor

# initialize tacotron2 model.
fs_config = AutoConfig.from_pretrained('./examples/tacotron2/conf/tacotron2.v1.yaml')
tacotron2 = TFAutoModel.from_pretrained(
    config=fs_config,
    pretrained_path="./examples/tacotron2/exp/train.tacotron2.v1/checkpoints/model-2073.h5"
)

# initialize melgan model
melgan_config = AutoConfig.from_pretrained('./examples/melgan/conf/melgan.v1.yaml')
melgan = TFAutoModel.from_pretrained(
    config=melgan_config,
    pretrained_path="./examples/melgan/checkpoint/generator-1500000.h5"
)

# inference
processor = AutoProcessor.from_pretrained(pretrained_path="./dump_myDataset/ljspeech_mapper.json")

ids = processor.text_to_sequence("my german testing sentence")
ids = tf.expand_dims(ids, 0)
# tacotron2 inference

masked_mel_before, masked_mel_after, duration_outputs = tacotron2.inference(
    ids,
    speaker_ids=tf.zeros(shape=[tf.shape(ids)[0]], dtype=tf.int32),
    speed_ratios=tf.constant([1.0], dtype=tf.float32)
)

# melgan inference
audio_before = melgan.inference(masked_mel_before)[0, :, 0]
audio_after = melgan.inference(masked_mel_after)[0, :, 0]

# save to file
sf.write('./audio_before.wav', audio_before, 22050, "PCM_16")
sf.write('./audio_after.wav', audio_after, 22050, "PCM_16")

Ideas?

dathudeptrai commented 3 years ago

@ErfolgreichCharismatisch can you share your mapper.json? 149 is the number of ljspeech characters, but it seems your model has 199 character/phoneme inputs. Did you change your config file? Please see here (https://github.com/TensorSpeech/TensorFlowTTS/blob/master/examples/tacotron2/conf/tacotron2.v1.yaml#L19). You should write your own processor and your own symbol set, then add your vocab_size here (https://github.com/TensorSpeech/TensorFlowTTS/blob/master/tensorflow_tts/configs/tacotron2.py#L59-L68)
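
One way to read the two shapes, going by the traceback above: the exception is raised from `self._shape.assert_is_compatible_with(value_tensor.shape)`, so the first shape belongs to the variable in the freshly built model and the second to the value read from the checkpoint. The sketch below is plain Python with no TensorFlow dependency. The helper name `describe_mismatch` is made up for illustration, and reading axis 0 as the vocabulary size assumes the mismatching variable is the character embedding table.

```python
def describe_mismatch(model_shape, checkpoint_shape):
    """Summarize a weight-shape mismatch (hypothetical helper, not a TensorFlowTTS API)."""
    if len(model_shape) != len(checkpoint_shape):
        return "rank differs: {} vs {}".format(len(model_shape), len(checkpoint_shape))
    # Collect every axis on which the two shapes disagree.
    differing = [
        axis
        for axis, (m, c) in enumerate(zip(model_shape, checkpoint_shape))
        if m != c
    ]
    if not differing:
        return "shapes are compatible"
    parts = [
        "axis {}: model built with {}, checkpoint holds {}".format(
            axis, model_shape[axis], checkpoint_shape[axis]
        )
        for axis in differing
    ]
    return "; ".join(parts)


# Shapes taken from the ValueError in this issue.
message = describe_mismatch((149, 512), (199, 512))
print(message)
```

Under that reading, the model constructed at inference time expects 149 symbols while the saved weights were trained with 199, so the config used to build the model and the one used for training disagree on the vocabulary size.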

ErfolgreichCharismatisch commented 3 years ago

ljspeech_mapper.json

{"symbol_to_id": {"pad": 0, "-": 1, "!": 2, "'": 3, "(": 4, ")": 5, ",": 6, ".": 7, ":": 8, ";": 9, "?": 10, " ": 11, "a": 12, "A": 13, "à": 14, "À": 15, "ả": 16, "Ả": 17, "ã": 18, "Ã": 19, "á": 20, "Á": 21, "ạ": 22, "Ạ": 23, "ă": 24, "Ă": 25, "ằ": 26, "Ằ": 27, "ẳ": 28, "Ẳ": 29, "ẵ": 30, "Ẵ": 31, "ắ": 32, "Ắ": 33, "ặ": 34, "Ặ": 35, "â": 36, "Â": 37, "ầ": 38, "Ầ": 39, "ẩ": 40, "Ẩ": 41, "ẫ": 42, "Ẫ": 43, "ấ": 44, "Ấ": 45, "ậ": 46, "Ậ": 47, "b": 48, "B": 49, "c": 50, "C": 51, "d": 52, "D": 53, "đ": 54, "Đ": 55, "e": 56, "E": 57, "è": 58, "È": 59, "ẻ": 60, "Ẻ": 61, "ẽ": 62, "Ẽ": 63, "é": 64, "É": 65, "ẹ": 66, "Ẹ": 67, "ê": 68, "Ê": 69, "ề": 70, "Ề": 71, "ể": 72, "Ể": 73, "ễ": 74, "Ễ": 75, "ế": 76, "Ế": 77, "ệ": 78, "Ệ": 79, "f": 80, "F": 81, "g": 82, "G": 83, "h": 84, "H": 85, "i": 86, "I": 87, "ì": 88, "Ì": 89, "ỉ": 90, "Ỉ": 91, "ĩ": 92, "Ĩ": 93, "í": 94, "Í": 95, "ị": 96, "Ị": 97, "j": 98, "J": 99, "k": 100, "K": 101, "l": 102, "L": 103, "m": 104, "M": 105, "n": 106, "N": 107, "o": 108, "O": 109, "ò": 110, "Ò": 111, "ỏ": 112, "Ỏ": 113, "õ": 114, "Õ": 115, "ó": 116, "Ó": 117, "ọ": 118, "Ọ": 119, "ô": 120, "Ô": 121, "ồ": 122, "Ồ": 123, "ổ": 124, "Ổ": 125, "ỗ": 126, "Ỗ": 127, "ố": 128, "Ố": 129, "ộ": 130, "Ộ": 131, "ơ": 132, "Ơ": 133, "ờ": 134, "Ờ": 135, "ở": 136, "Ở": 137, "ỡ": 138, "Ỡ": 139, "ớ": 140, "Ớ": 141, "ợ": 142, "Ợ": 143, "p": 144, "P": 145, "q": 146, "Q": 147, "r": 148, "R": 149, "s": 150, "S": 151, "t": 152, "T": 153, "u": 154, "U": 155, "ù": 156, "Ù": 157, "ủ": 158, "Ủ": 159, "ũ": 160, "Ũ": 161, "ú": 162, "Ú": 163, "ụ": 164, "Ụ": 165, "ư": 166, "Ư": 167, "ừ": 168, "Ừ": 169, "ử": 170, "Ử": 171, "ữ": 172, "Ữ": 173, "ứ": 174, "Ứ": 175, "ự": 176, "Ự": 177, "v": 178, "V": 179, "w": 180, "W": 181, "x": 182, "X": 183, "y": 184, "Y": 185, "ỳ": 186, "Ỳ": 187, "ỷ": 188, "Ỷ": 189, "ỹ": 190, "Ỹ": 191, "ý": 192, "Ý": 193, "ỵ": 194, "Ỵ": 195, "z": 196, "Z": 197, "eos": 198}, "id_to_symbol": {"0": "pad", "1": "-", "2": "!", "3": "'", "4": "(", "5": ")", "6": ",", "7": 
".", "8": ":", "9": ";", "10": "?", "11": " ", "12": "a", "13": "A", "14": "à", "15": "À", "16": "ả", "17": "Ả", "18": "ã", "19": "Ã", "20": "á", "21": "Á", "22": "ạ", "23": "Ạ", "24": "ă", "25": "Ă", "26": "ằ", "27": "Ằ", "28": "ẳ", "29": "Ẳ", "30": "ẵ", "31": "Ẵ", "32": "ắ", "33": "Ắ", "34": "ặ", "35": "Ặ", "36": "â", "37": "Â", "38": "ầ", "39": "Ầ", "40": "ẩ", "41": "Ẩ", "42": "ẫ", "43": "Ẫ", "44": "ấ", "45": "Ấ", "46": "ậ", "47": "Ậ", "48": "b", "49": "B", "50": "c", "51": "C", "52": "d", "53": "D", "54": "đ", "55": "Đ", "56": "e", "57": "E", "58": "è", "59": "È", "60": "ẻ", "61": "Ẻ", "62": "ẽ", "63": "Ẽ", "64": "é", "65": "É", "66": "ẹ", "67": "Ẹ", "68": "ê", "69": "Ê", "70": "ề", "71": "Ề", "72": "ể", "73": "Ể", "74": "ễ", "75": "Ễ", "76": "ế", "77": "Ế", "78": "ệ", "79": "Ệ", "80": "f", "81": "F", "82": "g", "83": "G", "84": "h", "85": "H", "86": "i", "87": "I", "88": "ì", "89": "Ì", "90": "ỉ", "91": "Ỉ", "92": "ĩ", "93": "Ĩ", "94": "í", "95": "Í", "96": "ị", "97": "Ị", "98": "j", "99": "J", "100": "k", "101": "K", "102": "l", "103": "L", "104": "m", "105": "M", "106": "n", "107": "N", "108": "o", "109": "O", "110": "ò", "111": "Ò", "112": "ỏ", "113": "Ỏ", "114": "õ", "115": "Õ", "116": "ó", "117": "Ó", "118": "ọ", "119": "Ọ", "120": "ô", "121": "Ô", "122": "ồ", "123": "Ồ", "124": "ổ", "125": "Ổ", "126": "ỗ", "127": "Ỗ", "128": "ố", "129": "Ố", "130": "ộ", "131": "Ộ", "132": "ơ", "133": "Ơ", "134": "ờ", "135": "Ờ", "136": "ở", "137": "Ở", "138": "ỡ", "139": "Ỡ", "140": "ớ", "141": "Ớ", "142": "ợ", "143": "Ợ", "144": "p", "145": "P", "146": "q", "147": "Q", "148": "r", "149": "R", "150": "s", "151": "S", "152": "t", "153": "T", "154": "u", "155": "U", "156": "ù", "157": "Ù", "158": "ủ", "159": "Ủ", "160": "ũ", "161": "Ũ", "162": "ú", "163": "Ú", "164": "ụ", "165": "Ụ", "166": "ư", "167": "Ư", "168": "ừ", "169": "Ừ", "170": "ử", "171": "Ử", "172": "ữ", "173": "Ữ", "174": "ứ", "175": "Ứ", "176": "ự", "177": "Ự", "178": "v", "179": "V", "180": "w", "181": "W", 
"182": "x", "183": "X", "184": "y", "185": "Y", "186": "ỳ", "187": "Ỳ", "188": "ỷ", "189": "Ỷ", "190": "ỹ", "191": "Ỹ", "192": "ý", "193": "Ý", "194": "ỵ", "195": "Ỵ", "196": "z", "197": "Z", "198": "eos"}, "speakers_map": {"ljspeech": 0}, "processor_name": "LJSpeechProcessor"}

ErfolgreichCharismatisch commented 3 years ago

you should write your own processor and your own symbol set

How?

dathudeptrai commented 3 years ago

you should write your own processor and your own symbol set

How?

What config did you use for training, and what is your dataset's language?

ErfolgreichCharismatisch commented 3 years ago

language: german

config:

difference between mine and the original one

- 19:2:     dataset: ljspeech
+ 19:2:     dataset: myTrainingDataset
- 50:2: batch_size: 32             # Batch size for each GPU with assuming that gradient_accumulation_steps == 1.
+ 50:2: batch_size: 2             # Batch size.
- 55:2: use_fixed_shapes: true     # use_fixed_shapes for training (2x speed-up)
+ 55:2: use_fixed_shapes: false    # use_fixed_shapes for training (2x speed-up)
- 68:2: gradient_accumulation_steps: 1
- 76:2: save_interval_steps: 2000               # Interval steps to save checkpoint.
+ 76:2: save_interval_steps: 200               # Interval steps to save checkpoint.
- 87:2: 

For reference

# This is the hyperparameter configuration file for Tacotron2 v1.
# Please make sure this is adjusted for the LJSpeech dataset. If you want to
# apply to the other dataset, you might need to carefully change some parameters.
# This configuration performs 200k iters but 65k iters is enough to get a good models.

###########################################################
#                FEATURE EXTRACTION SETTING               #
###########################################################
hop_size: 256            # Hop size.
format: "npy"

###########################################################
#              NETWORK ARCHITECTURE SETTING               #
###########################################################
model_type: "tacotron2"

tacotron2_params:
    dataset: myTrainingDataset
    embedding_hidden_size: 512
    initializer_range: 0.02
    embedding_dropout_prob: 0.1
    n_speakers: 1
    n_conv_encoder: 5
    encoder_conv_filters: 512
    encoder_conv_kernel_sizes: 5
    encoder_conv_activation: 'relu'
    encoder_conv_dropout_rate: 0.5
    encoder_lstm_units: 256
    n_prenet_layers: 2
    prenet_units: 256
    prenet_activation: 'relu'
    prenet_dropout_rate: 0.5
    n_lstm_decoder: 1
    reduction_factor: 1
    decoder_lstm_units: 1024
    attention_dim: 128
    attention_filters: 32
    attention_kernel: 31
    n_mels: 80
    n_conv_postnet: 5
    postnet_conv_filters: 512
    postnet_conv_kernel_sizes: 5
    postnet_dropout_rate: 0.1
    attention_type: "lsa"

###########################################################
#                  DATA LOADER SETTING                    #
###########################################################
batch_size: 2             # Batch size.
remove_short_samples: true # Whether to remove samples the length of which are less than batch_max_steps.
allow_cache: true          # Whether to allow cache in dataset. If true, it requires cpu memory.
mel_length_threshold: 32   # remove all targets has mel_length <= 32 
is_shuffle: true           # shuffle dataset after each epoch.
use_fixed_shapes: false    # use_fixed_shapes for training (2x speed-up)
                           # refer (https://github.com/dathudeptrai/TensorflowTTS/issues/34#issuecomment-642309118)

###########################################################
#             OPTIMIZER & SCHEDULER SETTING               #
###########################################################
optimizer_params:
    initial_learning_rate: 0.001
    end_learning_rate: 0.00001
    decay_steps: 150000          # < train_max_steps is recommend.
    warmup_proportion: 0.02
    weight_decay: 0.001

var_train_expr: null  # trainable variable expr (eg. 'embeddings|decoder_cell' )
                      # must separate by |. if var_train_expr is null then we 
                      # training all variables.
###########################################################
#                    INTERVAL SETTING                     #
###########################################################
train_max_steps: 200000                 # Number of training steps.
save_interval_steps: 200               # Interval steps to save checkpoint.
eval_interval_steps: 500                # Interval steps to evaluate the network.
log_interval_steps: 200                 # Interval steps to record the training log.
start_schedule_teacher_forcing: 200001  # don't need to apply schedule teacher forcing.
start_ratio_value: 0.5                  # start ratio of scheduled teacher forcing.
schedule_decay_steps: 50000             # decay step scheduled teacher forcing.
end_ratio_value: 0.0                    # end ratio of scheduled teacher forcing.
###########################################################
#                     OTHER SETTING                       #
###########################################################
num_save_intermediate_results: 1  # Number of results to be saved as intermediate results.

dathudeptrai commented 3 years ago

myTrainingDataset

I saw that you use tacotron2.v1.yaml in your training script. Your dataset is German, so why do you use the ljspeech mapper? Did you add your dataset here (https://github.com/TensorSpeech/TensorFlowTTS/blob/master/tensorflow_tts/configs/tacotron2.py#L59-L68)?

ErfolgreichCharismatisch commented 3 years ago

Yes, I did:

        elif dataset == "myTrainingDataset":
            self.vocab_size = 199
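
The branch above hardcodes 199. A possible alternative, sketched here with made-up names (`SYMBOLS_BY_DATASET` and `resolve_vocab_size` are not TensorFlowTTS APIs, and the symbol lists are placeholders rather than real symbol sets), derives the size from the symbol list and fails loudly for an unregistered dataset name, so a typo in the `dataset:` field cannot silently select another vocabulary.

```python
# Placeholder symbol lists; real ones would come from the processor modules.
SYMBOLS_BY_DATASET = {
    "toy_english": ["pad", " ", "a", "b", "c", "eos"],
    "toy_german": ["pad", " ", "a", "b", "c", "ä", "ö", "ü", "ß", "eos"],
}


def resolve_vocab_size(dataset):
    """Return the embedding-table size for a registered dataset name."""
    try:
        symbols = SYMBOLS_BY_DATASET[dataset]
    except KeyError:
        known = ", ".join(sorted(SYMBOLS_BY_DATASET))
        raise ValueError(
            "unknown dataset {!r}; registered: {}".format(dataset, known)
        ) from None
    # The size follows the symbol list, so the two cannot drift apart.
    return len(symbols)


print(resolve_vocab_size("toy_german"))
```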

I am using the mapper that was created in the dump folder by tensorflow-tts-preprocess and tensorflow-tts-normalize.
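
A quick sanity check on such a mapper file can be sketched with the standard library alone. The mapper posted above has 199 entries in `symbol_to_id` (IDs 0 through 198), which matches the checkpoint side of the error, and it appears to contain no ä, ö, ü or ß, so German text would lose characters. The function names below are illustrative, and the tiny `demo` mapper is a stand-in so the sketch runs without the dump folder.

```python
import json


def load_mapper(path):
    """Load a *_mapper.json file from the dump folder."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def vocab_size(mapper):
    """Number of entries in symbol_to_id, including "pad" and "eos"."""
    return len(mapper["symbol_to_id"])


def missing_symbols(text, mapper):
    """Characters of `text` that have no ID in the mapper, sorted."""
    table = mapper["symbol_to_id"]
    return sorted({ch for ch in text if ch not in table})


# Stand-in mapper; with real data use load_mapper("<path to mapper json>").
demo = json.loads('{"symbol_to_id": {"pad": 0, "a": 1, "u": 2, " ": 3, "eos": 4}}')
print(vocab_size(demo))
print(missing_symbols("a über", demo))
```

If `vocab_size` differs from the `vocab_size` the model config resolves to, loading the checkpoint would be expected to fail with a shape error like the one reported here.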

ErfolgreichCharismatisch commented 3 years ago

To make this easier, can you outline the process of using your own dataset (German in my case) in the ljspeech format

|- [NAME_DATASET]/
|   |- metadata.csv
|   |- wav/
|       |- file1.wav
|       |- ...

first for training and then for inference?

Or show a general approach and point out the differences in a concise and structured manner, either here or on the main site.

Would be highly appreciated.
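
Until such a guide exists, a first step could be to verify the folder layout and collect the characters that actually occur in the transcripts, since those determine the symbol set. The sketch below is an assumption-laden helper, not part of TensorFlowTTS: it assumes each `metadata.csv` line is pipe-separated with the file ID first and the transcript to train on last, and that audio lives in `wav/` as in the tree above.

```python
import os
import tempfile


def check_dataset(root, wav_dir="wav"):
    """Return (missing_wavs, characters) for an LJSpeech-style dataset folder."""
    missing, characters = [], set()
    with open(os.path.join(root, "metadata.csv"), encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\r\n")
            if not line:
                continue
            fields = line.split("|")
            file_id, text = fields[0], fields[-1]
            # Every metadata row should have a matching audio file.
            if not os.path.isfile(os.path.join(root, wav_dir, file_id + ".wav")):
                missing.append(file_id)
            characters.update(text)
    return missing, sorted(characters)


# Build a throwaway two-row dataset to demonstrate the check.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "wav"))
    open(os.path.join(root, "wav", "file1.wav"), "wb").close()
    with open(os.path.join(root, "metadata.csv"), "w", encoding="utf-8") as f:
        f.write("file1|Grüße|Grüße\nfile2|ab|ab\n")
    missing, characters = check_dataset(root)

print(missing)
print(characters)
```

The returned character list, plus whatever padding and end-of-sequence entries the processor adds, gives a starting point for a symbol set and its size.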

dathudeptrai commented 3 years ago

@ErfolgreichCharismatisch did you also add it here? (https://github.com/TensorSpeech/TensorFlowTTS/blob/master/tensorflow_tts/bin/preprocess.py#L347-L366). It seems the bug is caused by using the wrong config when training, which means you trained with the ljspeech symbols (149 characters).

ErfolgreichCharismatisch commented 3 years ago

@dathudeptrai Please refer to https://github.com/TensorSpeech/TensorFlowTTS/issues/382#issuecomment-733060825

dathudeptrai commented 3 years ago

@dathudeptrai Please refer to #382 (comment)

I will write a wiki page on training with a new dataset in a new language. :D

ErfolgreichCharismatisch commented 3 years ago

Great. Please reply here when done.

ErfolgreichCharismatisch commented 3 years ago

Just start with the wiki and leave it open for someone else to finish...

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.