Closed ErfolgreichCharismatisch closed 3 years ago
@ErfolgreichCharismatisch can you share your mapper.json? 149 is the number of LJSpeech characters, but it seems your model has 199 character/phoneme inputs. Did you change your config file? Please see here (https://github.com/TensorSpeech/TensorFlowTTS/blob/master/examples/tacotron2/conf/tacotron2.v1.yaml#L19). You should write your own processor and your own symbol set, then add your vocab_size here (https://github.com/TensorSpeech/TensorFlowTTS/blob/master/tensorflow_tts/configs/tacotron2.py#L59-L68).
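The symbol counts discussed here can be checked directly against a mapper file. The following is a minimal sketch, assuming the mapper layout posted in this thread (`symbol_to_id`, `id_to_symbol`, `processor_name`); the function name `summarize_mapper` and the tiny inline mapper are illustrative only and not part of TensorFlowTTS.

```python
import json


def summarize_mapper(mapper: dict) -> dict:
    """Return basic consistency facts about a mapper dictionary."""
    symbol_to_id = mapper["symbol_to_id"]
    id_to_symbol = mapper["id_to_symbol"]
    ids = sorted(symbol_to_id.values())
    return {
        # This is the number the model's vocab_size has to match.
        "num_symbols": len(symbol_to_id),
        # Ids should run 0..N-1 without gaps.
        "ids_contiguous": ids == list(range(len(ids))),
        # JSON object keys are strings, so id_to_symbol is keyed by "0", "1", ...
        "tables_agree": all(
            id_to_symbol.get(str(i)) == s for s, i in symbol_to_id.items()
        ),
        "processor_name": mapper.get("processor_name"),
    }


# Tiny stand-in for a real mapper file (hypothetical content, same layout).
example = json.loads(
    '{"symbol_to_id": {"pad": 0, "a": 1, "eos": 2},'
    ' "id_to_symbol": {"0": "pad", "1": "a", "2": "eos"},'
    ' "speakers_map": {"ljspeech": 0},'
    ' "processor_name": "LJSpeechProcessor"}'
)
summary = summarize_mapper(example)
print(summary)
```

For a real file, the dictionary could be loaded with `json.load(open(path, encoding="utf-8"))`. The mapper posted in this thread assigns ids 0 (`pad`) through 198 (`eos`), so the same check would report 199 symbols.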
ljspeech_mapper.json
{"symbol_to_id": {"pad": 0, "-": 1, "!": 2, "'": 3, "(": 4, ")": 5, ",": 6, ".": 7, ":": 8, ";": 9, "?": 10, " ": 11, "a": 12, "A": 13, "à": 14, "À": 15, "ả": 16, "Ả": 17, "ã": 18, "Ã": 19, "á": 20, "Á": 21, "ạ": 22, "Ạ": 23, "ă": 24, "Ă": 25, "ằ": 26, "Ằ": 27, "ẳ": 28, "Ẳ": 29, "ẵ": 30, "Ẵ": 31, "ắ": 32, "Ắ": 33, "ặ": 34, "Ặ": 35, "â": 36, "Â": 37, "ầ": 38, "Ầ": 39, "ẩ": 40, "Ẩ": 41, "ẫ": 42, "Ẫ": 43, "ấ": 44, "Ấ": 45, "ậ": 46, "Ậ": 47, "b": 48, "B": 49, "c": 50, "C": 51, "d": 52, "D": 53, "đ": 54, "Đ": 55, "e": 56, "E": 57, "è": 58, "È": 59, "ẻ": 60, "Ẻ": 61, "ẽ": 62, "Ẽ": 63, "é": 64, "É": 65, "ẹ": 66, "Ẹ": 67, "ê": 68, "Ê": 69, "ề": 70, "Ề": 71, "ể": 72, "Ể": 73, "ễ": 74, "Ễ": 75, "ế": 76, "Ế": 77, "ệ": 78, "Ệ": 79, "f": 80, "F": 81, "g": 82, "G": 83, "h": 84, "H": 85, "i": 86, "I": 87, "ì": 88, "Ì": 89, "ỉ": 90, "Ỉ": 91, "ĩ": 92, "Ĩ": 93, "í": 94, "Í": 95, "ị": 96, "Ị": 97, "j": 98, "J": 99, "k": 100, "K": 101, "l": 102, "L": 103, "m": 104, "M": 105, "n": 106, "N": 107, "o": 108, "O": 109, "ò": 110, "Ò": 111, "ỏ": 112, "Ỏ": 113, "õ": 114, "Õ": 115, "ó": 116, "Ó": 117, "ọ": 118, "Ọ": 119, "ô": 120, "Ô": 121, "ồ": 122, "Ồ": 123, "ổ": 124, "Ổ": 125, "ỗ": 126, "Ỗ": 127, "ố": 128, "Ố": 129, "ộ": 130, "Ộ": 131, "ơ": 132, "Ơ": 133, "ờ": 134, "Ờ": 135, "ở": 136, "Ở": 137, "ỡ": 138, "Ỡ": 139, "ớ": 140, "Ớ": 141, "ợ": 142, "Ợ": 143, "p": 144, "P": 145, "q": 146, "Q": 147, "r": 148, "R": 149, "s": 150, "S": 151, "t": 152, "T": 153, "u": 154, "U": 155, "ù": 156, "Ù": 157, "ủ": 158, "Ủ": 159, "ũ": 160, "Ũ": 161, "ú": 162, "Ú": 163, "ụ": 164, "Ụ": 165, "ư": 166, "Ư": 167, "ừ": 168, "Ừ": 169, "ử": 170, "Ử": 171, "ữ": 172, "Ữ": 173, "ứ": 174, "Ứ": 175, "ự": 176, "Ự": 177, "v": 178, "V": 179, "w": 180, "W": 181, "x": 182, "X": 183, "y": 184, "Y": 185, "ỳ": 186, "Ỳ": 187, "ỷ": 188, "Ỷ": 189, "ỹ": 190, "Ỹ": 191, "ý": 192, "Ý": 193, "ỵ": 194, "Ỵ": 195, "z": 196, "Z": 197, "eos": 198}, "id_to_symbol": {"0": "pad", "1": "-", "2": "!", "3": "'", "4": "(", "5": ")", "6": ",", "7": 
".", "8": ":", "9": ";", "10": "?", "11": " ", "12": "a", "13": "A", "14": "à", "15": "À", "16": "ả", "17": "Ả", "18": "ã", "19": "Ã", "20": "á", "21": "Á", "22": "ạ", "23": "Ạ", "24": "ă", "25": "Ă", "26": "ằ", "27": "Ằ", "28": "ẳ", "29": "Ẳ", "30": "ẵ", "31": "Ẵ", "32": "ắ", "33": "Ắ", "34": "ặ", "35": "Ặ", "36": "â", "37": "Â", "38": "ầ", "39": "Ầ", "40": "ẩ", "41": "Ẩ", "42": "ẫ", "43": "Ẫ", "44": "ấ", "45": "Ấ", "46": "ậ", "47": "Ậ", "48": "b", "49": "B", "50": "c", "51": "C", "52": "d", "53": "D", "54": "đ", "55": "Đ", "56": "e", "57": "E", "58": "è", "59": "È", "60": "ẻ", "61": "Ẻ", "62": "ẽ", "63": "Ẽ", "64": "é", "65": "É", "66": "ẹ", "67": "Ẹ", "68": "ê", "69": "Ê", "70": "ề", "71": "Ề", "72": "ể", "73": "Ể", "74": "ễ", "75": "Ễ", "76": "ế", "77": "Ế", "78": "ệ", "79": "Ệ", "80": "f", "81": "F", "82": "g", "83": "G", "84": "h", "85": "H", "86": "i", "87": "I", "88": "ì", "89": "Ì", "90": "ỉ", "91": "Ỉ", "92": "ĩ", "93": "Ĩ", "94": "í", "95": "Í", "96": "ị", "97": "Ị", "98": "j", "99": "J", "100": "k", "101": "K", "102": "l", "103": "L", "104": "m", "105": "M", "106": "n", "107": "N", "108": "o", "109": "O", "110": "ò", "111": "Ò", "112": "ỏ", "113": "Ỏ", "114": "õ", "115": "Õ", "116": "ó", "117": "Ó", "118": "ọ", "119": "Ọ", "120": "ô", "121": "Ô", "122": "ồ", "123": "Ồ", "124": "ổ", "125": "Ổ", "126": "ỗ", "127": "Ỗ", "128": "ố", "129": "Ố", "130": "ộ", "131": "Ộ", "132": "ơ", "133": "Ơ", "134": "ờ", "135": "Ờ", "136": "ở", "137": "Ở", "138": "ỡ", "139": "Ỡ", "140": "ớ", "141": "Ớ", "142": "ợ", "143": "Ợ", "144": "p", "145": "P", "146": "q", "147": "Q", "148": "r", "149": "R", "150": "s", "151": "S", "152": "t", "153": "T", "154": "u", "155": "U", "156": "ù", "157": "Ù", "158": "ủ", "159": "Ủ", "160": "ũ", "161": "Ũ", "162": "ú", "163": "Ú", "164": "ụ", "165": "Ụ", "166": "ư", "167": "Ư", "168": "ừ", "169": "Ừ", "170": "ử", "171": "Ử", "172": "ữ", "173": "Ữ", "174": "ứ", "175": "Ứ", "176": "ự", "177": "Ự", "178": "v", "179": "V", "180": "w", "181": "W", 
"182": "x", "183": "X", "184": "y", "185": "Y", "186": "ỳ", "187": "Ỳ", "188": "ỷ", "189": "Ỷ", "190": "ỹ", "191": "Ỹ", "192": "ý", "193": "Ý", "194": "ỵ", "195": "Ỵ", "196": "z", "197": "Z", "198": "eos"}, "speakers_map": {"ljspeech": 0}, "processor_name": "LJSpeechProcessor"}
You should write your own processor and your own symbols
How?
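As a rough illustration of what "your own symbols" could mean for German text, here is a minimal sketch. It is not the TensorFlowTTS processor API: the names `GERMAN_SYMBOLS`, `SYMBOL_TO_ID` and `text_to_sequence` are hypothetical, the character inventory is an assumption, and a real processor would also need text cleaning and registration in the preprocessing script, which are not shown.

```python
# Hypothetical symbol inventory for German; adjust to the characters
# that actually occur in your transcripts.
PAD = "pad"
EOS = "eos"
PUNCTUATION = list("!'(),.:;? -")
LOWERCASE = list("abcdefghijklmnopqrstuvwxyzäöüß")
# "ß".upper() is "SS", so it is left out of the uppercase list.
UPPERCASE = [c.upper() for c in LOWERCASE if c != "ß"]

GERMAN_SYMBOLS = [PAD] + PUNCTUATION + LOWERCASE + UPPERCASE + [EOS]
SYMBOL_TO_ID = {s: i for i, s in enumerate(GERMAN_SYMBOLS)}
ID_TO_SYMBOL = {i: s for s, i in SYMBOL_TO_ID.items()}


def text_to_sequence(text: str) -> list:
    """Map each known character to its id, skip unknown ones, append eos."""
    ids = [SYMBOL_TO_ID[ch] for ch in text if ch in SYMBOL_TO_ID]
    ids.append(SYMBOL_TO_ID[EOS])
    return ids


print(len(GERMAN_SYMBOLS))           # vocabulary size the model must be built with
print(text_to_sequence("Grüß Gott!"))
```

Whatever inventory is chosen, the length of the symbol list is the value that the model's `vocab_size` has to equal, both at training and at inference time.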
What config did you use for training, and what language is your dataset?
language: German
config:
difference between mine and the original one
- 19:2: dataset: ljspeech
+ 19:2: dataset: myTrainingDataset
- 50:2: batch_size: 32 # Batch size for each GPU with assuming that gradient_accumulation_steps == 1.
+ 50:2: batch_size: 2 # Batch size.
- 55:2: use_fixed_shapes: true # use_fixed_shapes for training (2x speed-up)
+ 55:2: use_fixed_shapes: false # use_fixed_shapes for training (2x speed-up)
- 68:2: gradient_accumulation_steps: 1
- 76:2: save_interval_steps: 2000 # Interval steps to save checkpoint.
+ 76:2: save_interval_steps: 200 # Interval steps to save checkpoint.
- 87:2:
For reference
# This is the hyperparameter configuration file for Tacotron2 v1.
# Please make sure this is adjusted for the LJSpeech dataset. If you want to
# apply to the other dataset, you might need to carefully change some parameters.
# This configuration performs 200k iters but 65k iters is enough to get a good models.
###########################################################
# FEATURE EXTRACTION SETTING #
###########################################################
hop_size: 256 # Hop size.
format: "npy"
###########################################################
# NETWORK ARCHITECTURE SETTING #
###########################################################
model_type: "tacotron2"
tacotron2_params:
    dataset: myTrainingDataset
    embedding_hidden_size: 512
    initializer_range: 0.02
    embedding_dropout_prob: 0.1
    n_speakers: 1
    n_conv_encoder: 5
    encoder_conv_filters: 512
    encoder_conv_kernel_sizes: 5
    encoder_conv_activation: 'relu'
    encoder_conv_dropout_rate: 0.5
    encoder_lstm_units: 256
    n_prenet_layers: 2
    prenet_units: 256
    prenet_activation: 'relu'
    prenet_dropout_rate: 0.5
    n_lstm_decoder: 1
    reduction_factor: 1
    decoder_lstm_units: 1024
    attention_dim: 128
    attention_filters: 32
    attention_kernel: 31
    n_mels: 80
    n_conv_postnet: 5
    postnet_conv_filters: 512
    postnet_conv_kernel_sizes: 5
    postnet_dropout_rate: 0.1
    attention_type: "lsa"
###########################################################
# DATA LOADER SETTING #
###########################################################
batch_size: 2 # Batch size.
remove_short_samples: true # Whether to remove samples the length of which are less than batch_max_steps.
allow_cache: true # Whether to allow cache in dataset. If true, it requires cpu memory.
mel_length_threshold: 32 # remove all targets has mel_length <= 32
is_shuffle: true # shuffle dataset after each epoch.
use_fixed_shapes: false # use_fixed_shapes for training (2x speed-up)
# refer (https://github.com/dathudeptrai/TensorflowTTS/issues/34#issuecomment-642309118)
###########################################################
# OPTIMIZER & SCHEDULER SETTING #
###########################################################
optimizer_params:
    initial_learning_rate: 0.001
    end_learning_rate: 0.00001
    decay_steps: 150000 # < train_max_steps is recommend.
    warmup_proportion: 0.02
    weight_decay: 0.001
var_train_expr: null # trainable variable expr (eg. 'embeddings|decoder_cell' )
                     # must separate by |. if var_train_expr is null then we
                     # training all variables.
###########################################################
# INTERVAL SETTING #
###########################################################
train_max_steps: 200000 # Number of training steps.
save_interval_steps: 200 # Interval steps to save checkpoint.
eval_interval_steps: 500 # Interval steps to evaluate the network.
log_interval_steps: 200 # Interval steps to record the training log.
start_schedule_teacher_forcing: 200001 # don't need to apply schedule teacher forcing.
start_ratio_value: 0.5 # start ratio of scheduled teacher forcing.
schedule_decay_steps: 50000 # decay step scheduled teacher forcing.
end_ratio_value: 0.0 # end ratio of scheduled teacher forcing.
###########################################################
# OTHER SETTING #
###########################################################
num_save_intermediate_results: 1 # Number of results to be saved as intermediate results.
myTrainingDataset
I saw you use tacotron2.v1.yaml in your training script. Your dataset is German, so why do you use the ljspeech mapper? Did you add your dataset here (https://github.com/TensorSpeech/TensorFlowTTS/blob/master/tensorflow_tts/configs/tacotron2.py#L59-L68)?
Yes, I did:
elif dataset == "myTrainingDataset":
    self.vocab_size = 199
I am using the mapper that was being created in the dump folder after tensorflow-tts-preprocess and tensorflow-tts-normalize.
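One way to avoid a hard-coded number such as 199 drifting out of sync with the mapper is to derive the vocabulary size from the symbol list itself. The sketch below only illustrates that idea; `Tacotron2ConfigSketch` and `SYMBOL_TABLES` are hypothetical names, the symbol list is a placeholder, and none of this is the library's actual config class.

```python
# Hypothetical registry: dataset name -> list of symbols used by its processor.
SYMBOL_TABLES = {
    "myTrainingDataset": ["pad", "a", "b", "c", "eos"],  # placeholder symbols
}


class Tacotron2ConfigSketch:
    """Illustrative stand-in for a model config that picks vocab_size by dataset."""

    def __init__(self, dataset: str, symbol_tables: dict):
        if dataset not in symbol_tables:
            # Failing loudly is safer than silently falling back to another
            # dataset's symbol count.
            raise ValueError(f"no symbol table registered for dataset {dataset!r}")
        self.dataset = dataset
        # vocab_size always equals the number of symbols, so it cannot go stale.
        self.vocab_size = len(symbol_tables[dataset])


config = Tacotron2ConfigSketch("myTrainingDataset", SYMBOL_TABLES)
print(config.dataset, config.vocab_size)
```

The point of the design is that the same symbol list feeds both the mapper and the model size, so training and inference cannot disagree about it.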
To make this easier, can you outline the process of using your own dataset (German in my case) in the ljspeech format (
|- [NAME_DATASET]/
|   |- metadata.csv
|   |- wav/
|       |- file1.wav
|       |- ...
) to first train a model and then run inference with it?
Or show a general approach and point out the differences in a concise and structured manner, either here or on the main site.
Would be highly appreciated.
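A reasonable first step with a dataset in this layout is to find out which characters the transcripts actually contain, since that determines the symbol set. Below is a minimal sketch, assuming `metadata.csv` follows the LJSpeech convention of pipe-separated `id|text|normalized text` rows; the function name `collect_characters` and the sample rows are made up for illustration.

```python
import csv
import io
from collections import Counter


def collect_characters(metadata_file) -> Counter:
    """Count every character in the last column (the normalized transcript)."""
    counts = Counter()
    # QUOTE_NONE keeps quotation marks inside transcripts as ordinary characters.
    reader = csv.reader(metadata_file, delimiter="|", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 2:
            continue  # skip blank or malformed lines
        counts.update(row[-1])
    return counts


# Stand-in for open("metadata.csv", encoding="utf-8"); rows are invented.
sample = io.StringIO(
    "file1|Grüß Gott!|Grüß Gott!\n"
    "file2|Schöne Straße.|Schöne Straße.\n"
)
counts = collect_characters(sample)
symbols = sorted(counts)
print(symbols)
```

Any character that shows up here but is missing from the symbol set would be dropped or rejected during preprocessing, so this list is a useful cross-check before training.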
@ErfolgreichCharismatisch did you also add it here (https://github.com/TensorSpeech/TensorFlowTTS/blob/master/tensorflow_tts/bin/preprocess.py#L347-L366)? It seems the bug is caused by using the wrong config when training, which means you are training with the ljspeech symbols (149 characters).
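The mismatch described here can be made concrete. Assuming the model was built with 149 embedding rows, as suggested above, it can only look up ids 0 through 148, while the mapper posted in this thread assigns ids up to 198. The helper `out_of_range_ids` below is a hypothetical illustration, not a library function.

```python
def out_of_range_ids(input_ids, vocab_size):
    """Return the ids that an embedding table with vocab_size rows cannot look up."""
    return [i for i in input_ids if not 0 <= i < vocab_size]


# Ids taken from the mapper posted in this thread: "T" -> 153, "a" -> 12,
# "i" -> 86, "eos" -> 198.
ids = [153, 12, 86, 198]

bad_for_149 = out_of_range_ids(ids, 149)
bad_for_199 = out_of_range_ids(ids, 199)
print(bad_for_149)  # ids that do not fit a 149-symbol model
print(bad_for_199)  # ids that do not fit a 199-symbol model
```

Running a check like this on the encoded input before inference shows quickly whether the model's vocabulary size and the mapper agree.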
@dathudeptrai Please refer to https://github.com/TensorSpeech/TensorFlowTTS/issues/382#issuecomment-733060825
I will write a wiki page on training with a new dataset in a new language. :D
Great. Please reply here when done.
Just start with the wiki and leave it open for someone else to finish...
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
I am using batch size 2 in tacotron2.v1.yaml with a custom 22050 Hz, mono, 10h total dataset, myDataset. I prepared it and normalized it. I started
python examples/tacotron2/train_tacotron2.py --train-dir ./dump_myDataset/train/ --dev-dir ./dump_myDataset/valid/ --outdir ./examples/tacotron2/exp/train.tacotron2.v1/ --config ./examples/tacotron2/conf/tacotron2.v1.yaml --use-norm 1 --mixed_precision 0 --resume ""
and waited for a few checkpoints.
I am trying to use the standard Python script for inference with my checkpoint model-2073.h5 and the pretrained melgan model. I am using the ljspeech_mapper.json from the dump folder of my model. I get
Ideas?