Closed junaedifahmi closed 3 years ago
When you prepared your dataset did you fix the frame differences?
See this readme. The specific script is at the bottom, fix_mismatch.py
Looks like the error is in the encoder. I believe I had this problem when I used incompatible versions of ljspeech_mapper.json for data preparation and for training. In my case it was the result of having both a pip-installed version and an updated git source version on the same computer.
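One way to test for the mapper mismatch described above is to diff the two JSON files directly. This is only a sketch: the two file paths are whatever your pip install and your git checkout actually use, and the helper names `diff_mappings` and `diff_mapper_files` are made up for illustration, not part of TensorFlowTTS. It assumes nothing about the mapper's keys and just compares nested dictionaries.

```python
import json


def diff_mappings(a, b, path=""):
    """Return human-readable differences between two nested dicts."""
    diffs = []
    for key in sorted(set(a) | set(b), key=str):
        where = f"{path}/{key}"
        if key not in a:
            diffs.append(f"{where}: only in second")
        elif key not in b:
            diffs.append(f"{where}: only in first")
        elif isinstance(a[key], dict) and isinstance(b[key], dict):
            # Recurse into nested mappings (e.g. a symbol table).
            diffs.extend(diff_mappings(a[key], b[key], where))
        elif a[key] != b[key]:
            diffs.append(f"{where}: {a[key]!r} != {b[key]!r}")
    return diffs


def diff_mapper_files(first_path, second_path):
    """Load two mapper JSON files (paths supplied by you) and diff them."""
    with open(first_path, encoding="utf-8") as f1, \
            open(second_path, encoding="utf-8") as f2:
        return diff_mappings(json.load(f1), json.load(f2))
```

An empty list means the two files agree; any entry points at a symbol or id that differs between the mapper used for preprocessing and the one used for training.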
When you prepared your dataset did you fix the frame differences?
See this readme. The specific script is at the bottom,
fix_mismatch.py
I did it and I think it works fine. This is the output that I got from doing that
root@aa1de9a7d6a4:/workspace# python examples/mfa_extraction/fix_mismatch.py --base_path ./ljspeech/ --trimmed_dur_path ./LJSpeech_ln/trimmed-durations --dur_path LJSpeech_ln/durations/
2020-10-20 07:55:28,464 (fix_mismatch:46) INFO: FIXING train set ...
100%|█████████████████████████████████████████████████████| 22202/22202 [00:29<00:00, 742.88it/s]
2020-10-20 07:55:58,377 (fix_mismatch:107) INFO: train stats: number of mfa with longer duration: 11731, total diff: 36139, mean diff: 3.0806410365697725
2020-10-20 07:55:58,380 (fix_mismatch:111) INFO: train stats: number of mfa with shorter duration: 6899, total diff: 12333, mean diff: 1.7876503841136397
2020-10-20 07:55:58,380 (fix_mismatch:115) INFO: train stats: number of files with a ''big'' duration diff: 0 if number>1 you should check it
2020-10-20 07:55:58,380 (fix_mismatch:117) INFO: train stats: not fixed len: 0
2020-10-20 07:55:58,382 (fix_mismatch:46) INFO: FIXING valid set ...
100%|███████████████████████████████████████████████████████| 1169/1169 [00:09<00:00, 129.21it/s]
2020-10-20 07:56:07,433 (fix_mismatch:107) INFO: valid stats: number of mfa with longer duration: 613, total diff: 1980, mean diff: 3.230016313213703
2020-10-20 07:56:07,433 (fix_mismatch:111) INFO: valid stats: number of mfa with shorter duration: 361, total diff: 669, mean diff: 1.853185595567867
2020-10-20 07:56:07,433 (fix_mismatch:115) INFO: valid stats: number of files with a ''big'' duration diff: 0 if number>1 you should check it
2020-10-20 07:56:07,434 (fix_mismatch:117) INFO: valid stats: not fixed len: 0
But the problem is still the same.
@juunnn Hi, I have the same problem as yours, have you found a solution?
The error says that the length of the character/phoneme input differs from the length of the f0/energy embeddings. You should print the shapes and check this line (https://github.com/TensorSpeech/TensorFlowTTS/blob/master/tensorflow_tts/models/fastspeech2.py#L185). To debug, add tf.config.run_functions_eagerly(True) at the top of the training file to enable eager mode.
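Before stepping through the model, it may be quicker to check the lengths of the dumped features for each utterance. The sketch below is not part of TensorFlowTTS; `check_utterance` is a hypothetical helper, and it assumes that there is one duration per input id, that the durations sum to the number of mel frames, and that f0 and energy are stored per frame. If your dump stores f0/energy per phoneme instead, compare them against `len(ids)`.

```python
def check_utterance(ids, durations, f0, energy, mel_frames):
    """Return a list of length mismatches for one utterance.

    Assumed invariants (adjust to your own preprocessing):
      * one duration per input id
      * sum(durations) == number of mel frames
      * f0 and energy have one value per mel frame
    """
    problems = []
    if len(durations) != len(ids):
        problems.append(
            f"durations has {len(durations)} entries but ids has {len(ids)}"
        )
    total = sum(durations)
    if total != mel_frames:
        problems.append(f"durations sum to {total} but mel has {mel_frames} frames")
    if len(f0) != mel_frames:
        problems.append(f"f0 has {len(f0)} values but mel has {mel_frames} frames")
    if len(energy) != mel_frames:
        problems.append(
            f"energy has {len(energy)} values but mel has {mel_frames} frames"
        )
    return problems
```

Running this over every utterance in the train and valid dumps and printing the non-empty results should narrow the shape error down to specific files.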
@juunnn Hi, I have the same problem as yours, have you found a solution?
I still don't have a clue how to make MFA extraction work, but I was able to move forward by training Tacotron2 first and extracting the durations from it. Surprisingly, my Tacotron2 model produces good audio after only 25 epochs.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
I have the same issue:
2020-12-28 11:03:06,585 (train_fastspeech2:299) INFO: hop_size = 256
2020-12-28 11:03:06,585 (train_fastspeech2:299) INFO: format = npy
2020-12-28 11:03:06,585 (train_fastspeech2:299) INFO: model_type = fastspeech2
2020-12-28 11:03:06,586 (train_fastspeech2:299) INFO: fastspeech2_params = {'n_speakers': 1, 'encoder_hidden_size': 384, 'encoder_num_hidden_layers': 4, 'encoder_num_attention_heads': 2, 'encoder_attention_head_size': 192, 'encoder_intermediate_size': 1024, 'encoder_intermediate_kernel_size': 3, 'encoder_hidden_act': 'mish', 'decoder_hidden_size': 384, 'decoder_num_hidden_layers': 4, 'decoder_num_attention_heads': 2, 'decoder_attention_head_size': 192, 'decoder_intermediate_size': 1024, 'decoder_intermediate_kernel_size': 3, 'decoder_hidden_act': 'mish', 'variant_prediction_num_conv_layers': 2, 'variant_predictor_filter': 256, 'variant_predictor_kernel_size': 3, 'variant_predictor_dropout_rate': 0.5, 'num_mels': 80, 'hidden_dropout_prob': 0.2, 'attention_probs_dropout_prob': 0.1, 'max_position_embeddings': 2048, 'initializer_range': 0.02, 'output_attentions': False, 'output_hidden_states': False}
2020-12-28 11:03:06,586 (train_fastspeech2:299) INFO: batch_size = 16
2020-12-28 11:03:06,586 (train_fastspeech2:299) INFO: remove_short_samples = True
2020-12-28 11:03:06,586 (train_fastspeech2:299) INFO: allow_cache = True
2020-12-28 11:03:06,586 (train_fastspeech2:299) INFO: mel_length_threshold = 32
2020-12-28 11:03:06,586 (train_fastspeech2:299) INFO: is_shuffle = True
2020-12-28 11:03:06,586 (train_fastspeech2:299) INFO: optimizer_params = {'initial_learning_rate': 0.001, 'end_learning_rate': 5e-05, 'decay_steps': 150000, 'warmup_proportion': 0.02, 'weight_decay': 0.001}
2020-12-28 11:03:06,586 (train_fastspeech2:299) INFO: gradient_accumulation_steps = 1
2020-12-28 11:03:06,586 (train_fastspeech2:299) INFO: var_train_expr = None
2020-12-28 11:03:06,586 (train_fastspeech2:299) INFO: train_max_steps = 200000
2020-12-28 11:03:06,586 (train_fastspeech2:299) INFO: save_interval_steps = 5000
2020-12-28 11:03:06,586 (train_fastspeech2:299) INFO: eval_interval_steps = 500
2020-12-28 11:03:06,586 (train_fastspeech2:299) INFO: log_interval_steps = 200
2020-12-28 11:03:06,586 (train_fastspeech2:299) INFO: num_save_intermediate_results = 1
2020-12-28 11:03:06,586 (train_fastspeech2:299) INFO: train_dir = ./dump_ljspeech/train/
2020-12-28 11:03:06,586 (train_fastspeech2:299) INFO: dev_dir = ./dump_ljspeech/valid/
2020-12-28 11:03:06,586 (train_fastspeech2:299) INFO: use_norm = True
2020-12-28 11:03:06,586 (train_fastspeech2:299) INFO: f0_stat = ./dump_ljspeech/stats_f0.npy
2020-12-28 11:03:06,586 (train_fastspeech2:299) INFO: energy_stat = ./dump_ljspeech/stats_energy.npy
2020-12-28 11:03:06,586 (train_fastspeech2:299) INFO: outdir = ./examples/fastspeech2/exp/train.fastspeech2.v1/
2020-12-28 11:03:06,586 (train_fastspeech2:299) INFO: config = ./examples/fastspeech2/conf/fastspeech2.v1.yaml
2020-12-28 11:03:06,586 (train_fastspeech2:299) INFO: resume =
2020-12-28 11:03:06,586 (train_fastspeech2:299) INFO: verbose = 1
2020-12-28 11:03:06,586 (train_fastspeech2:299) INFO: mixed_precision = True
2020-12-28 11:03:06,586 (train_fastspeech2:299) INFO: pretrained =
2020-12-28 11:03:06,586 (train_fastspeech2:299) INFO: version = 0.0
2020-12-28 11:03:07,557 (api:603) INFO: charactor Tensor("PyFunc_1:0", dtype=int32, device=/job:localhost/replica:0/task:0)
2020-12-28 11:03:08,094 (api:603) INFO: duration Tensor("PyFunc_2:0", dtype=int32, device=/job:localhost/replica:0/task:0)
2020-12-28 11:03:08,098 (api:603) INFO: f0 Tensor("PyFunc_7:0", dtype=float32, device=/job:localhost/replica:0/task:0)
2020-12-28 11:03:08,103 (api:603) INFO: energy Tensor("PyFunc_8:0", dtype=float32, device=/job:localhost/replica:0/task:0)
2020-12-28 11:03:08,108 (api:603) INFO: mel Tensor("PyFunc:0", dtype=float32, device=/job:localhost/replica:0/task:0)
2020-12-28 11:03:08,183 (api:603) INFO: charactor Tensor("PyFunc_1:0", dtype=int32, device=/job:localhost/replica:0/task:0)
2020-12-28 11:03:08,188 (api:603) INFO: duration Tensor("PyFunc_2:0", dtype=int32, device=/job:localhost/replica:0/task:0)
2020-12-28 11:03:08,193 (api:603) INFO: f0 Tensor("PyFunc_7:0", dtype=float32, device=/job:localhost/replica:0/task:0)
2020-12-28 11:03:08,198 (api:603) INFO: energy Tensor("PyFunc_8:0", dtype=float32, device=/job:localhost/replica:0/task:0)
2020-12-28 11:03:08,202 (api:603) INFO: mel Tensor("PyFunc:0", dtype=float32, device=/job:localhost/replica:0/task:0)
2020-12-28 11:03:10.603306: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
2020-12-28 11:03:10.735770: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7
f0_embedding (1, 10, 384)
energey_embedding (1, 10, 384)
Model: "tf_fast_speech2"
embeddings (TFFastSpeechEmbe multiple 844032
encoder (TFFastSpeechEncoder multiple 11814400
length_regulator (TFFastSpee multiple 0
decoder (TFFastSpeechDecoder multiple 12601216
mel_before (Dense) multiple 30800
postnet (TFTacotronPostnet) multiple 4352400
f0_predictor (TFFastSpeechVa multiple 493313
energy_predictor (TFFastSpee multiple 493313
duration_predictor (TFFastSp multiple 493313
f0_embeddings (Conv1D) multiple 3840
dropout_32 (Dropout) multiple 0
energy_embeddings (Conv1D) multiple 3840
Total params: 31,130,467
Trainable params: 29,552,579
Non-trainable params: 1,577,888
train: 0%| | 0/200000 [00:00<?, ?it/s]
2020-12-28 11:03:11.788850: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1345] No whitelist ops found, nothing to do
2020-12-28 11:03:11.793360: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1345] No whitelist ops found, nothing to do
2020-12-28 11:03:21.779227: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:172] Filling up shuffle buffer (this may take a while): 508 of 12445
2020-12-28 11:03:31.782370: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:172] Filling up shuffle buffer (this may take a while): 983 of 12445
2020-12-28 11:03:41.807584: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:172] Filling up shuffle buffer (this may take a while): 1600 of 12445
2020-12-28 11:03:51.757672: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:172] Filling up shuffle buffer (this may take a while): 2320 of 12445
2020-12-28 11:04:01.761795: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:172] Filling up shuffle buffer (this may take a while): 2983 of 12445
2020-12-28 11:04:11.825262: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:172] Filling up shuffle buffer (this may take a while): 3440 of 12445
2020-12-28 11:04:21.895227: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:172] Filling up shuffle buffer (this may take a while): 3927 of 12445
2020-12-28 11:04:31.810633: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:172] Filling up shuffle buffer (this may take a while): 4470 of 12445
2020-12-28 11:04:41.767607: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:172] Filling up shuffle buffer (this may take a while): 5108 of 12445
2020-12-28 11:04:51.810037: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:172] Filling up shuffle buffer (this may take a while): 5574 of 12445
2020-12-28 11:05:01.835467: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:172] Filling up shuffle buffer (this may take a while): 6060 of 12445
2020-12-28 11:05:11.777472: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:172] Filling up shuffle buffer (this may take a while): 6557 of 12445
2020-12-28 11:05:21.782114: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:172] Filling up shuffle buffer (this may take a while): 7036 of 12445
2020-12-28 11:05:31.872478: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:172] Filling up shuffle buffer (this may take a while): 7485 of 12445
2020-12-28 11:05:41.853084: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:172] Filling up shuffle buffer (this may take a while): 7938 of 12445
2020-12-28 11:05:51.867180: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:172] Filling up shuffle buffer (this may take a while): 8709 of 12445
2020-12-28 11:06:01.797630: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:172] Filling up shuffle buffer (this may take a while): 9654 of 12445
2020-12-28 11:06:11.759848: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:172] Filling up shuffle buffer (this may take a while): 10172 of 12445
2020-12-28 11:06:21.800689: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:172] Filling up shuffle buffer (this may take a while): 10910 of 12445
2020-12-28 11:06:31.767144: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:172] Filling up shuffle buffer (this may take a while): 11603 of 12445
2020-12-28 11:06:41.762450: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:172] Filling up shuffle buffer (this may take a while): 12247 of 12445
2020-12-28 11:06:44.438570: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:221] Shuffle buffer filled.
f0_embedding (16, None, 384)
energey_embedding (16, None, 384)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/indexed_slices.py:433: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
"Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
f0_embedding (16, None, 384)
energey_embedding (16, None, 384)
2020-12-28 11:06:57.440518: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1924] Converted 1157/10710 nodes to float16 precision using 113 cast(s) to float16 (excluding Const and Variable casts)
2020-12-28 11:06:59.001070: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1924] Converted 0/8830 nodes to float16 precision using 0 cast(s) to float16 (excluding Const and Variable casts)
Traceback (most recent call last):
File "examples/fastspeech2/train_fastspeech2.py", line 416, in
Errors may have originated from an input operation. Input Source operations connected to node tf_fast_speech2/add_1: tf_fastspeech2/encoder/layer._3/mul (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_tts/models/fastspeech.py:411)
Input Source operations connected to node tf_fast_speech2/add_1: tf_fastspeech2/encoder/layer._3/mul (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_tts/models/fastspeech.py:411)
Function call stack: _one_step_forward -> _one_step_forward
Let me try to reproduce the bug. I will run everything from A to Z following the repo's README.
Hi, first of all, I want to thank you for providing this great framework; it has really helped me. Second, I want to ask about a problem I ran into while trying to train fastspeech2. I want to train it on my own dataset, which is in ljspeech format; the only difference is the sample rate, which is 16 kHz instead of 22.05 kHz. I did the preprocessing with that sample rate, extracted durations using MFA (because I don't have a tacotron2 model), and trained fastspeech2 with the default hyperparameters (fastspeech2.v1.yaml). I followed all the steps in the mfa_extraction example and nothing went wrong there: no errors or warnings. But when I try to start training the fastspeech2 model, it does not start and reports incompatible shapes. I tried updating the TensorFlowTTS package, which is now version 0.9, but the error is still the same. This is the complete trace of the error.
Can anyone help me find my mistake? I would really appreciate any feedback. Thank you. The TensorFlow version I use is 2.3 and the TensorFlowTTS version is 0.9.
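Since the only change from the stock recipe is the sample rate, it may help to see how much the frame count of a duration depends on it. The snippet below is a rough illustration, not the conversion TensorFlowTTS or MFA actually performs: `seconds_to_frames` is a hypothetical helper, the rounding rule is an assumption, and hop_size = 256 is taken from the training log posted in this thread.

```python
def seconds_to_frames(seconds, sample_rate, hop_size):
    """Convert a duration in seconds to a number of mel frames.

    Assumes frames = seconds * sample_rate / hop_size, rounded to the
    nearest integer; the real pipeline may round differently.
    """
    return int(round(seconds * sample_rate / hop_size))


# The same 0.5 s phoneme at the two sample rates, with hop_size = 256.
frames_16k = seconds_to_frames(0.5, 16000, 256)
frames_22k = seconds_to_frames(0.5, 22050, 256)
print(frames_16k, frames_22k)
```

If durations were converted with one sample rate or hop size while the mel, f0, and energy features were computed with another, the lengths would disagree by roughly this ratio, so it is worth confirming that every preprocessing step used the same values.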