TensorSpeech / TensorFlowTTS

TensorFlowTTS: Real-Time State-of-the-art Speech Synthesis for TensorFlow 2 (supports English, French, Korean, Chinese, and German, and is easy to adapt to other languages)
https://tensorspeech.github.io/TensorFlowTTS/
Apache License 2.0

Incompatible shapes when training Fastspeech2 #310

Closed junaedifahmi closed 3 years ago

junaedifahmi commented 4 years ago

Hi, first of all, thank you for providing this great framework; it really helps me. Second, I want to ask about a problem I encountered while trying to train FastSpeech2. I am using my own dataset in LJSpeech format; the only difference is the sample rate, which is 16 kHz instead of 22.05 kHz. I ran the preprocessing with that sample rate, extracted durations with MFA (because I don't have a Tacotron2 model), and trained FastSpeech2 with the default hyperparameters (fastspeech2.v1.yaml). I followed all the steps in the mfa_extraction example and nothing went wrong there, no errors or warnings. But when I try to start training the FastSpeech2 model, it fails immediately with an error about incompatible shapes. I updated the TensorFlowTTS package to version 0.9, but the error is still the same. This is the complete trace of the error.

CUDA_VISIBLE_DEVICES=0 python examples/fastspeech2/train_fastspeech2.py --train-dir ljspeech/train --dev-dir ljspeech/valid --outdir examples/fastspeech2/exp/train.fastspeech2.v1 --config examples/fastspeech2/conf/fastspeech2.v1.yaml --use-norm 0 --f0-stat ljspeech/stats_f0.npy --energy-stat ljspeech/stats_energy.npy --mixed_precision 1 --resume ""
2020-10-16 07:41:06.219401: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2020-10-16 07:41:07.269097: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1
2020-10-16 07:41:12.896881: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties: 
pciBusID: 0000:1a:00.0 name: GeForce GTX 1080 Ti computeCapability: 6.1
coreClock: 1.6705GHz coreCount: 28 deviceMemorySize: 10.92GiB deviceMemoryBandwidth: 451.17GiB/s
2020-10-16 07:41:12.896987: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2020-10-16 07:41:12.902187: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
2020-10-16 07:41:12.906126: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10
2020-10-16 07:41:12.906831: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10
2020-10-16 07:41:12.909965: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10
2020-10-16 07:41:12.911599: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.10
2020-10-16 07:41:12.917678: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7
2020-10-16 07:41:12.919479: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0
2020-10-16 07:41:15.032267: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN)to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2020-10-16 07:41:15.058653: I tensorflow/core/platform/profile_utils/cpu_utils.cc:104] CPU Frequency: 2100000000 Hz
2020-10-16 07:41:15.070659: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5399b90 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-10-16 07:41:15.070706: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-10-16 07:41:15.204399: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5318100 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-10-16 07:41:15.204480: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): GeForce GTX 1080 Ti, Compute Capability 6.1
2020-10-16 07:41:15.205672: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties: 
pciBusID: 0000:1a:00.0 name: GeForce GTX 1080 Ti computeCapability: 6.1
coreClock: 1.6705GHz coreCount: 28 deviceMemorySize: 10.92GiB deviceMemoryBandwidth: 451.17GiB/s
2020-10-16 07:41:15.205717: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2020-10-16 07:41:15.205750: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
2020-10-16 07:41:15.205773: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10
2020-10-16 07:41:15.205795: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10
2020-10-16 07:41:15.205814: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10
2020-10-16 07:41:15.205835: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.10
2020-10-16 07:41:15.205856: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7
2020-10-16 07:41:15.207521: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0
2020-10-16 07:41:15.207608: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2020-10-16 07:41:15.647428: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1257] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-10-16 07:41:15.647468: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1263]      0 
2020-10-16 07:41:15.647474: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1276] 0:   N 
2020-10-16 07:41:15.648892: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1402] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10265 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:1a:00.0, compute capability: 6.1)
2020-10-16 07:41:15,693 (train_fastspeech2:302) INFO: hop_size = 256
2020-10-16 07:41:15,694 (train_fastspeech2:302) INFO: format = npy
2020-10-16 07:41:15,694 (train_fastspeech2:302) INFO: model_type = fastspeech2
2020-10-16 07:41:15,694 (train_fastspeech2:302) INFO: fastspeech2_params = {'n_speakers': 1, 'encoder_hidden_size': 384, 'encoder_num_hidden_layers': 4, 'encoder_num_attention_heads': 2, 'encoder_attention_head_size': 192, 'encoder_intermediate_size': 1024, 'encoder_intermediate_kernel_size': 3, 'encoder_hidden_act': 'mish', 'decoder_hidden_size': 384, 'decoder_num_hidden_layers': 4, 'decoder_num_attention_heads': 2, 'decoder_attention_head_size': 192, 'decoder_intermediate_size': 1024, 'decoder_intermediate_kernel_size': 3, 'decoder_hidden_act': 'mish', 'variant_prediction_num_conv_layers': 2, 'variant_predictor_filter': 256, 'variant_predictor_kernel_size': 3, 'variant_predictor_dropout_rate': 0.5, 'num_mels': 80, 'hidden_dropout_prob': 0.2, 'attention_probs_dropout_prob': 0.1, 'max_position_embeddings': 2048, 'initializer_range': 0.02, 'output_attentions': False, 'output_hidden_states': False}
2020-10-16 07:41:15,694 (train_fastspeech2:302) INFO: batch_size = 16
2020-10-16 07:41:15,694 (train_fastspeech2:302) INFO: remove_short_samples = True
2020-10-16 07:41:15,694 (train_fastspeech2:302) INFO: allow_cache = True
2020-10-16 07:41:15,694 (train_fastspeech2:302) INFO: mel_length_threshold = 32
2020-10-16 07:41:15,694 (train_fastspeech2:302) INFO: is_shuffle = True
2020-10-16 07:41:15,694 (train_fastspeech2:302) INFO: optimizer_params = {'initial_learning_rate': 0.001, 'end_learning_rate': 5e-05, 'decay_steps': 150000, 'warmup_proportion': 0.02, 'weight_decay': 0.001}
2020-10-16 07:41:15,694 (train_fastspeech2:302) INFO: var_train_expr = None
2020-10-16 07:41:15,694 (train_fastspeech2:302) INFO: train_max_steps = 200000
2020-10-16 07:41:15,694 (train_fastspeech2:302) INFO: save_interval_steps = 5000
2020-10-16 07:41:15,694 (train_fastspeech2:302) INFO: eval_interval_steps = 500
2020-10-16 07:41:15,694 (train_fastspeech2:302) INFO: log_interval_steps = 200
2020-10-16 07:41:15,695 (train_fastspeech2:302) INFO: num_save_intermediate_results = 1
2020-10-16 07:41:15,695 (train_fastspeech2:302) INFO: train_dir = ljspeech/train
2020-10-16 07:41:15,695 (train_fastspeech2:302) INFO: dev_dir = ljspeech/valid
2020-10-16 07:41:15,695 (train_fastspeech2:302) INFO: use_norm = False
2020-10-16 07:41:15,695 (train_fastspeech2:302) INFO: f0_stat = ljspeech/stats_f0.npy
2020-10-16 07:41:15,695 (train_fastspeech2:302) INFO: energy_stat = ljspeech/stats_energy.npy
2020-10-16 07:41:15,695 (train_fastspeech2:302) INFO: outdir = examples/fastspeech2/exp/train.fastspeech2.v1
2020-10-16 07:41:15,695 (train_fastspeech2:302) INFO: config = examples/fastspeech2/conf/fastspeech2.v1.yaml
2020-10-16 07:41:15,695 (train_fastspeech2:302) INFO: resume = 
2020-10-16 07:41:15,695 (train_fastspeech2:302) INFO: verbose = 1
2020-10-16 07:41:15,695 (train_fastspeech2:302) INFO: mixed_precision = True
2020-10-16 07:41:15,695 (train_fastspeech2:302) INFO: pretrained = 
2020-10-16 07:41:15,695 (train_fastspeech2:302) INFO: version = 0.9
2020-10-16 07:41:23.747358: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
2020-10-16 07:41:23.952583: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7
Model: "tf_fast_speech2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embeddings (TFFastSpeechEmbe multiple                  844032    
_________________________________________________________________
encoder (TFFastSpeechEncoder multiple                  11814400  
_________________________________________________________________
length_regulator (TFFastSpee multiple                  0         
_________________________________________________________________
decoder (TFFastSpeechDecoder multiple                  12601216  
_________________________________________________________________
mel_before (Dense)           multiple                  30800     
_________________________________________________________________
postnet (TFTacotronPostnet)  multiple                  4352400   
_________________________________________________________________
f0_predictor (TFFastSpeechVa multiple                  493313    
_________________________________________________________________
energy_predictor (TFFastSpee multiple                  493313    
_________________________________________________________________
duration_predictor (TFFastSp multiple                  493313    
_________________________________________________________________
f0_embeddings (Conv1D)       multiple                  3840      
_________________________________________________________________
dropout_32 (Dropout)         multiple                  0         
_________________________________________________________________
energy_embeddings (Conv1D)   multiple                  3840      
_________________________________________________________________
dropout_33 (Dropout)         multiple                  0         
=================================================================
Total params: 31,130,467
Trainable params: 29,552,579
Non-trainable params: 1,577,888
_________________________________________________________________
[train]:   0%|                                                                                                                                  | 0/200000 [00:00<?, ?it/s]2020-10-16 07:41:25.693466: W tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1972] No (suitable) GPUs detected, skipping auto_mixed_precision_cuda graph optimizer
2020-10-16 07:41:25.706325: W tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1972] No (suitable) GPUs detected, skipping auto_mixed_precision_cuda graph optimizer
2020-10-16 07:41:35.699842: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:172] Filling up shuffle buffer (this may take a while): 727 of 22202
2020-10-16 07:41:45.625203: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:172] Filling up shuffle buffer (this may take a while): 1372 of 22202
2020-10-16 07:41:55.603418: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:172] Filling up shuffle buffer (this may take a while): 2111 of 22202
2020-10-16 07:42:05.707307: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:172] Filling up shuffle buffer (this may take a while): 2746 of 22202
2020-10-16 07:42:15.671866: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:172] Filling up shuffle buffer (this may take a while): 3409 of 22202
2020-10-16 07:42:25.604134: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:172] Filling up shuffle buffer (this may take a while): 4071 of 22202
2020-10-16 07:42:35.629239: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:172] Filling up shuffle buffer (this may take a while): 4860 of 22202
2020-10-16 07:42:45.647567: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:172] Filling up shuffle buffer (this may take a while): 5624 of 22202
2020-10-16 07:42:55.611273: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:172] Filling up shuffle buffer (this may take a while): 6400 of 22202
2020-10-16 07:43:05.645959: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:172] Filling up shuffle buffer (this may take a while): 7163 of 22202
2020-10-16 07:43:15.659361: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:172] Filling up shuffle buffer (this may take a while): 7976 of 22202
2020-10-16 07:43:25.647359: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:172] Filling up shuffle buffer (this may take a while): 8790 of 22202
2020-10-16 07:43:35.605123: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:172] Filling up shuffle buffer (this may take a while): 9589 of 22202
2020-10-16 07:43:45.616455: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:172] Filling up shuffle buffer (this may take a while): 10393 of 22202
2020-10-16 07:43:55.616363: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:172] Filling up shuffle buffer (this may take a while): 11230 of 22202
2020-10-16 07:44:05.695576: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:172] Filling up shuffle buffer (this may take a while): 11987 of 22202
2020-10-16 07:44:15.623326: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:172] Filling up shuffle buffer (this may take a while): 12717 of 22202
2020-10-16 07:44:25.618868: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:172] Filling up shuffle buffer (this may take a while): 13542 of 22202
2020-10-16 07:44:35.682692: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:172] Filling up shuffle buffer (this may take a while): 14145 of 22202
2020-10-16 07:44:45.677202: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:172] Filling up shuffle buffer (this may take a while): 14806 of 22202
2020-10-16 07:44:55.633085: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:172] Filling up shuffle buffer (this may take a while): 15420 of 22202
2020-10-16 07:45:05.655808: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:172] Filling up shuffle buffer (this may take a while): 16044 of 22202
2020-10-16 07:45:15.621795: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:172] Filling up shuffle buffer (this may take a while): 16672 of 22202
2020-10-16 07:45:25.612364: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:172] Filling up shuffle buffer (this may take a while): 17288 of 22202
2020-10-16 07:45:35.609333: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:172] Filling up shuffle buffer (this may take a while): 17866 of 22202
2020-10-16 07:45:45.674751: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:172] Filling up shuffle buffer (this may take a while): 18425 of 22202
2020-10-16 07:45:55.705538: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:172] Filling up shuffle buffer (this may take a while): 19015 of 22202
2020-10-16 07:46:05.659905: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:172] Filling up shuffle buffer (this may take a while): 19628 of 22202
2020-10-16 07:46:15.701749: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:172] Filling up shuffle buffer (this may take a while): 20181 of 22202
2020-10-16 07:46:25.655440: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:172] Filling up shuffle buffer (this may take a while): 20671 of 22202
2020-10-16 07:46:35.672903: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:172] Filling up shuffle buffer (this may take a while): 21185 of 22202
2020-10-16 07:46:45.680295: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:172] Filling up shuffle buffer (this may take a while): 21807 of 22202
2020-10-16 07:46:53.179570: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:221] Shuffle buffer filled.
/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/indexed_slices.py:433: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
2020-10-16 07:47:10.493995: W tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1972] No (suitable) GPUs detected, skipping auto_mixed_precision_cuda graph optimizer
2020-10-16 07:47:12.543917: W tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1972] No (suitable) GPUs detected, skipping auto_mixed_precision_cuda graph optimizer
Traceback (most recent call last):
  File "examples/fastspeech2/train_fastspeech2.py", line 416, in <module>
    main()
  File "examples/fastspeech2/train_fastspeech2.py", line 408, in main
    resume=args.resume,
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_tts/trainers/base_trainer.py", line 852, in fit
    self.run()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_tts/trainers/base_trainer.py", line 101, in run
    self._train_epoch()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_tts/trainers/base_trainer.py", line 123, in _train_epoch
    self._train_step(batch)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_tts/trainers/base_trainer.py", line 666, in _train_step
    self.one_step_forward(batch)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/def_function.py", line 780, in __call__
    result = self._call(*args, **kwds)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/def_function.py", line 840, in _call
    return self._stateless_fn(*args, **kwds)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 2829, in __call__
    return graph_function._filtered_call(args, kwargs)  # pylint: disable=protected-access
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 1848, in _filtered_call
    cancellation_manager=cancellation_manager)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 1924, in _call_flat
    ctx, args, cancellation_manager=cancellation_manager))
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 550, in call
    ctx=ctx)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/execute.py", line 60, in quick_execute
    inputs, attrs, num_outputs)
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
  (0) Invalid argument:  Incompatible shapes: [16,103,384] vs. [16,121,384]
     [[node tf_fast_speech2/add_1 (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_tts/models/fastspeech2.py:181) ]]
     [[tf_fast_speech2/length_regulator/while/LoopCond/_92/_132]]
  (1) Invalid argument:  Incompatible shapes: [16,103,384] vs. [16,121,384]
     [[node tf_fast_speech2/add_1 (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_tts/models/fastspeech2.py:181) ]]
0 successful operations.
0 derived errors ignored. [Op:__inference__one_step_forward_31726]

Errors may have originated from an input operation.
Input Source operations connected to node tf_fast_speech2/add_1:
 tf_fast_speech2/encoder/layer_._3/mul (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_tts/models/fastspeech.py:391)

Input Source operations connected to node tf_fast_speech2/add_1:
 tf_fast_speech2/encoder/layer_._3/mul (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_tts/models/fastspeech.py:391)

Function call stack:
_one_step_forward -> _one_step_forward

[train]:   0%|                                                                                                                                  | 0/200000 [05:48<?, ?it/s]

Can anyone help me point out my mistake? I really appreciate any feedback. Thank you. The TensorFlow version I use is 2.3 and the TensorFlowTTS version is 0.9.
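
As a first debugging step (not part of the original report, just a sketch): for FastSpeech2 the number of durations must equal the number of input IDs, and the durations must sum to the number of mel frames. The directory layout and file suffixes below are assumptions about the usual TensorFlowTTS dump format; adjust them to whatever your preprocessing actually produced.

```python
# Hypothetical sanity check over the dump directory; paths and suffixes are assumptions.
import glob
import os

import numpy as np

train_dir = "ljspeech/train"  # the same directory passed as --train-dir

for ids_path in sorted(glob.glob(os.path.join(train_dir, "ids", "*-ids.npy"))):
    utt = os.path.basename(ids_path).replace("-ids.npy", "")
    ids = np.load(ids_path)
    dur = np.load(os.path.join(train_dir, "durations", f"{utt}-durations.npy"))
    mel = np.load(os.path.join(train_dir, "raw-feats", f"{utt}-raw-feats.npy"))

    # The model assumes one duration (and one averaged f0/energy value) per input token ...
    if len(ids) != len(dur):
        print(f"{utt}: len(ids)={len(ids)} != len(durations)={len(dur)}")
    # ... and that the durations expand exactly to the mel length.
    if int(dur.sum()) != mel.shape[0]:
        print(f"{utt}: sum(durations)={int(dur.sum())} != mel frames={mel.shape[0]}")
```

Utterances flagged by the first check are the kind that can produce a `[16,103,384] vs. [16,121,384]` style mismatch like the one in the traceback above.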

OscarVanL commented 4 years ago

When you prepared your dataset, did you fix the frame differences?

See the mfa_extraction README; the specific script is described at the bottom: fix_mismatch.py.
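
For context, the "frame differences" here are between the total MFA duration (in frames) and the actual number of mel frames for an utterance. A minimal sketch of the idea, not the real fix_mismatch.py logic:

```python
# Rough illustration only: make the durations sum to the number of mel frames
# by absorbing the difference into the last token's duration.
import numpy as np

def fix_duration(durations: np.ndarray, n_mel_frames: int) -> np.ndarray:
    durations = durations.copy()
    diff = n_mel_frames - int(durations.sum())
    durations[-1] = max(int(durations[-1]) + diff, 0)
    return durations
```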

geneing commented 4 years ago

It looks like the error is in the encoder. I believe I had this problem when I used an incompatible ljspeech_mapper.json between data preparation and training. In my case it was the result of having both a pip-installed version and an updated git source version on the same computer.
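
One way to rule that out is to diff the mapper written at preprocessing time against the one actually loaded at training time. This is a hedged sketch: the paths and the "symbol_to_id" key are assumptions, so inspect your own ljspeech_mapper.json files to confirm their structure.

```python
# Compare two copies of ljspeech_mapper.json; the paths are placeholders.
import json

def load_symbols(path):
    with open(path) as f:
        return json.load(f).get("symbol_to_id")

preprocess_mapper = load_symbols("dump/ljspeech_mapper.json")            # written during preprocessing
training_mapper = load_symbols("other_install/ljspeech_mapper.json")     # the copy training actually loads
print("mappers identical:", preprocess_mapper == training_mapper)
```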

junaedifahmi commented 3 years ago

> When you prepared your dataset, did you fix the frame differences?
>
> See the mfa_extraction README; the specific script is described at the bottom: fix_mismatch.py.

I did, and I think it worked fine. This is the output I got from running it:

root@aa1de9a7d6a4:/workspace# python examples/mfa_extraction/fix_mismatch.py --base_path ./ljspeech/ --trimmed_dur_path ./LJSpeech_ln/trimmed-durations --dur_path LJSpeech_ln/durations/
2020-10-20 07:55:28,464 (fix_mismatch:46) INFO: FIXING train set ...
100%|█████████████████████████████████████████████████████| 22202/22202 [00:29<00:00, 742.88it/s]
2020-10-20 07:55:58,377 (fix_mismatch:107) INFO: train stats: number of mfa with longer duration: 11731, total diff: 36139, mean diff: 3.0806410365697725
2020-10-20 07:55:58,380 (fix_mismatch:111) INFO: train stats: number of mfa with shorter duration: 6899, total diff: 12333, mean diff: 1.7876503841136397
2020-10-20 07:55:58,380 (fix_mismatch:115) INFO: train stats: number of files with a ''big'' duration diff: 0 if number>1 you should check it
2020-10-20 07:55:58,380 (fix_mismatch:117) INFO: train stats: not fixed len: 0
2020-10-20 07:55:58,382 (fix_mismatch:46) INFO: FIXING valid set ...
100%|███████████████████████████████████████████████████████| 1169/1169 [00:09<00:00, 129.21it/s]
2020-10-20 07:56:07,433 (fix_mismatch:107) INFO: valid stats: number of mfa with longer duration: 613, total diff: 1980, mean diff: 3.230016313213703
2020-10-20 07:56:07,433 (fix_mismatch:111) INFO: valid stats: number of mfa with shorter duration: 361, total diff: 669, mean diff: 1.853185595567867
2020-10-20 07:56:07,433 (fix_mismatch:115) INFO: valid stats: number of files with a ''big'' duration diff: 0 if number>1 you should check it
2020-10-20 07:56:07,434 (fix_mismatch:117) INFO: valid stats: not fixed len: 0

But the problem is still the same.

Nistrian commented 3 years ago

@juunnn Hi, I have the same problem as you. Have you found a solution?

dathudeptrai commented 3 years ago

The error says that the length of the character/phoneme input differs from the length of the f0/energy embeddings. You should print the shapes and check this line (https://github.com/TensorSpeech/TensorFlowTTS/blob/master/tensorflow_tts/models/fastspeech2.py#L185). To debug, add tf.config.run_functions_eagerly(True) at the top of the training file to enable eager mode.
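
A concrete way to follow that suggestion (the batch dictionary keys below are assumptions; use whatever keys your batch actually contains):

```python
# At the top of examples/fastspeech2/train_fastspeech2.py, after importing TensorFlow:
import tensorflow as tf

tf.config.run_functions_eagerly(True)  # run tf.function bodies eagerly so prints show concrete shapes

# Then, inside the training step (just before the failing add around fastspeech2.py#L185),
# print the lengths that must agree; the key names here are assumptions:
# print(batch["input_ids"].shape, batch["duration_gts"].shape,
#       batch["f0_gts"].shape, batch["energy_gts"].shape)
```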

junaedifahmi commented 3 years ago

> @juunnn Hi, I have the same problem as you. Have you found a solution?

For the MFA extraction route I don't have a clue, but I was able to move forward by first training Tacotron2 and using the durations extracted from it. Surprisingly, my Tacotron2 training produced good-sounding output after only 25 epochs.

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

martinXie commented 3 years ago

I have the same issue:

2020-12-28 11:03:06,585 (train_fastspeech2:299) INFO: hop_size = 256
2020-12-28 11:03:06,585 (train_fastspeech2:299) INFO: format = npy
2020-12-28 11:03:06,585 (train_fastspeech2:299) INFO: model_type = fastspeech2
2020-12-28 11:03:06,586 (train_fastspeech2:299) INFO: fastspeech2_params = {'n_speakers': 1, 'encoder_hidden_size': 384, 'encoder_num_hidden_layers': 4, 'encoder_num_attention_heads': 2, 'encoder_attention_head_size': 192, 'encoder_intermediate_size': 1024, 'encoder_intermediate_kernel_size': 3, 'encoder_hidden_act': 'mish', 'decoder_hidden_size': 384, 'decoder_num_hidden_layers': 4, 'decoder_num_attention_heads': 2, 'decoder_attention_head_size': 192, 'decoder_intermediate_size': 1024, 'decoder_intermediate_kernel_size': 3, 'decoder_hidden_act': 'mish', 'variant_prediction_num_conv_layers': 2, 'variant_predictor_filter': 256, 'variant_predictor_kernel_size': 3, 'variant_predictor_dropout_rate': 0.5, 'num_mels': 80, 'hidden_dropout_prob': 0.2, 'attention_probs_dropout_prob': 0.1, 'max_position_embeddings': 2048, 'initializer_range': 0.02, 'output_attentions': False, 'output_hidden_states': False}
2020-12-28 11:03:06,586 (train_fastspeech2:299) INFO: batch_size = 16
2020-12-28 11:03:06,586 (train_fastspeech2:299) INFO: remove_short_samples = True
2020-12-28 11:03:06,586 (train_fastspeech2:299) INFO: allow_cache = True
2020-12-28 11:03:06,586 (train_fastspeech2:299) INFO: mel_length_threshold = 32
2020-12-28 11:03:06,586 (train_fastspeech2:299) INFO: is_shuffle = True
2020-12-28 11:03:06,586 (train_fastspeech2:299) INFO: optimizer_params = {'initial_learning_rate': 0.001, 'end_learning_rate': 5e-05, 'decay_steps': 150000, 'warmup_proportion': 0.02, 'weight_decay': 0.001}
2020-12-28 11:03:06,586 (train_fastspeech2:299) INFO: gradient_accumulation_steps = 1
2020-12-28 11:03:06,586 (train_fastspeech2:299) INFO: var_train_expr = None
2020-12-28 11:03:06,586 (train_fastspeech2:299) INFO: train_max_steps = 200000
2020-12-28 11:03:06,586 (train_fastspeech2:299) INFO: save_interval_steps = 5000
2020-12-28 11:03:06,586 (train_fastspeech2:299) INFO: eval_interval_steps = 500
2020-12-28 11:03:06,586 (train_fastspeech2:299) INFO: log_interval_steps = 200
2020-12-28 11:03:06,586 (train_fastspeech2:299) INFO: num_save_intermediate_results = 1
2020-12-28 11:03:06,586 (train_fastspeech2:299) INFO: train_dir = ./dump_ljspeech/train/
2020-12-28 11:03:06,586 (train_fastspeech2:299) INFO: dev_dir = ./dump_ljspeech/valid/
2020-12-28 11:03:06,586 (train_fastspeech2:299) INFO: use_norm = True
2020-12-28 11:03:06,586 (train_fastspeech2:299) INFO: f0_stat = ./dump_ljspeech/stats_f0.npy
2020-12-28 11:03:06,586 (train_fastspeech2:299) INFO: energy_stat = ./dump_ljspeech/stats_energy.npy
2020-12-28 11:03:06,586 (train_fastspeech2:299) INFO: outdir = ./examples/fastspeech2/exp/train.fastspeech2.v1/
2020-12-28 11:03:06,586 (train_fastspeech2:299) INFO: config = ./examples/fastspeech2/conf/fastspeech2.v1.yaml
2020-12-28 11:03:06,586 (train_fastspeech2:299) INFO: resume = 
2020-12-28 11:03:06,586 (train_fastspeech2:299) INFO: verbose = 1
2020-12-28 11:03:06,586 (train_fastspeech2:299) INFO: mixed_precision = True
2020-12-28 11:03:06,586 (train_fastspeech2:299) INFO: pretrained = 
2020-12-28 11:03:06,586 (train_fastspeech2:299) INFO: version = 0.0
2020-12-28 11:03:07,557 (api:603) INFO: charactor Tensor("PyFunc_1:0", dtype=int32, device=/job:localhost/replica:0/task:0)
2020-12-28 11:03:08,094 (api:603) INFO: duration Tensor("PyFunc_2:0", dtype=int32, device=/job:localhost/replica:0/task:0)
2020-12-28 11:03:08,098 (api:603) INFO: f0 Tensor("PyFunc_7:0", dtype=float32, device=/job:localhost/replica:0/task:0)
2020-12-28 11:03:08,103 (api:603) INFO: energy Tensor("PyFunc_8:0", dtype=float32, device=/job:localhost/replica:0/task:0)
2020-12-28 11:03:08,108 (api:603) INFO: mel Tensor("PyFunc:0", dtype=float32, device=/job:localhost/replica:0/task:0)
2020-12-28 11:03:08,183 (api:603) INFO: charactor Tensor("PyFunc_1:0", dtype=int32, device=/job:localhost/replica:0/task:0)
2020-12-28 11:03:08,188 (api:603) INFO: duration Tensor("PyFunc_2:0", dtype=int32, device=/job:localhost/replica:0/task:0)
2020-12-28 11:03:08,193 (api:603) INFO: f0 Tensor("PyFunc_7:0", dtype=float32, device=/job:localhost/replica:0/task:0)
2020-12-28 11:03:08,198 (api:603) INFO: energy Tensor("PyFunc_8:0", dtype=float32, device=/job:localhost/replica:0/task:0)
2020-12-28 11:03:08,202 (api:603) INFO: mel Tensor("PyFunc:0", dtype=float32, device=/job:localhost/replica:0/task:0)
2020-12-28 11:03:10.603306: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
2020-12-28 11:03:10.735770: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7
f0_embedding (1, 10, 384) energey_embedding (1, 10, 384)
Model: "tf_fast_speech2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embeddings (TFFastSpeechEmbe multiple                  844032    
_________________________________________________________________
encoder (TFFastSpeechEncoder multiple                  11814400  
_________________________________________________________________
length_regulator (TFFastSpee multiple                  0         
_________________________________________________________________
decoder (TFFastSpeechDecoder multiple                  12601216  
_________________________________________________________________
mel_before (Dense)           multiple                  30800     
_________________________________________________________________
postnet (TFTacotronPostnet)  multiple                  4352400   
_________________________________________________________________
f0_predictor (TFFastSpeechVa multiple                  493313    
_________________________________________________________________
energy_predictor (TFFastSpee multiple                  493313    
_________________________________________________________________
duration_predictor (TFFastSp multiple                  493313    
_________________________________________________________________
f0_embeddings (Conv1D)       multiple                  3840      
_________________________________________________________________
dropout_32 (Dropout)         multiple                  0         
_________________________________________________________________
energy_embeddings (Conv1D)   multiple                  3840      
_________________________________________________________________
dropout_33 (Dropout)         multiple                  0         
=================================================================
Total params: 31,130,467
Trainable params: 29,552,579
Non-trainable params: 1,577,888
_________________________________________________________________


train: 0%| | 0/200000 [00:00<?, ?it/s]2020-12-28 11:03:11.788850: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1345] No whitelist ops found, nothing to do
2020-12-28 11:03:11.793360: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1345] No whitelist ops found, nothing to do
2020-12-28 11:03:21.779227: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:172] Filling up shuffle buffer (this may take a while): 508 of 12445
2020-12-28 11:03:31.782370: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:172] Filling up shuffle buffer (this may take a while): 983 of 12445
2020-12-28 11:03:41.807584: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:172] Filling up shuffle buffer (this may take a while): 1600 of 12445
2020-12-28 11:03:51.757672: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:172] Filling up shuffle buffer (this may take a while): 2320 of 12445
2020-12-28 11:04:01.761795: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:172] Filling up shuffle buffer (this may take a while): 2983 of 12445
2020-12-28 11:04:11.825262: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:172] Filling up shuffle buffer (this may take a while): 3440 of 12445
2020-12-28 11:04:21.895227: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:172] Filling up shuffle buffer (this may take a while): 3927 of 12445
2020-12-28 11:04:31.810633: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:172] Filling up shuffle buffer (this may take a while): 4470 of 12445
2020-12-28 11:04:41.767607: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:172] Filling up shuffle buffer (this may take a while): 5108 of 12445
2020-12-28 11:04:51.810037: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:172] Filling up shuffle buffer (this may take a while): 5574 of 12445
2020-12-28 11:05:01.835467: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:172] Filling up shuffle buffer (this may take a while): 6060 of 12445
2020-12-28 11:05:11.777472: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:172] Filling up shuffle buffer (this may take a while): 6557 of 12445
2020-12-28 11:05:21.782114: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:172] Filling up shuffle buffer (this may take a while): 7036 of 12445
2020-12-28 11:05:31.872478: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:172] Filling up shuffle buffer (this may take a while): 7485 of 12445
2020-12-28 11:05:41.853084: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:172] Filling up shuffle buffer (this may take a while): 7938 of 12445
2020-12-28 11:05:51.867180: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:172] Filling up shuffle buffer (this may take a while): 8709 of 12445
2020-12-28 11:06:01.797630: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:172] Filling up shuffle buffer (this may take a while): 9654 of 12445
2020-12-28 11:06:11.759848: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:172] Filling up shuffle buffer (this may take a while): 10172 of 12445
2020-12-28 11:06:21.800689: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:172] Filling up shuffle buffer (this may take a while): 10910 of 12445
2020-12-28 11:06:31.767144: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:172] Filling up shuffle buffer (this may take a while): 11603 of 12445
2020-12-28 11:06:41.762450: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:172] Filling up shuffle buffer (this may take a while): 12247 of 12445
2020-12-28 11:06:44.438570: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:221] Shuffle buffer filled.
f0_embedding (16, None, 384) energey_embedding (16, None, 384)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/indexed_slices.py:433: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
f0_embedding (16, None, 384) energey_embedding (16, None, 384)
2020-12-28 11:06:57.440518: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1924] Converted 1157/10710 nodes to float16 precision using 113 cast(s) to float16 (excluding Const and Variable casts)
2020-12-28 11:06:59.001070: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1924] Converted 0/8830 nodes to float16 precision using 0 cast(s) to float16 (excluding Const and Variable casts)
Traceback (most recent call last):
  File "examples/fastspeech2/train_fastspeech2.py", line 416, in <module>
    main()
  File "examples/fastspeech2/train_fastspeech2.py", line 408, in main
    resume=args.resume,
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_tts/trainers/base_trainer.py", line 999, in fit
    self.run()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_tts/trainers/base_trainer.py", line 103, in run
    self._train_epoch()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_tts/trainers/base_trainer.py", line 125, in _train_epoch
    self._train_step(batch)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_tts/trainers/base_trainer.py", line 777, in _train_step
    self.one_step_forward(batch)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/def_function.py", line 780, in __call__
    result = self._call(*args, **kwds)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/def_function.py", line 840, in _call
    return self._stateless_fn(*args, **kwds)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 2829, in __call__
    return graph_function._filtered_call(args, kwargs)  # pylint: disable=protected-access
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 1848, in _filtered_call
    cancellation_manager=cancellation_manager)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 1924, in _call_flat
    ctx, args, cancellation_manager=cancellation_manager))
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 550, in call
    ctx=ctx)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/execute.py", line 60, in quick_execute
    inputs, attrs, num_outputs)
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
  (0) Invalid argument:  Incompatible shapes: [16,142,384] vs. [16,143,384]
     [[node tf_fast_speech2/add_1 (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_tts/models/fastspeech2.py:186) ]]
     [[tf_fast_speech2/length_regulator/while/loop_body_control/_117/_135]]
  (1) Invalid argument:  Incompatible shapes: [16,142,384] vs. [16,143,384]
     [[node tf_fast_speech2/add_1 (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_tts/models/fastspeech2.py:186) ]]
0 successful operations.
0 derived errors ignored. [Op:__inference__one_step_forward_33304]

Errors may have originated from an input operation.
Input Source operations connected to node tf_fast_speech2/add_1:
 tf_fast_speech2/encoder/layer_._3/mul (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_tts/models/fastspeech.py:411)

Input Source operations connected to node tf_fast_speech2/add_1:
 tf_fast_speech2/encoder/layer_._3/mul (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_tts/models/fastspeech.py:411)

Function call stack:
_one_step_forward -> _one_step_forward

dathudeptrai commented 3 years ago

Let me try to reproduce the bug; I will run everything from A to Z based on the repo's README.

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.