Incompatible shapes when training Fastspeech2 #310

junaedifahmi commented 4 years ago

Hi, first of all, thank you for providing this great framework; it has really helped me. I ran into a problem while trying to train FastSpeech2 on my own dataset. The dataset is in LJSpeech format; the only difference is the sample rate, which is 16K instead of 22.05K. I did the preprocessing with that sample rate, extracted durations using MFA (because I don't have a Tacotron2 model), and trained FastSpeech2 with the default hyperparameters (fastspeech2.v1.yaml). I followed all the steps in the mfa_extraction example and got no errors or warnings. But when I start training the FastSpeech2 model, it fails with an error about incompatible shapes. I updated the TensorFlowTTS package to version 0.9, but the error is still the same. This is the complete trace of the error.

CUDA_VISIBLE_DEVICES=0 python examples/fastspeech2/train_fastspeech2.py --train-dir ljspeech/train --dev-dir ljspeech/valid --outdir examples/fastspeech2/exp/train.fastspeech2.v1 --config examples/fastspeech2/conf/fastspeech2.v1.yaml --use-norm 0 --f0-stat ljspeech/stats_f0.npy --energy-stat ljspeech/stats_energy.npy --mixed_precision 1 --resume ""
2020-10-16 07:41:15,693 (train_fastspeech2:302) INFO: hop_size = 256
2020-10-16 07:41:15,694 (train_fastspeech2:302) INFO: format = npy
2020-10-16 07:41:15,694 (train_fastspeech2:302) INFO: model_type = fastspeech2
2020-10-16 07:41:15,694 (train_fastspeech2:302) INFO: fastspeech2_params = {'n_speakers': 1, 'encoder_hidden_size': 384, 'encoder_num_hidden_layers': 4, 'encoder_num_attention_heads': 2, 'encoder_attention_head_size': 192, 'encoder_intermediate_size': 1024, 'encoder_intermediate_kernel_size': 3, 'encoder_hidden_act': 'mish', 'decoder_hidden_size': 384, 'decoder_num_hidden_layers': 4, 'decoder_num_attention_heads': 2, 'decoder_attention_head_size': 192, 'decoder_intermediate_size': 1024, 'decoder_intermediate_kernel_size': 3, 'decoder_hidden_act': 'mish', 'variant_prediction_num_conv_layers': 2, 'variant_predictor_filter': 256, 'variant_predictor_kernel_size': 3, 'variant_predictor_dropout_rate': 0.5, 'num_mels': 80, 'hidden_dropout_prob': 0.2, 'attention_probs_dropout_prob': 0.1, 'max_position_embeddings': 2048, 'initializer_range': 0.02, 'output_attentions': False, 'output_hidden_states': False}
2020-10-16 07:41:15,694 (train_fastspeech2:302) INFO: batch_size = 16
2020-10-16 07:41:15,694 (train_fastspeech2:302) INFO: remove_short_samples = True
2020-10-16 07:41:15,694 (train_fastspeech2:302) INFO: allow_cache = True
2020-10-16 07:41:15,694 (train_fastspeech2:302) INFO: mel_length_threshold = 32
2020-10-16 07:41:15,694 (train_fastspeech2:302) INFO: is_shuffle = True
2020-10-16 07:41:15,694 (train_fastspeech2:302) INFO: optimizer_params = {'initial_learning_rate': 0.001, 'end_learning_rate': 5e-05, 'decay_steps': 150000, 'warmup_proportion': 0.02, 'weight_decay': 0.001}
2020-10-16 07:41:15,694 (train_fastspeech2:302) INFO: var_train_expr = None
2020-10-16 07:41:15,694 (train_fastspeech2:302) INFO: train_max_steps = 200000
2020-10-16 07:41:15,694 (train_fastspeech2:302) INFO: save_interval_steps = 5000
2020-10-16 07:41:15,694 (train_fastspeech2:302) INFO: eval_interval_steps = 500
2020-10-16 07:41:15,694 (train_fastspeech2:302) INFO: log_interval_steps = 200
2020-10-16 07:41:15,695 (train_fastspeech2:302) INFO: num_save_intermediate_results = 1
2020-10-16 07:41:15,695 (train_fastspeech2:302) INFO: train_dir = ljspeech/train
2020-10-16 07:41:15,695 (train_fastspeech2:302) INFO: dev_dir = ljspeech/valid
2020-10-16 07:41:15,695 (train_fastspeech2:302) INFO: use_norm = False
2020-10-16 07:41:15,695 (train_fastspeech2:302) INFO: f0_stat = ljspeech/stats_f0.npy
2020-10-16 07:41:15,695 (train_fastspeech2:302) INFO: energy_stat = ljspeech/stats_energy.npy
2020-10-16 07:41:15,695 (train_fastspeech2:302) INFO: outdir = examples/fastspeech2/exp/train.fastspeech2.v1
2020-10-16 07:41:15,695 (train_fastspeech2:302) INFO: config = examples/fastspeech2/conf/fastspeech2.v1.yaml
2020-10-16 07:41:15,695 (train_fastspeech2:302) INFO: resume = 
2020-10-16 07:41:15,695 (train_fastspeech2:302) INFO: verbose = 1
2020-10-16 07:41:15,695 (train_fastspeech2:302) INFO: mixed_precision = True
2020-10-16 07:41:15,695 (train_fastspeech2:302) INFO: pretrained = 
2020-10-16 07:41:15,695 (train_fastspeech2:302) INFO: version = 0.9
Model: "tf_fast_speech2"
Layer (type)                 Output Shape              Param #   
embeddings (TFFastSpeechEmbe multiple                  844032    
encoder (TFFastSpeechEncoder multiple                  11814400  
length_regulator (TFFastSpee multiple                  0         
decoder (TFFastSpeechDecoder multiple                  12601216  
mel_before (Dense)           multiple                  30800     
postnet (TFTacotronPostnet)  multiple                  4352400   
f0_predictor (TFFastSpeechVa multiple                  493313    
energy_predictor (TFFastSpee multiple                  493313    
duration_predictor (TFFastSp multiple                  493313    
f0_embeddings (Conv1D)       multiple                  3840      
dropout_32 (Dropout)         multiple                  0         
energy_embeddings (Conv1D)   multiple                  3840      
dropout_33 (Dropout)         multiple                  0         
Total params: 31,130,467
Trainable params: 29,552,579
Non-trainable params: 1,577,888
Traceback (most recent call last):
  File "examples/fastspeech2/train_fastspeech2.py", line 416, in <module>
  File "examples/fastspeech2/train_fastspeech2.py", line 408, in main
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_tts/trainers/base_trainer.py", line 852, in fit
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_tts/trainers/base_trainer.py", line 101, in run
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_tts/trainers/base_trainer.py", line 123, in _train_epoch
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_tts/trainers/base_trainer.py", line 666, in _train_step
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/def_function.py", line 780, in __call__
    result = self._call(*args, **kwds)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/def_function.py", line 840, in _call
    return self._stateless_fn(*args, **kwds)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 2829, in __call__
    return graph_function._filtered_call(args, kwargs)  # pylint: disable=protected-access
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 1848, in _filtered_call
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 1924, in _call_flat
    ctx, args, cancellation_manager=cancellation_manager))
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 550, in call
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/execute.py", line 60, in quick_execute
    inputs, attrs, num_outputs)
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
  (0) Invalid argument:  Incompatible shapes: [16,103,384] vs. [16,121,384]
     [[node tf_fast_speech2/add_1 (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_tts/models/fastspeech2.py:181) ]]
  (1) Invalid argument:  Incompatible shapes: [16,103,384] vs. [16,121,384]
     [[node tf_fast_speech2/add_1 (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_tts/models/fastspeech2.py:181) ]]
0 successful operations.
0 derived errors ignored. [Op:__inference__one_step_forward_31726]

Errors may have originated from an input operation.
Input Source operations connected to node tf_fast_speech2/add_1:
 tf_fast_speech2/encoder/layer_._3/mul (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_tts/models/fastspeech.py:391)

Input Source operations connected to node tf_fast_speech2/add_1:
 tf_fast_speech2/encoder/layer_._3/mul (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_tts/models/fastspeech.py:391)

Function call stack:
_one_step_forward -> _one_step_forward

[train]:   0%|          | 0/200000 [05:48<?, ?it/s]

Can anyone help me find my mistake? I would really appreciate any feedback. Thank you. I am using TensorFlow 2.3 and TensorFlowTTS 0.9.

OscarVanL commented 4 years ago

When you prepared your dataset did you fix the frame differences?

See this readme. The specific script is at the bottom, fix_mismatch.py

geneing commented 4 years ago

Looks like the error is in the encoder. I believe I had this problem when the ljspeech_mapper.json used for preparing the data was incompatible with the one used for training. In my case it was the result of having both a pip-installed version and an updated git source version on the same computer.
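One way to test for this kind of mismatch is to compare the lengths of the preprocessed features for each utterance. The sketch below is a hypothetical helper, not part of TensorFlowTTS. It assumes that each utterance has an array of input ids, an array of per-phoneme durations, and a known number of mel frames, and that the f0/energy features are reduced to one value per duration entry before they are added to the encoder output. How the arrays are loaded (for example with `np.load` on the files produced by preprocessing) is left out because the exact file layout is not shown in this thread.

```python
# Hypothetical sanity check; the function name and arguments are illustrative only.

def check_utterance(ids, durations, n_mel_frames):
    """Return a list of human-readable problems for one utterance (empty if none)."""
    problems = []
    # The encoder output has one step per input id, so ids and durations
    # must have the same length (assumption stated in the text above).
    if len(ids) != len(durations):
        problems.append(f"{len(ids)} input ids but {len(durations)} durations")
    # The durations should add up to the number of mel frames.
    total = sum(int(d) for d in durations)
    if total != n_mel_frames:
        problems.append(f"durations sum to {total} but mel has {n_mel_frames} frames")
    return problems


# Toy data, not taken from any real dataset.
ok = check_utterance([5, 9, 2], [3, 4, 1], 8)
bad = check_utterance([5, 9, 2, 7], [3, 4, 1], 8)
print(ok)
print(bad)
```

Running a check like this over every utterance in the train and valid folders would show whether the ids were generated with a different symbol mapping than the durations.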

junaedifahmi commented 3 years ago

> When you prepared your dataset did you fix the frame differences?
>
> See this readme. The specific script is at the bottom, fix_mismatch.py

I did, and I think it worked fine. This is the output I got:

root@aa1de9a7d6a4:/workspace# python examples/mfa_extraction/fix_mismatch.py --base_path ./ljspeech/ --trimmed_dur_path ./LJSpeech_ln/trimmed-durations --dur_path LJSpeech_ln/durations/
2020-10-20 07:55:28,464 (fix_mismatch:46) INFO: FIXING train set ...
100%|██████████| 22202/22202 [00:29<00:00, 742.88it/s]
2020-10-20 07:55:58,377 (fix_mismatch:107) INFO: train stats: number of mfa with longer duration: 11731, total diff: 36139, mean diff: 3.0806410365697725
2020-10-20 07:55:58,380 (fix_mismatch:111) INFO: train stats: number of mfa with shorter duration: 6899, total diff: 12333, mean diff: 1.7876503841136397
2020-10-20 07:55:58,380 (fix_mismatch:115) INFO: train stats: number of files with a ''big'' duration diff: 0 if number>1 you should check it
2020-10-20 07:55:58,380 (fix_mismatch:117) INFO: train stats: not fixed len: 0
2020-10-20 07:55:58,382 (fix_mismatch:46) INFO: FIXING valid set ...
100%|██████████| 1169/1169 [00:09<00:00, 129.21it/s]
2020-10-20 07:56:07,433 (fix_mismatch:107) INFO: valid stats: number of mfa with longer duration: 613, total diff: 1980, mean diff: 3.230016313213703
2020-10-20 07:56:07,433 (fix_mismatch:111) INFO: valid stats: number of mfa with shorter duration: 361, total diff: 669, mean diff: 1.853185595567867
2020-10-20 07:56:07,433 (fix_mismatch:115) INFO: valid stats: number of files with a ''big'' duration diff: 0 if number>1 you should check it
2020-10-20 07:56:07,434 (fix_mismatch:117) INFO: valid stats: not fixed len: 0

But the problem is still the same.

Nistrian commented 3 years ago

@juunnn Hi, I have the same problem as yours, have you found a solution?

dathudeptrai commented 3 years ago

This error means that the length of the character/phoneme input differs from the length of the f0/energy embeddings. You should print the shapes and check this line (https://github.com/TensorSpeech/TensorFlowTTS/blob/master/tensorflow_tts/models/fastspeech2.py#L185). To debug, add tf.config.run_functions_eagerly(True) at the top of the training file to enable eager mode.
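Once eager mode is on, the shapes of the tensors being added can be read as ordinary Python tuples. The sketch below is a hypothetical helper, not part of TensorFlowTTS, that groups tensors by their length along the time axis so a mismatch is easy to spot. The example shapes are the two from the error message above. The trace does not say which operand is which, so the labels `operand_0` and `operand_1` are placeholders.

```python
# Hypothetical debugging helper; not part of TensorFlowTTS.
# Assumption: eager mode was enabled at the top of the training script with
#   tf.config.run_functions_eagerly(True)
# so that tuple(tensor.shape) holds concrete values inside the model.

def find_time_axis_mismatch(named_shapes, time_axis=1):
    """Group labels by their length along the time axis.

    named_shapes: dict mapping a label to a shape tuple, e.g. tuple(tensor.shape).
    Returns {length: [labels]}; more than one key means the lengths disagree.
    """
    groups = {}
    for name, shape in named_shapes.items():
        groups.setdefault(shape[time_axis], []).append(name)
    return groups


# Shapes copied from the "Incompatible shapes" message in this thread.
shapes = {
    "operand_0": (16, 103, 384),
    "operand_1": (16, 121, 384),
}
groups = find_time_axis_mismatch(shapes)
for length, names in sorted(groups.items()):
    print(f"time length {length}: {', '.join(names)}")
```

In a real run, the dictionary would be filled with the shapes of the encoder output and the f0/energy embeddings just before the failing addition.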

junaedifahmi commented 3 years ago

> @juunnn Hi, I have the same problem as yours, have you found a solution?

I don't have a clue about the MFA extraction problem, but I was able to move forward by first training Tacotron2 and extracting the durations from it. Surprisingly, my Tacotron2 model produces good sound after only 25 epochs.

martinXie commented 3 years ago

dathudeptrai commented 3 years ago

Let me try to reproduce the bug; I will run everything from A to Z based on the repo's README.

