TensorSpeech / TensorFlowTTS

:stuck_out_tongue_closed_eyes: TensorFlowTTS: Real-Time State-of-the-art Speech Synthesis for Tensorflow 2 (supported including English, French, Korean, Chinese, German and Easy to adapt for other languages)
https://tensorspeech.github.io/TensorFlowTTS/
Apache License 2.0

Error when training fastSpeech2: Incompatible shapes: [16,132,384] vs. [16,192,384] #518

Closed · janbijster closed this issue 3 years ago

janbijster commented 3 years ago

Hi, first of all: thank you very much for this valuable repo!

When I train the fastspeech2 model on my data, I keep running into the same error:

2021-03-15 12:26:05.475500: W tensorflow/core/kernels/data/cache_dataset_ops.cc:798] The calling iterator did not fully read the dataset being cached. In order to avoid unexpected truncation of the dataset, the partially cached contents of the dataset will be discarded. This can happen if you have an input pipeline similar to `dataset.cache().take(k).repeat()`. You should use `dataset.take(k).cache().repeat()` instead.
Traceback (most recent call last):
  File "train_fastspeech2.py", line 417, in <module>
    main()
  File "train_fastspeech2.py", line 409, in main
    resume=args.resume,
  File "/home/orator/Desktop/Training/TensorFlowTTS/tensorflow_tts/trainers/base_trainer.py", line 1010, in fit
    self.run()
  File "/home/orator/Desktop/Training/TensorFlowTTS/tensorflow_tts/trainers/base_trainer.py", line 104, in run
    self._train_epoch()
  File "/home/orator/Desktop/Training/TensorFlowTTS/tensorflow_tts/trainers/base_trainer.py", line 126, in _train_epoch
    self._train_step(batch)
  File "/home/orator/Desktop/Training/TensorFlowTTS/tensorflow_tts/trainers/base_trainer.py", line 782, in _train_step
    self.one_step_forward(batch)
  File "/home/orator/anaconda3/envs/TTSTrain/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 780, in __call__
    result = self._call(*args, **kwds)
  File "/home/orator/anaconda3/envs/TTSTrain/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 840, in _call
    return self._stateless_fn(*args, **kwds)
  File "/home/orator/anaconda3/envs/TTSTrain/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 2829, in __call__
    return graph_function._filtered_call(args, kwargs)  # pylint: disable=protected-access
  File "/home/orator/anaconda3/envs/TTSTrain/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 1848, in _filtered_call
    cancellation_manager=cancellation_manager)
  File "/home/orator/anaconda3/envs/TTSTrain/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 1924, in _call_flat
    ctx, args, cancellation_manager=cancellation_manager))
  File "/home/orator/anaconda3/envs/TTSTrain/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 550, in call
    ctx=ctx)
  File "/home/orator/anaconda3/envs/TTSTrain/lib/python3.7/site-packages/tensorflow/python/eager/execute.py", line 60, in quick_execute
    inputs, attrs, num_outputs)
tensorflow.python.framework.errors_impl.InvalidArgumentError:  Incompatible shapes: [16,31,384] vs. [16,50,384]
     [[node tf_fast_speech2/add_1 (defined at /home/orator/Desktop/Training/TensorFlowTTS/tensorflow_tts/models/fastspeech2.py:185) ]] [Op:__inference__one_step_forward_28592]

Errors may have originated from an input operation.
Input Source operations connected to node tf_fast_speech2/add_1:
 tf_fast_speech2/encoder/layer_._3/mul (defined at /home/orator/Desktop/Training/TensorFlowTTS/tensorflow_tts/models/fastspeech.py:411)

Function call stack:
_one_step_forward

I don't believe the cache message is the problem; it disappears when I turn off allow_cache.

I think the relevant line is tensorflow.python.framework.errors_impl.InvalidArgumentError: Incompatible shapes: [16,31,384] vs. [16,50,384]

If I shuffle the data, the middle numbers (31 and 50) in the error change (different samples), but they always differ by a factor of roughly 1.5.
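For illustration, the failing node (tf_fast_speech2/add_1) is an element-wise add, which requires its two operands to agree on every axis. A minimal sketch of what the error boils down to, with dummy tensors; the axis interpretation is my assumption, inferred from batch_size: 16 and encoder_hidden_size: 384 in the configs below:

import tensorflow as tf

# Dummy stand-ins for the two operands of the failing add,
# with assumed shape (batch, sequence_length, hidden_size).
a = tf.zeros([16, 50, 384])
b = tf.zeros([16, 31, 384])
a + b  # InvalidArgumentError: Incompatible shapes: [16,50,384] vs. [16,31,384]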

I checked the durations of the data: for every sample, the sum of the elements in ...durations.npy equals the size of the first dimension of ...norm-feats.npy, ...raw-energy.npy and ...raw-f0.npy.
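Here is a sketch of that consistency check (the dump directory layout and file suffixes are my assumptions based on the names above; adjust the paths to your setup):

import glob
import numpy as np

# For every sample, the summed durations should equal the number of mel
# frames, which in turn should match the f0 and energy track lengths.
for dur_path in glob.glob("dump/train/durations/*-durations.npy"):
    utt_id = dur_path.split("/")[-1].replace("-durations.npy", "")
    durations = np.load(dur_path)
    mel = np.load(f"dump/train/norm-feats/{utt_id}-norm-feats.npy")
    f0 = np.load(f"dump/train/raw-f0/{utt_id}-raw-f0.npy")
    energy = np.load(f"dump/train/raw-energies/{utt_id}-raw-energy.npy")
    assert durations.sum() == mel.shape[0] == len(f0) == len(energy), utt_id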

My sound samples have a sample rate of 16kHz.

For extracting durations, I used the examples/mfa_extraction/ scripts and followed the steps in the README, running txt_grid_parser.py with --sample-rate 16000.

I used the following configuration for preprocessing:

# base_preprocess
###########################################################
#                FEATURE EXTRACTION SETTING               #
###########################################################
sampling_rate: 16000     # Sampling rate.
fft_size: 1024           # FFT size.
hop_size: 200            # Hop size. (fixed value, don't change)
win_length: null         # Window length.
                         # If set to null, it will be the same as fft_size.
window: "hann"           # Window function.
num_mels: 80             # Number of mel basis.
fmin: 80                 # Minimum freq in mel basis calculation.
fmax: 7600               # Maximum frequency in mel basis calculation.
global_gain_scale: 1.0   # Multiplied with the entire waveform.
trim_silence: false #true       # Whether to trim leading and trailing silence.
trim_threshold_in_db: 60 # Need to tune carefully if the recording is not good.
trim_frame_size: 2048    # Frame size in trimming.
trim_hop_size: 512       # Hop size in trimming.
format: "npy"            # Feature file format. Only "npy" is supported.
trim_mfa: true
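As a rough sanity check of this config, the number of extracted mel frames should be close to the waveform length divided by hop_size; this is my own sketch (not repo code), and the exact off-by-one depends on the STFT padding mode:

import numpy as np
import soundfile as sf

audio, sr = sf.read("sample.wav")       # hypothetical utterance
assert sr == 16000                      # sampling_rate from the config above
expected = len(audio) // 200            # hop_size: 200 -> 12.5 ms frame shift
mel = np.load("sample-norm-feats.npy")  # hypothetical feature file
print(expected, mel.shape[0])           # should agree to within a frame or two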

Then I ran the preprocessing and normalization steps, followed by fix_mismatch.

Then I tried training with the following configuration: (sorry for the wall of text)

# This is the hyperparameter configuration file for FastSpeech2 v1.
# Please make sure this is adjusted for the LJSpeech dataset. If you want to
# apply it to another dataset, you might need to carefully change some parameters.
# This configuration trains for 200k iters, but the best checkpoint is around 150k iters.

###########################################################
#                FEATURE EXTRACTION SETTING               #
###########################################################
hop_size: 200            # Hop size.
format: "npy"

###########################################################
#              NETWORK ARCHITECTURE SETTING               #
###########################################################
model_type: "fastspeech2"

fastspeech2_params:
    n_speakers: 1
    encoder_hidden_size: 384
    encoder_num_hidden_layers: 4
    encoder_num_attention_heads: 2
    encoder_attention_head_size: 192  # hidden_size // num_attention_heads
    encoder_intermediate_size: 1024
    encoder_intermediate_kernel_size: 3
    encoder_hidden_act: "mish"
    decoder_hidden_size: 384
    decoder_num_hidden_layers: 4
    decoder_num_attention_heads: 2
    decoder_attention_head_size: 192  # hidden_size // num_attention_heads
    decoder_intermediate_size: 1024
    decoder_intermediate_kernel_size: 3
    decoder_hidden_act: "mish"
    variant_prediction_num_conv_layers: 2
    variant_predictor_filter: 256
    variant_predictor_kernel_size: 3
    variant_predictor_dropout_rate: 0.5
    num_mels: 80
    hidden_dropout_prob: 0.2
    attention_probs_dropout_prob: 0.1
    max_position_embeddings: 2048
    initializer_range: 0.02
    output_attentions: False
    output_hidden_states: False

###########################################################
#                  DATA LOADER SETTING                    #
###########################################################
batch_size: 16              # Batch size for each GPU, assuming gradient_accumulation_steps == 1.
remove_short_samples: false  # Whether to remove samples whose length is less than batch_max_steps.
allow_cache: true           # Whether to allow caching in the dataset. If true, it requires CPU memory.
mel_length_threshold: 32    # Remove all targets whose mel_length <= 32.
is_shuffle: false            # Whether to shuffle the dataset after each epoch.
###########################################################
#             OPTIMIZER & SCHEDULER SETTING               #
###########################################################
optimizer_params:
    initial_learning_rate: 0.001
    end_learning_rate: 0.00005
    decay_steps: 150000          # A value < train_max_steps is recommended.
    warmup_proportion: 0.02
    weight_decay: 0.001

gradient_accumulation_steps: 1
var_train_expr: null  # Trainable variable expression (e.g. 'embeddings|encoder|decoder'),
                      # with names separated by |. If var_train_expr is null,
                      # all variables are trained.
###########################################################
#                    INTERVAL SETTING                     #
###########################################################
train_max_steps: 200000               # Number of training steps.
save_interval_steps: 5000             # Interval steps to save checkpoint.
eval_interval_steps: 500              # Interval steps to evaluate the network.
log_interval_steps: 200               # Interval steps to record the training log.
###########################################################
#                     OTHER SETTING                       #
###########################################################
num_save_intermediate_results: 1  # Number of batches to be saved as intermediate results.

I tried with remove_short_samples and trim_silence both on and off, with TensorFlow 2.4 and 2.3, and on GPU and CPU, but no luck. I also tried another dataset, a subset of LibriTTS; for this I changed the sample rate to 24000 and hop_size to 300. But I ran into the same Incompatible shapes: error.
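For what it's worth, both settings keep the frame shift constant, so the hop size / sample rate combination itself seems consistent; a quick arithmetic check:

# Frame shift in seconds under both configurations (my own check):
print(200 / 16000)  # 0.0125 s at 16 kHz
print(300 / 24000)  # 0.0125 s at 24 kHz (LibriTTS subset)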

Do you have any idea what could cause this?

dathudeptrai commented 3 years ago

@janbijster https://github.com/TensorSpeech/TensorFlowTTS/issues/512

janbijster commented 3 years ago

I think it has to do with the fact that I extracted durations with the mfa_extraction method.

After inspecting the samples, I suspect the error is caused by the input to the model, i.e. the ids denoting the characters/phonemes. The ids in ids.npy seem to correspond to the characters in the text line, while the durations seem to correspond to phonemes.

For example: the first utterance has 50 characters, which MFA converts to 31 phonemes. The ids.npy file (generated by tensorflow-tts-preprocess) contains 50 elements, while the durations.npy file (generated by MFA) contains 31 elements.
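A direct check for this mismatch is to compare the two files per utterance (again a sketch; the dump layout is my assumption):

import glob
import numpy as np

# FastSpeech2 upsamples per-token embeddings by their durations, so the ids
# (model input) and the durations must have the same number of elements.
for ids_path in glob.glob("dump/train/ids/*-ids.npy"):
    utt_id = ids_path.split("/")[-1].replace("-ids.npy", "")
    ids = np.load(ids_path)
    durations = np.load(f"dump/train/durations/{utt_id}-durations.npy")
    if len(ids) != len(durations):
        print(utt_id, len(ids), len(durations))  # e.g. 50 vs. 31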

janbijster commented 3 years ago

I can now confirm this was the cause: I extracted durations using a pretrained Tacotron 2 model instead, and these had the right number of elements.