Libritts 100 speakers fail to train

shachar-ug commented 3 years ago

After changing the code to train on 100 speakers, training fails, seemingly because of the attention mask.

The error seems to be in fastspeech.

In fastspeech.py, line 413, I printed some debugging shapes and got:

                masked_layer_output = layer_output * tf.cast(
                    tf.expand_dims(attention_mask, 2), dtype=layer_output.dtype
                )
            else:
                print("attention_mask=", attention_mask)
                e = tf.expand_dims(attention_mask, 2)
                print("e expanded shape=", e.shape)
                e = tf.cast(e, dtype=layer_output.dtype)
                print("e=", e)
                masked_layer_output = layer_output * e

This prints the following shape for the expanded mask: shape=(None, None, 1)
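A note on reading that shape: the traceback goes through def_function.py, so the print appears to run while the graph is being traced, where None marks a dimension (here batch and time) that is only known at run time. Under the usual broadcasting rules, a mask of that static shape is compatible with a [batch, time, hidden] layer output, so this print alone may not expose the mismatch. A minimal pure-Python sketch of the static check follows; `broadcast_static` is a hypothetical helper and the hidden size 384 is an assumed value, neither taken from TensorFlowTTS.

```python
from itertools import zip_longest


def broadcast_static(shape_a, shape_b):
    """Broadcast two static shapes, where None means 'unknown until run time'.

    Raises ValueError when two known dimensions can never broadcast.
    Hypothetical helper for illustration; not part of TensorFlowTTS.
    """
    result = []
    # Align from the trailing dimension, padding the shorter shape with 1s.
    for a, b in zip_longest(reversed(shape_a), reversed(shape_b), fillvalue=1):
        if a == 1:
            result.append(b)
        elif b == 1:
            result.append(a)
        elif a is None or b is None:
            # Unknown at trace time: keep whichever side is known, if any.
            result.append(a if b is None else b)
        elif a == b:
            result.append(a)
        else:
            raise ValueError(f"cannot broadcast {shape_a} with {shape_b}")
    return tuple(reversed(result))


# Static shapes as seen while tracing (384 is an assumed hidden size).
print(broadcast_static((None, None, 1), (None, None, 384)))
# → (None, None, 384)
```

With concrete sizes the same rule can fail: `broadcast_static((2, 5, 384), (2, 7, 384))` raises ValueError because the time dimensions 5 and 7 differ. That is the kind of mismatch that only shows up once real batches flow through the graph.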

The data was preprocessed with the Libri experiment ipynb + MFA (without Tacotron2).

The error output:

Consider either turning off auto-sharding or switching the auto_shard_policy to DATA to shard this dataset. You can do this by creating a new `tf.data.Options()` object then setting `options.experimental_distribute.auto_shard_policy = AutoShardPolicy.DATA` before applying the options object to the dataset via `dataset.with_options(options)`.
[train]:   0%|                                                                                                                                                          | 0/150000 [00:00<?, ?it/s]
2021-09-16 17:45:06.346627: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:185] None of the MLIR Optimization Passes are enabled (registered 2)
2021-09-16 17:45:16.375194: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:175] Filling up shuffle buffer (this may take a while): 1472 of 10804
2021-09-16 17:45:26.370073: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:175] Filling up shuffle buffer (this may take a while): 2927 of 10804
2021-09-16 17:45:36.369021: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:175] Filling up shuffle buffer (this may take a while): 4386 of 10804
2021-09-16 17:45:46.377340: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:175] Filling up shuffle buffer (this may take a while): 5814 of 10804
2021-09-16 17:45:56.377237: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:175] Filling up shuffle buffer (this may take a while): 7303 of 10804
2021-09-16 17:46:06.394587: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:175] Filling up shuffle buffer (this may take a while): 8764 of 10804
2021-09-16 17:46:16.367909: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:175] Filling up shuffle buffer (this may take a while): 10253 of 10804
2021-09-16 17:46:20.175820: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:228] Shuffle buffer filled.
2021-09-16 17:46:46.571537: W tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:2025] No (suitable) GPUs detected, skipping auto_mixed_precision_cuda graph optimizer
2021-09-16 17:46:48.050613: W tensorflow/core/grappler/costs/op_level_cost_estimator.cc:689] Error in PredictCost() for the op: op: "Softmax" attr { key: "T" value { type: DT_FLOAT } } inputs { dtype: DT_FLOAT shape { unknown_rank: true } } device { type: "GPU" vendor: "NVIDIA" model: "Tesla K80" frequency: 823 num_cores: 13 environment { key: "architecture" value: "3.7" } environment { key: "cuda" value: "11020" } environment { key: "cudnn" value: "8100" } num_registers: 131072 l1_cache_size: 16384 l2_cache_size: 1572864 shared_memory_size_per_multiprocessor: 114688 memory_size: 11326390272 bandwidth: 240480000 } outputs { dtype: DT_FLOAT shape { unknown_rank: true } }
2021-09-16 17:46:48.054867: W tensorflow/core/grappler/costs/op_level_cost_estimator.cc:689] Error in PredictCost() for the op: op: "Softmax" attr { key: "T" value { type: DT_FLOAT } } inputs { dtype: DT_FLOAT shape { unknown_rank: true } } device { type: "GPU" vendor: "NVIDIA" model: "Tesla K80" frequency: 823 num_cores: 13 environment { key: "architecture" value: "3.7" } environment { key: "cuda" value: "11020" } environment { key: "cudnn" value: "8100" } num_registers: 131072 l1_cache_size: 16384 l2_cache_size: 1572864 shared_memory_size_per_multiprocessor: 114688 memory_size: 11326390272 bandwidth: 240480000 } outputs { dtype: DT_FLOAT shape { unknown_rank: true } }
2021-09-16 17:46:48.058660: W tensorflow/core/grappler/costs/op_level_cost_estimator.cc:689] Error in PredictCost() for the op: op: "Softmax" attr { key: "T" value { type: DT_FLOAT } } inputs { dtype: DT_FLOAT shape { unknown_rank: true } } device { type: "GPU" vendor: "NVIDIA" model: "Tesla K80" frequency: 823 num_cores: 13 environment { key: "architecture" value: "3.7" } environment { key: "cuda" value: "11020" } environment { key: "cudnn" value: "8100" } num_registers: 131072 l1_cache_size: 16384 l2_cache_size: 1572864 shared_memory_size_per_multiprocessor: 114688 memory_size: 11326390272 bandwidth: 240480000 } outputs { dtype: DT_FLOAT shape { unknown_rank: true } }
2021-09-16 17:46:48.062951: W tensorflow/core/grappler/costs/op_level_cost_estimator.cc:689] Error in PredictCost() for the op: op: "Softmax" attr { key: "T" value { type: DT_FLOAT } } inputs { dtype: DT_FLOAT shape { unknown_rank: true } } device { type: "GPU" vendor: "NVIDIA" model: "Tesla K80" frequency: 823 num_cores: 13 environment { key: "architecture" value: "3.7" } environment { key: "cuda" value: "11020" } environment { key: "cudnn" value: "8100" } num_registers: 131072 l1_cache_size: 16384 l2_cache_size: 1572864 shared_memory_size_per_multiprocessor: 114688 memory_size: 11326390272 bandwidth: 240480000 } outputs { dtype: DT_FLOAT shape { unknown_rank: true } }
2021-09-16 17:46:49.421675: W tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:2025] No (suitable) GPUs detected, skipping auto_mixed_precision_cuda graph optimizer
2021-09-16 17:46:51.646446: W tensorflow/core/framework/op_kernel.cc:1680] Invalid argument: required broadcastable shapes
Traceback (most recent call last):

  File "ttsexamples/fastspeech2_libritts/train_fastspeech2.py", line 490, in <module>
    main()
  File "ttsexamples/fastspeech2_libritts/train_fastspeech2.py", line 482, in main
    resume=args.resume,
  File "/home/jupyter/TensorFlowTTS/tensorflow_tts/trainers/base_trainer.py", line 1010, in fit
    self.run()
  File "/home/jupyter/TensorFlowTTS/tensorflow_tts/trainers/base_trainer.py", line 104, in run
    self._train_epoch()
  File "/home/jupyter/TensorFlowTTS/tensorflow_tts/trainers/base_trainer.py", line 126, in _train_epoch
    self._train_step(batch)
  File "/home/jupyter/TensorFlowTTS/tensorflow_tts/trainers/base_trainer.py", line 782, in _train_step
    self.one_step_forward(batch)
  File "/opt/conda/envs/tensortts/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 885, in __call__
    result = self._call(*args, **kwds)
  File "/opt/conda/envs/tensortts/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 950, in _call
    return self._stateless_fn(*args, **kwds)
  File "/opt/conda/envs/tensortts/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 3040, in __call__
    filtered_flat_args, captured_inputs=graph_function.captured_inputs)  # pylint: disable=protected-access
  File "/opt/conda/envs/tensortts/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 1964, in _call_flat
    ctx, args, cancellation_manager=cancellation_manager))
  File "/opt/conda/envs/tensortts/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 596, in call
    ctx=ctx)
  File "/opt/conda/envs/tensortts/lib/python3.7/site-packages/tensorflow/python/eager/execute.py", line 60, in quick_execute
    inputs, attrs, num_outputs)
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
  (0) Invalid argument:  required broadcastable shapes
         [[node tf_fast_speech2/add_1 (defined at /home/jupyter/TensorFlowTTS/tensorflow_tts/models/fastspeech2.py:185) ]]
         [[tf_fast_speech2/length_regulator/while/LoopCond/_92/_136]]
  (1) Invalid argument:  required broadcastable shapes
         [[node tf_fast_speech2/add_1 (defined at /home/jupyter/TensorFlowTTS/tensorflow_tts/models/fastspeech2.py:185) ]]
0 successful operations.
0 derived errors ignored. [Op:__inference__one_step_forward_33805]

Errors may have originated from an input operation.
Input Source operations connected to node tf_fast_speech2/add_1:
 tf_fast_speech2/encoder/layer_._3/mul (defined at /home/jupyter/TensorFlowTTS/tensorflow_tts/models/fastspeech.py:413)

Input Source operations connected to node tf_fast_speech2/add_1:
 tf_fast_speech2/encoder/layer_._3/mul (defined at /home/jupyter/TensorFlowTTS/tensorflow_tts/models/fastspeech.py:413)

Function call stack:
_one_step_forward -> _one_step_forward

[train]:   0%|                                                                                                                                                          | 0/150000 [01:46<?, ?it/s]
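The log reports "required broadcastable shapes" at the add on fastspeech2.py:185, but it does not show which concrete sizes collided. One way to surface them is to assert on the dynamic shapes right before the addition. The following is only a debugging sketch under stated assumptions: `add_checked` is a hypothetical function, the hidden size 4 in the input signature is arbitrary, and none of this is TensorFlowTTS code or a confirmed fix for the issue.

```python
import tensorflow as tf


@tf.function(
    input_signature=[
        tf.TensorSpec([None, None, 4], tf.float32),
        tf.TensorSpec([None, None, 4], tf.float32),
    ]
)
def add_checked(a, b):
    """Add two [batch, time, hidden] tensors after checking their time axes.

    Hypothetical debugging helper; hidden size 4 is an arbitrary assumption.
    """
    # tf.shape returns the run-time shape, unlike a.shape, which is
    # (None, None, 4) here while the function is being traced.
    check = tf.debugging.assert_equal(
        tf.shape(a)[1],
        tf.shape(b)[1],
        message="time dimensions differ before add",
    )
    # Make the addition wait for the check, so the assertion fails first.
    with tf.control_dependencies([check]):
        return a + b
```

Calling `add_checked(tf.zeros([2, 5, 4]), tf.zeros([2, 5, 4]))` returns a tensor of shape [2, 5, 4], while passing time lengths 5 and 7 raises `tf.errors.InvalidArgumentError`, and the assertion message should include the two offending sizes. Placing a similar check just before the failing add in the model could help narrow down which input has the unexpected length.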

Zhang-Nian commented 3 years ago

Two months ago this code worked fine on the LibriTTS corpus, but now I have the same problem, and I don't know what changed!

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

Tian14267 commented 2 years ago

Did you ever solve this problem? @Zhang-Nian @shachar-ug

qq492947833 commented 1 year ago

I have the same problem: the model can be built, but calling .fit fails with the same error. Has anybody been able to fix it?

TensorSpeech / TensorFlowTTS

Libritts 100 speakers fail to train #672