TensorSpeech / TensorFlowTTS

:stuck_out_tongue_closed_eyes: TensorFlowTTS: Real-Time State-of-the-Art Speech Synthesis for TensorFlow 2 (supports English, French, Korean, Chinese, and German, and is easy to adapt to other languages)
https://tensorspeech.github.io/TensorFlowTTS/
Apache License 2.0

Fastspeech 2 phoneme 'Invalid argument: Incompatible shapes' during training #685

Closed: psalajka closed this issue 2 years ago

psalajka commented 2 years ago

Hello, I'm trying to train a phoneme-based FastSpeech2 model. Our dataset has the same structure (and settings) as the LJ Speech dataset, so I use its configs. We generate our own durations, so I triple-checked everything.

This issue is probably closely related to https://github.com/TensorSpeech/TensorFlowTTS/issues/518 and https://github.com/TensorSpeech/TensorFlowTTS/issues/512, so I have already tried everything mentioned there.

Checking data with

for data in tqdm(train_dataset):
    # debug 1 step forward
    try:
        outputs = fastspeech2(**data)
    except Exception:
        print(data["utt_ids"])

printed no errors (nor for the validation dataset). I inserted that code at examples/fastspeech2/train_fastspeech2.py:400 (just before # start training).

My main problem is that training runs for many steps before it dies (3627 steps last time), which means several epochs completed successfully. I attached details of the error in error.txt. Please note that some line numbers differ slightly (due to my debug prints), e.g. in

tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
  (0) Invalid argument:  Incompatible shapes: [16,109,384] vs. [16,107,384]
         [[node tf_fast_speech2/add_1 (defined at ./tensorflow_tts/models/fastspeech2.py:188) ]]
         [[tf_fast_speech2/length_regulator/while/loop_body_control/_117/_135]]
  (1) Invalid argument:  Incompatible shapes: [16,109,384] vs. [16,107,384]
         [[node tf_fast_speech2/add_1 (defined at ./tensorflow_tts/models/fastspeech2.py:188) ]]

(Line 188 in the trace corresponds to 185 in the repo.) The shape mismatch is always very small, usually 2 frames, as above.

I have already identified the failing line:

last_encoder_hidden_states += f0_embedding + energy_embedding

The last_encoder_hidden_states shape is different from the f0_embedding and energy_embedding shapes (which match each other).
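That mismatch is exactly what the trace reports: element-wise addition requires the sequence dimensions to match (or be 1). A tiny NumPy analogue of the failing shapes, purely for illustration:

```python
import numpy as np

# shapes taken from the error message above
hidden = np.zeros((16, 109, 384))   # last_encoder_hidden_states
f0_emb = np.zeros((16, 107, 384))   # f0_embedding, padded to a different length

try:
    hidden + f0_emb
except ValueError as err:
    print(err)  # operands could not be broadcast together ...
```

NumPy refuses the addition for the same reason TF does: 109 vs. 107 in the sequence axis cannot be broadcast.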

Do you have any idea what could be wrong? I'm running out of ideas... Thanks!
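Since the failing add combines the phoneme-length encoder output with the F0/energy embeddings, a common culprit with self-generated durations is a per-utterance mismatch between the phoneme count, the duration count, and the mel frame count. A minimal offline sanity check, a sketch only (the function name and array layout are my assumptions, not from the repo):

```python
import numpy as np

def check_sample(ids: np.ndarray, durations: np.ndarray, mel: np.ndarray) -> list:
    """Return a list of consistency problems found for one utterance.

    ids       -- phoneme id sequence, shape (num_phonemes,)
    durations -- per-phoneme durations in frames, shape (num_phonemes,)
    mel       -- mel spectrogram, shape (num_frames, num_mels)
    """
    problems = []
    if len(ids) != len(durations):
        problems.append(f"{len(ids)} phonemes vs {len(durations)} durations")
    if durations.sum() != mel.shape[0]:
        problems.append(f"durations sum to {durations.sum()} but mel has {mel.shape[0]} frames")
    return problems
```

Running a check like this over every dumped utterance before training should flag the same samples a forward-pass loop finds, without waiting for a bad batch to come up.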

psalajka commented 2 years ago

Update: the following debugging loop

while True:
    for data in tqdm(train_dataset):
        # debug 1 step forward
        try:
            outputs = fastspeech2(**data)
        except Exception:
            print(data["utt_ids"])

    for data in tqdm(valid_dataset):
        # debug 1 step forward
        try:
            outputs = fastspeech2(**data)
        except Exception:
            print(data["utt_ids"])

finally produced some output after many iterations. I identified 3 samples that appeared at least twice; removing them improved training stability, and I managed to run more than 6000 steps.

Unfortunately, I still don't understand what's happening. Now I'm trying to fix the seed to achieve reproducibility.
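For anyone attempting the same, a sketch of pinning the usual RNG sources (the helper name is mine, not from the repo; note that tf.data pipelines also need an explicit `seed` argument passed to `shuffle()` to be fully deterministic):

```python
import os
import random

import numpy as np

def set_global_seed(seed: int = 1234) -> None:
    """Pin every RNG source we can reach so a failing batch recurs at the same step."""
    os.environ["PYTHONHASHSEED"] = str(seed)  # hash randomization (only effective if set before interpreter startup)
    random.seed(seed)
    np.random.seed(seed)
    try:
        import tensorflow as tf  # only if TensorFlow is available
        tf.random.set_seed(seed)
    except ImportError:
        pass
```

Call it once at the top of the training script, before any dataset or model construction.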

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.