espnet / espnet

End-to-End Speech Processing Toolkit
https://espnet.github.io/espnet/
Apache License 2.0

Errors when finetuning CFS2 + HifiGan #4331

Closed by alexbernier 2 years ago

alexbernier commented 2 years ago

Hi,

I trained my own Tacotron, Conformer FastSpeech 2 (CFS2) and HiFiGAN models, and now I would like to finetune CFS2 + HiFiGAN jointly.

When following the instructions in egs2/TEMPLATE/tts1/README.md, I get the following error:

$ ./run2.sh --stage 6 --stop-stage 6 --tts_task gan_tts --train_config ./conf/tuning/finetune_joint_conformer_fastspeech2_hifigan.yaml
...
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/data/espnet-jets/espnet2/bin/gan_tts_train.py", line 22, in <module>
    main()
  File "/data/espnet-jets/espnet2/bin/gan_tts_train.py", line 18, in main
    GANTTSTask.main(cmd=cmd)
  File "/data/espnet-jets/espnet2/tasks/abs_task.py", line 1019, in main
    cls.main_worker(args)
  File "/data/espnet-jets/espnet2/tasks/abs_task.py", line 1121, in main_worker
    model = cls.build_model(args=args)
  File "/data/espnet-jets/espnet2/tasks/gan_tts.py", line 334, in build_model
    pitch_normalize = pitch_normalize_class(
TypeError: __init__() missing 1 required positional argument: 'stats_file'
# Accounting: time=5 threads=1
# Ended (code 1) at Sun May  1 09:12:49 UTC 2022, elapsed time 5 seconds

So I tried:

$ ./run2.sh --stage 6 --stop-stage 6 --tts_task gan_tts --train_config ./conf/tuning/finetune_joint_conformer_fastspeech2_hifigan.yaml --tts_stats_dir exp/tts_train_tacotron2_raw_phn_none/decode_use_teacher_forcingtrue_106epoch/stats 
...
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/data/espnet-jets/espnet2/bin/gan_tts_train.py", line 22, in <module>
    main()
  File "/data/espnet-jets/espnet2/bin/gan_tts_train.py", line 18, in main
    GANTTSTask.main(cmd=cmd)
  File "/data/espnet-jets/espnet2/tasks/abs_task.py", line 1019, in main
    cls.main_worker(args)
  File "/data/espnet-jets/espnet2/tasks/abs_task.py", line 1229, in main_worker
    load_pretrained_model(
  File "/data/espnet-jets/espnet2/torch_utils/load_pretrained_model.py", line 117, in load_pretrained_model
    obj.load_state_dict(dst_state)
  File "/home/alex/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1482, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for FastSpeech2:
    Unexpected key(s) in state_dict: "encoder.after_norm.weight", "encoder.after_norm.bias", "decoder.after_norm.weight", "decoder.after_norm.bias". 

What did I do wrong?

Basic environments:

kan-bayashi commented 2 years ago

The hyperparameters of your pretrained model do not match the joint model configuration. Change the following parameters so that they match the ones used to train your pretrained model: https://github.com/espnet/espnet/blob/b757b89d45d5574cebf44e225cbe32e3e9e4f522/egs2/ljspeech/tts1/conf/tuning/train_joint_conformer_fastspeech2_hifigan.yaml#L33-L34
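To see this kind of mismatch for yourself, it can help to compare the parameter names stored in the checkpoint with the names the newly built model expects. The snippet below is only a minimal sketch in plain Python, not ESPnet code: `diff_state_dict_keys` is a hypothetical helper, and the key `encoder.embed.0.weight` is a made-up placeholder for the parameters both sides share. The four `after_norm` keys are the ones from the error message above.

```python
def diff_state_dict_keys(checkpoint_keys, model_keys):
    """Return (unexpected, missing) key lists.

    unexpected: keys in the checkpoint that the model does not have
    missing:    keys the model expects that the checkpoint lacks
    """
    ckpt, model = set(checkpoint_keys), set(model_keys)
    return sorted(ckpt - model), sorted(model - ckpt)


# Keys of the pretrained checkpoint: the four keys reported in the error,
# plus one hypothetical shared key for illustration.
checkpoint_keys = [
    "encoder.embed.0.weight",  # placeholder, not taken from the log
    "encoder.after_norm.weight",
    "encoder.after_norm.bias",
    "decoder.after_norm.weight",
    "decoder.after_norm.bias",
]

# Keys of the model built from the joint config (assumed to lack the
# after_norm layers, which is what the error message implies).
model_keys = ["encoder.embed.0.weight"]

unexpected, missing = diff_state_dict_keys(checkpoint_keys, model_keys)
print("Unexpected key(s) in checkpoint:", unexpected)
print("Missing key(s) in checkpoint:", missing)
```

With real files, the two key lists would typically come from `torch.load(path, map_location="cpu").keys()` for the checkpoint and `model.state_dict().keys()` for the freshly built model, assuming the checkpoint file holds a plain state dict. Keys that appear on only one side point to layers that exist in one configuration but not in the other, which tells you which settings in the YAML need to be aligned.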