[Bug] ValueError when running inference with trained OverFlow model

Ca-ressemble-a-du-fake commented 1 year ago

Describe the bug

Hi,

I was able to train an OverFlow model from scratch on my own dataset (22050 Hz samples). But when I try to check its output with the following command, I get a ValueError:

tts --text "Bonjour les amis" --model_path /home/caraduf/Models/Overflow/Test_Overflow_22kHz-January-20-2023_06+19PM-0000000/checkpoint_1500.pth --config_path /home/caraduf/Models/Overflow/Test_Overflow_22kHz-January-20-2023_06+19PM-0000000/config.json --vocoder_name vocoder_models/en/ljspeech/hifigan_v2 --out_path test_own_overflow.wav

 > vocoder_models/en/ljspeech/hifigan_v2 is already downloaded.
Traceback (most recent call last):
  File "/home/CoquiTTS/coquienv/bin/tts", line 8, in <module>
    sys.exit(main())
  File "/home/CoquiTTS/TTS/TTS/bin/synthesize.py", line 316, in main
    synthesizer = Synthesizer(
  File "/home/CoquiTTS/TTS/TTS/utils/synthesizer.py", line 75, in __init__
    self._load_tts(tts_checkpoint, tts_config_path, use_cuda)
  File "/home/CoquiTTS/TTS/TTS/utils/synthesizer.py", line 108, in _load_tts
    self.tts_config = load_config(tts_config_path)
  File "/home/CoquiTTS/TTS/TTS/config/__init__.py", line 96, in load_config
    config.from_dict(config_dict)
  File "/home/CoquiTTS/coquienv/lib/python3.10/site-packages/coqpit/coqpit.py", line 694, in from_dict
    self = self.deserialize(data)  # pylint: disable=self-cls-assignment
  File "/home/CoquiTTS/coquienv/lib/python3.10/site-packages/coqpit/coqpit.py", line 412, in deserialize
    value = _deserialize(value, field.type)
  File "/home/CoquiTTS/coquienv/lib/python3.10/site-packages/coqpit/coqpit.py", line 284, in _deserialize
    return _deserialize_list(x, field_type)
  File "/home/CoquiTTS/coquienv/lib/python3.10/site-packages/coqpit/coqpit.py", line 221, in _deserialize_list
    return [_deserialize(xi, field_arg) for xi in x]
  File "/home/CoquiTTS/coquienv/lib/python3.10/site-packages/coqpit/coqpit.py", line 221, in <listcomp>
    return [_deserialize(xi, field_arg) for xi in x]
  File "/home/CoquiTTS/coquienv/lib/python3.10/site-packages/coqpit/coqpit.py", line 288, in _deserialize
    return field_type.deserialize_immutable(x)
  File "/home/CoquiTTS/coquienv/lib/python3.10/site-packages/coqpit/coqpit.py", line 426, in deserialize_immutable
    raise ValueError()
ValueError

I previously tested it with a checkpoint at 150k steps trained on 16 kHz samples and got the same ValueError during inference.

Here is the config.json:

{
    "output_path": "/home/caraduf/Models/Overflow",
    "logger_uri": null,
    "run_name": "Test_Overflow_22kHz",
    "project_name": null,
    "run_description": "\ud83d\udc38Coqui trainer run.",
    "print_step": 1,
    "plot_step": 1,
    "model_param_stats": false,
    "wandb_entity": null,
    "dashboard_logger": "tensorboard",
    "log_model_step": null,
    "save_step": 500,
    "save_n_checkpoints": 5,
    "save_checkpoints": true,
    "save_all_best": false,
    "save_best_after": 10000,
    "target_loss": null,
    "print_eval": true,
    "test_delay_epochs": -1,
    "run_eval": true,
    "run_eval_steps": 100,
    "distributed_backend": "nccl",
    "distributed_url": "tcp://localhost:54321",
    "mixed_precision": true,
    "epochs": 20001,
    "batch_size": 32,
    "eval_batch_size": 16,
    "grad_clip": 40000.0,
    "scheduler_after_epoch": true,
    "lr": 0.001,
    "optimizer": "Adam",
    "optimizer_params": {
        "weight_decay": 1e-06
    },
    "lr_scheduler": null,
    "lr_scheduler_params": {},
    "use_grad_scaler": false,
    "cudnn_enable": true,
    "cudnn_deterministic": false,
    "cudnn_benchmark": false,
    "training_seed": 54321,
    "model": "Overflow",
    "num_loader_workers": 4,
    "num_eval_loader_workers": 2,
    "use_noise_augment": false,
    "audio": {
        "fft_size": 1024,
        "win_length": 1024,
        "hop_length": 256,
        "frame_shift_ms": null,
        "frame_length_ms": null,
        "stft_pad_mode": "reflect",
        "sample_rate": 22050,
        "resample": false,
        "preemphasis": 0.0,
        "ref_level_db": 20,
        "do_sound_norm": false,
        "log_func": "np.log",
        "do_trim_silence": true,
        "trim_db": 60.0,
        "do_rms_norm": false,
        "db_level": null,
        "power": 1.5,
        "griffin_lim_iters": 60,
        "num_mels": 80,
        "mel_fmin": 0.0,
        "mel_fmax": 8000,
        "spec_gain": 1.0,
        "do_amp_to_db_linear": true,
        "do_amp_to_db_mel": true,
        "pitch_fmax": 640.0,
        "pitch_fmin": 1.0,
        "signal_norm": false,
        "min_level_db": -100,
        "symmetric_norm": true,
        "max_norm": 4.0,
        "clip_norm": true,
        "stats_path": null
    },
    "use_phonemes": true,
    "phonemizer": "espeak",
    "phoneme_language": "fr-fr",
    "compute_input_seq_cache": false,
    "text_cleaner": "multilingual_cleaners",
    "enable_eos_bos_chars": false,
    "test_sentences_file": "",
    "phoneme_cache_path": "/home/caraduf/Models/Overflow/Test_Overflow_22kHz-January-20-2023_06+19PM-0000000/phoneme_cache",
    "characters": {
        "characters_class": "TTS.tts.utils.text.characters.IPAPhonemes",
        "vocab_dict": null,
        "pad": "<PAD>",
        "eos": "<EOS>",
        "bos": "<BOS>",
        "blank": "<BLNK>",
        "characters": "iy\u0268\u0289\u026fu\u026a\u028f\u028ae\u00f8\u0258\u0259\u0275\u0264o\u025b\u0153\u025c\u025e\u028c\u0254\u00e6\u0250a\u0276\u0251\u0252\u1d7b\u0298\u0253\u01c0\u0257\u01c3\u0284\u01c2\u0260\u01c1\u029bpbtd\u0288\u0256c\u025fk\u0261q\u0262\u0294\u0274\u014b\u0272\u0273n\u0271m\u0299r\u0280\u2c71\u027e\u027d\u0278\u03b2fv\u03b8\u00f0sz\u0283\u0292\u0282\u0290\u00e7\u029dx\u0263\u03c7\u0281\u0127\u0295h\u0266\u026c\u026e\u028b\u0279\u027bj\u0270l\u026d\u028e\u029f\u02c8\u02cc\u02d0\u02d1\u028dw\u0265\u029c\u02a2\u02a1\u0255\u0291\u027a\u0267\u02b2\u025a\u02de\u026b",
        "punctuations": "!'(),-.:;? ",
        "phonemes": null,
        "is_unique": false,
        "is_sorted": true
    },
    "add_blank": false,
    "batch_group_size": 0,
    "loss_masking": null,
    "min_audio_len": 512,
    "max_audio_len": 200000,
    "min_text_len": 10,
    "max_text_len": 500,
    "compute_f0": false,
    "compute_linear_spec": false,
    "precompute_num_workers": 4,
    "start_by_longest": true,
    "shuffle": false,
    "drop_last": false,
    "datasets": [
        [
            {
                "formatter": "ljspeech",
                "dataset_name": "Own_1",
                "path": "/home/caraduf/Datasets/22kHz/Own_1_22.05kHz_dataset",
                "meta_file_train": "metadata.csv",
                "ignored_speakers": null,
                "language": "fr-fr",
                "meta_file_val": "",
                "meta_file_attn_mask": ""
            },
            {
                "formatter": "ljspeech",
                "dataset_name": "Own_2",
                "path": "/home/caraduf/Datasets/22kHz/Own_2_22.05kHz_dataset",
                "meta_file_train": "metadata.csv",
                "ignored_speakers": null,
                "language": "fr-fr",
                "meta_file_val": "",
                "meta_file_attn_mask": ""
            },
            {
                "formatter": "ljspeech",
                "dataset_name": "Own_3",
                "path": "/home/caraduf/Datasets/22kHz/Own_2_22.05kHz_dataset",
                "meta_file_train": "metadata.csv",
                "ignored_speakers": null,
                "language": "fr-fr",
                "meta_file_val": "",
                "meta_file_attn_mask": ""
            }
        ]
    ],
    "test_sentences": [
        "Il m'a fallu du temps pour obtenir cette voix, alors je ne vais pas me taire!",
        "Salut c'est l'\u00e9t\u00e9, on va s'\u00e9clater",
        "Mais son age rendait cette derni\u00e8re qualit\u00e9 plus saillante!"
    ],
    "eval_split_max_size": null,
    "eval_split_size": 0.01,
    "use_speaker_weighted_sampler": false,
    "speaker_weighted_sampler_alpha": 1.0,
    "use_language_weighted_sampler": false,
    "language_weighted_sampler_alpha": 1.0,
    "use_length_weighted_sampler": false,
    "length_weighted_sampler_alpha": 1.0,
    "force_generate_statistics": false,
    "mel_statistics_parameter_path": "/home/caraduf/Models/Overflow/Test_Overflow_22kHz-January-20-2023_06+19PM-0000000/stat_parameters.pt",
    "num_chars": 131,
    "state_per_phone": 2,
    "encoder_in_out_features": 512,
    "encoder_n_convolutions": 3,
    "out_channels": 80,
    "ar_order": 1,
    "sampling_temp": 0.334,
    "deterministic_transition": true,
    "duration_threshold": 0.55,
    "use_grad_checkpointing": true,
    "max_sampling_time": 1000,
    "prenet_type": "original",
    "prenet_dim": 256,
    "prenet_n_layers": 2,
    "prenet_dropout": 0.5,
    "prenet_dropout_at_inference": false,
    "memory_rnn_dim": 1024,
    "outputnet_size": [
        1024
    ],
    "flat_start_params": {
        "mean": 0.0,
        "std": 1.0,
        "transition_p": 0.14
    },
    "std_floor": 0.01,
    "hidden_channels_dec": 150,
    "kernel_size_dec": 5,
    "dilation_rate": 1,
    "num_flow_blocks_dec": 12,
    "num_block_layers": 4,
    "dropout_p_dec": 0.05,
    "num_splits": 4,
    "num_squeeze": 2,
    "sigmoid_scale": false,
    "c_in_channels": 0,
    "r": 1,
    "use_d_vector_file": false,
    "use_speaker_embedding": false,
    "github_branch": "inside_docker"
}

If I try the default command below, I get the wav output as expected:

tts --text "Hello world!" --model_name tts_models/en/ljspeech/overflow --vocoder_name vocoder_models/en/ljspeech/hifigan_v2 --out_path output.wav

To Reproduce

1. Train an OverFlow model with the provided recipe.

2. Wait for a checkpoint to be written.

3. Run inference on that checkpoint with tts --text "Bonjour les amis" --model_path /home/caraduf/Models/Overflow/Test_Overflow_22kHz-January-20-2023_06+19PM-0000000/checkpoint_1500.pth --config_path /home/caraduf/Models/Overflow/Test_Overflow_22kHz-January-20-2023_06+19PM-0000000/config.json --vocoder_name vocoder_models/en/ljspeech/hifigan_v2 --out_path test_own_overflow.wav

4. A ValueError appears and no wav is written to disk.

Expected behavior

A wav file should be written to disk.

Logs

No response

Environment

{
    "CUDA": {
        "GPU": [
            "NVIDIA GeForce RTX 3090"
        ],
        "available": true,
        "version": "11.7"
    },
    "Packages": {
        "PyTorch_debug": false,
        "PyTorch_version": "1.13.0+cu117",
        "TTS": "0.10.0",
        "numpy": "1.22.4"
    },
    "System": {
        "OS": "Linux",
        "architecture": [
            "64bit",
            "ELF"
        ],
        "processor": "x86_64",
        "python": "3.10.6",
        "version": "#64-Ubuntu SMP Thu Jan 5 11:43:13 UTC 2023"
    }
}

Additional context

No response

Ca-ressemble-a-du-fake commented 1 year ago

Printing the data object in coqpit's deserialize_immutable method shows that a datasets entry is a list instead of a dict. Looking carefully at the generated config.json, it contains "datasets": [ [ {...}, {...} ] ] instead of "datasets": [ {...}, {...} ]. Manually removing the extra [] solves the problem.

When training VITS, these extra [] do not appear, so they are only generated when training OverFlow. The culprit is therefore probably the code that generates the JSON from the train_overflow.py recipe.
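
For anyone who would rather not edit the file by hand, the manual fix described above can be sketched in a few lines of Python. This is only a minimal sketch using the standard json module: the helper names flatten_datasets and fix_config_file are hypothetical (they are not part of TTS or coqpit), and it assumes the nesting is exactly one level deep, as in the config.json posted above.

```python
import json
from pathlib import Path


def flatten_datasets(config: dict) -> dict:
    """Return a copy of config whose "datasets" list is flattened by one level.

    Hypothetical helper: entries that are themselves lists are spliced
    into the outer list; dict entries are kept as they are.
    """
    flat = []
    for entry in config.get("datasets", []):
        if isinstance(entry, list):
            flat.extend(entry)  # [[{...}, {...}]] -> [{...}, {...}]
        else:
            flat.append(entry)
    fixed = dict(config)  # shallow copy, the input dict is left untouched
    fixed["datasets"] = flat
    return fixed


def fix_config_file(src: str, dst: str) -> None:
    """Read the JSON config at src and write the flattened version to dst."""
    config = json.loads(Path(src).read_text(encoding="utf-8"))
    Path(dst).write_text(
        json.dumps(flatten_datasets(config), indent=4), encoding="utf-8"
    )


if __name__ == "__main__":
    # Same shape as the generated config.json, reduced to the relevant field.
    nested = {"datasets": [[{"dataset_name": "Own_1"}, {"dataset_name": "Own_2"}]]}
    print(flatten_datasets(nested)["datasets"])
```

Writing the result to a separate file (for example a config_fixed.json next to the original, then passing that path to --config_path) keeps the generated config.json intact in case the flattening is not what is needed.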

erogol commented 1 year ago

@shivammehta25 could you check this one?

shivammehta25 commented 1 year ago

Sure! You can assign it to me, I will take a look at it as soon as I can.

shivammehta25 commented 1 year ago

Hi! When training with a single dataset, I couldn't reproduce the error. Could you please share the training script/recipe that you used? It looks like the datasets field in config.json has more brackets than it should, and the datasets parameter is populated in the training recipe.
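
One way such extra brackets could end up in the config is sketched below. This is an assumption for illustration only, since the thread does not confirm it: if a recipe wraps its dataset variable in a list unconditionally, a single dataset comes out fine, but a variable that already holds a list of datasets gets nested. Plain dicts stand in for the dataset configs, and as_dataset_list is a hypothetical helper, not a TTS function.

```python
def as_dataset_list(dataset_config):
    """Return a flat list of dataset entries (hypothetical helper).

    Accepts either one dataset (a dict) or several (a list or tuple of dicts).
    """
    if isinstance(dataset_config, dict):
        return [dataset_config]
    return list(dataset_config)


# Plain dicts stand in for real dataset configs in this illustration.
single = {"dataset_name": "Own_1"}
several = [{"dataset_name": "Own_1"}, {"dataset_name": "Own_2"}]

# Unconditional wrapping is fine for a single dataset...
print([single])
# ...but produces the nested [[{...}, {...}]] shape for a list of datasets.
print([several])

# The helper yields a flat list in both cases.
print(as_dataset_list(single))
print(as_dataset_list(several))
```

If this is what happened, it would also explain why the error does not show up when training with a single dataset.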

Ca-ressemble-a-du-fake commented 1 year ago

Hi! I used the recipe provided in the repo.

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. You might also look at our discussion channels.
