size mismatch for context_lstm.weight_ih_l0 and context_lstm.weight_ih_l0_reverse

skyler14 commented 2 years ago

When I try to run the basic inference demo I get mismatches for the dimensionality of the pretrained models and context_lstm. There isn't really any location in any of the project files where 1044 or 1040 is directly specified if you search the content of every file, so this leaves me without really any context for how to start the inference. Here is a full command with verbose absolute paths for all input files using the downloaded files.

python inference.py -c '/home/skyler/Documents/RADTTS/configs/config_ljs_radtts.json' -r '/home/skyler/Documents/RADTTS/model_archives/radtts++ljs-dap.pt' -v '/home/skyler/Documents/RADTTS/model_archives/hifigan_libritts100360_generator0p5.pt' -k '/home/skyler/Documents/RADTTS/model_archives/hifigan_22khz_config.json' -t '/home/skyler/Documents/RADTTS/input/test.txt' --speaker_attributes ljs --speaker_text ljs -o results/

Applying spectral norm to text encoder LSTM
/home/skyler/Documents/RADTTS/common.py:391: UserWarning: torch.qr is deprecated in favor of torch.linalg.qr and will be removed in a future PyTorch release.
The boolean parameter 'some' has been replaced with a string parameter 'mode'.
Q, R = torch.qr(A, some)
should be replaced with
Q, R = torch.linalg.qr(A, 'reduced' if some else 'complete') (Triggered internally at  ../aten/src/ATen/native/BatchLinearAlgebra.cpp:2497.)
  W = torch.qr(torch.FloatTensor(c, c).normal_())[0]
Applying spectral norm to context encoder LSTM
Traceback (most recent call last):
  File "/home/skyler/Documents/RADTTS/inference.py", line 201, in <module>
    infer(args.radtts_path, args.vocoder_path, args.config_vocoder,
  File "/home/skyler/Documents/RADTTS/inference.py", line 95, in infer
    radtts.load_state_dict(state_dict, strict=False)
  File "/home/skyler/miniconda3/envs/radtts/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1604, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for RADTTS:
    size mismatch for context_lstm.weight_ih_l0: copying a param with shape torch.Size([2080, 1044]) from checkpoint, the shape in current model is torch.Size([2080, 1040]).
    size mismatch for context_lstm.weight_ih_l0_reverse: copying a param with shape torch.Size([2080, 1044]) from checkpoint, the shape in current model is torch.Size([2080, 1040]).

rafaelvalle commented 2 years ago

config_radtts.json is the config path for radtts without pitch conditioning. the pre-trained model we shared uses pitch conditioned and the config that includes the attribute predictor, which you'll need for inference, is config_ljs_dap.json

skyler14 commented 2 years ago

So would I be correct in assuming that ljs_audiopath_text_sid_emotion_duration_train_filelist.txt is just a placeholder name for the respective data. If so Then I'm still getting a funky error when I rename the config file to point toward your filelists file

/home/skyler/Documents/RADTTS/common.py:391: UserWarning: torch.qr is deprecated in favor of torch.linalg.qr and will be removed in a future PyTorch release.
The boolean parameter 'some' has been replaced with a string parameter 'mode'.
Q, R = torch.qr(A, some)
should be replaced with
Q, R = torch.linalg.qr(A, 'reduced' if some else 'complete') (Triggered internally at  ../aten/src/ATen/native/BatchLinearAlgebra.cpp:2497.)
  W = torch.qr(torch.FloatTensor(c, c).normal_())[0]
Applying spectral norm to context encoder LSTM
Loaded checkpoint '/home/skyler/Documents/RADTTS/model_archives/radtts++ljs-dap.pt')
Traceback (most recent call last):
  File "/home/skyler/Documents/RADTTS/inference.py", line 201, in <module>
    infer(args.radtts_path, args.vocoder_path, args.config_vocoder,
  File "/home/skyler/Documents/RADTTS/inference.py", line 100, in infer
    trainset = Data(
  File "/home/skyler/Documents/RADTTS/data.py", line 95, in __init__
    self.data = self.load_data(datasets)
  File "/home/skyler/Documents/RADTTS/data.py", line 180, in load_data
    with open(filelist_path, encoding='utf-8') as f:
FileNotFoundError: [Errno 2] No such file or directory: '/filelists/ljs_audiopaths_text_sid_train_filelist.txt'

the relevant section of this file was adjusted to include those paths with the names during time of download.

"data_config": {
        "training_files": {
            "LJS": {
                "basedir": "/filelists/",
                "audiodir": "wavs",
                "filelist": "ljs_audiopaths_text_sid_train_filelist.txt",
                "lmdbpath": ""
            }
        },
        "validation_files": {
            "LJS": {
                "basedir": "/filelists/",
                "audiodir": "wavs",
                "filelist": "ljs_audiopaths_text_sid_val_filelist.txt",
                "lmdbpath": ""
            }
        },

located in the respective folder

rafaelvalle commented 2 years ago

very likely this is a typo and the firsth dash shouldn't be there. if it does not work, try passing the full pass to the filelists folder "basedir": "/filelists/",

skyler14 commented 2 years ago

Changing the first slash seems to resolve it, though I'm getting a KeyError: None when I think its trying to pick a speaker

/home/skyler/Documents/RADTTS/common.py:391: UserWarning: torch.qr is deprecated in favor of torch.linalg.qr and will be removed in a future PyTorch release.
The boolean parameter 'some' has been replaced with a string parameter 'mode'.
Q, R = torch.qr(A, some)
should be replaced with
Q, R = torch.linalg.qr(A, 'reduced' if some else 'complete') (Triggered internally at  ../aten/src/ATen/native/BatchLinearAlgebra.cpp:2497.)
  W = torch.qr(torch.FloatTensor(c, c).normal_())[0]
Applying spectral norm to context encoder LSTM
Loaded checkpoint '/home/skyler/Documents/RADTTS/model_archives/radtts++ljs-dap.pt')
Number of speakers: 1
Speaker IDS {'0': 0}
Number of files 12442
Number of files after duration filtering 12442
Dataloader initialized with no augmentations
Traceback (most recent call last):
  File "/home/skyler/Documents/RADTTS/inference.py", line 201, in <module>
    infer(args.radtts_path, args.vocoder_path, args.config_vocoder,
  File "/home/skyler/Documents/RADTTS/inference.py", line 104, in infer
    speaker_id = trainset.get_speaker_id(speaker).cuda()
  File "/home/skyler/Documents/RADTTS/data.py", line 279, in get_speaker_id
    return torch.LongTensor([self.speaker_ids[speaker]])
KeyError: None

does this have to do with the hitherto unspecified warmstart in my configs?

        "vocoder_config_path": "models/hifigan_22khz_config.json",
        "vocoder_checkpoint_path": "models/hifigan_libritts100360_generator0p5.pt",
        "log_attribute_samples": true,
        "log_decoder_samples": true,
        "warmstart_checkpoint_path": "path/to/pretrained/decoder",

I'm already specifying what model to use aren't I with this command:

python inference.py -c '/home/skyler/Documents/RADTTS/configs/config_ljs_dap.json' -r '/home/skyler/Documents/RADTTS/model_archives/radtts++ljs-dap.pt' -v '/home/skyler/Documents/RADTTS/model_archives/hifigan_libritts100360_generator0p5.pt' -k '/home/skyler/Documents/RADTTS/model_archives/hifigan_22khz_config.json' -t '/home/skyler/Documents/RADTTS/input/test.txt' --speaker_attributes ljs --speaker_text ljs -o results/

rafaelvalle commented 2 years ago

please pull and try adapting your your test.txt to follow the format here: https://github.com/NVIDIA/radtts/blob/main/data/vc_audiopath_txt_speaker_emotion_duration_filelist.txt

skyler14 commented 2 years ago

The problem is I get the same error even if I do use that exact file as my input file. Here is me running

python inference.py -c '/home/skyler/Documents/RADTTS/configs/config_ljs_dap.json' -r '/home/skyler/Documents/RADTTS/model_archives/radtts++ljs-dap.pt' -v '/home/skyler/Documents/RADTTS/model_archives/hifigan_libritts100360_generator0p5.pt' -k '/home/skyler/Documents/RADTTS/model_archives/hifigan_22khz_config.json' -t '/home/skyler/Documents/RADTTS/data/vc_audiopath_txt_speaker_emotion_duration_filelist.txt' --speaker_attributes ljs --speaker_text ljs -o results/

using your exact same file with the content as the contents of vc_audiopath_txt_speaker_emotion_duration_filelist.txt

kamala_source.wav|I mentor a lot of people and I tell them, that there will be people who will say it's not your turn, it's not your time, no one like you has done it,|ljs|other|8.550

As you can see below it still gives the same error and speaker ID is indicated as '0' as your previous pre-edit post was talking about.

Applying spectral norm to text encoder LSTM
/home/skyler/Documents/RADTTS/common.py:391: UserWarning: torch.qr is deprecated in favor of torch.linalg.qr and will be removed in a future PyTorch release.
The boolean parameter 'some' has been replaced with a string parameter 'mode'.
Q, R = torch.qr(A, some)
should be replaced with
Q, R = torch.linalg.qr(A, 'reduced' if some else 'complete') (Triggered internally at  ../aten/src/ATen/native/BatchLinearAlgebra.cpp:2497.)
  W = torch.qr(torch.FloatTensor(c, c).normal_())[0]
Applying spectral norm to context encoder LSTM
Loaded checkpoint '/home/skyler/Documents/RADTTS/model_archives/radtts++ljs-dap.pt')
Number of speakers: 1
Speaker IDS {'0': 0}
Number of files 12442
Number of files after duration filtering 12442
Dataloader initialized with no augmentations
Traceback (most recent call last):
  File "/home/skyler/Documents/RADTTS/inference.py", line 201, in <module>
    infer(args.radtts_path, args.vocoder_path, args.config_vocoder,
  File "/home/skyler/Documents/RADTTS/inference.py", line 104, in infer
    speaker_id = trainset.get_speaker_id(speaker).cuda()
  File "/home/skyler/Documents/RADTTS/data.py", line 279, in get_speaker_id
    return torch.LongTensor([self.speaker_ids[speaker]])
KeyError: None

Additionally I was thinking there is no wavs folder that exists audio files but you guys actually never had a wav folder or mentioned it anywhere. The closest contender is the 22khz folder in data.


    },
    "data_config": {
        "training_files": {
            "LJS": {
                "basedir": "filelists/",
                "audiodir": "22khz",
                "filelist": "ljs_audiopaths_text_sid_train_filelist.txt",
                "lmdbpath": ""
            }
        },
        "validation_files": {
            "LJS": {
                "basedir": "filelists/",
                "audiodir": "22khz",
                "filelist": "ljs_audiopaths_text_sid_val_filelist.txt",
                "lmdbpath": ""
            }
        },

I've tried leaving kamala_source in the project root, data/22khz, data, and filelist (where my training and validation data is) with that audiodir set to wav, wav/ ,22khx, 22khx/

Here is the whole config file, which overall is pretty minimally altered from the initial file:

{
    "train_config": {
        "output_directory": "/debug",
        "epochs": 10000000,
        "optim_algo": "RAdam",
        "learning_rate": 1e-3,
        "weight_decay": 1e-6,
        "sigma": 1.0,
        "iters_per_checkpoint": 1000,
        "batch_size": 32,
        "seed": null,
        "checkpoint_path": "",
        "ignore_layers": [],
        "ignore_layers_warmstart": [],
        "finetune_layers": [],
        "include_layers": [],
        "vocoder_config_path": "models/hifigan_22khz_config.json",
        "vocoder_checkpoint_path": "models/hifigan_libritts100360_generator0p5.pt",
        "log_attribute_samples": true,
        "log_decoder_samples": true,
        "warmstart_checkpoint_path": "path/to/pretrained/decoder",
        "use_amp": true,
        "grad_clip_val": 1.0,
        "loss_weights": {
            "blank_logprob": -1,
            "ctc_loss_weight": 0.1,
            "binarization_loss_weight": 1.0,
            "dur_loss_weight": 1.0,
            "f0_loss_weight": 1.0,
            "energy_loss_weight": 1.0,
            "vpred_loss_weight": 1.0
        },
        "binarization_start_iter": 0,
        "kl_loss_start_iter": 0,
        "unfreeze_modules": "durf0energyvpred"
    },
    "data_config": {
        "training_files": {
            "LJS": {
                "basedir": "filelists/",
                "audiodir": "22khz",
                "filelist": "ljs_audiopaths_text_sid_train_filelist.txt",
                "lmdbpath": ""
            }
        },
        "validation_files": {
            "LJS": {
                "basedir": "filelists/",
                "audiodir": "22khz",
                "filelist": "ljs_audiopaths_text_sid_val_filelist.txt",
                "lmdbpath": ""
            }
        },
        "dur_min": 0.1,
        "dur_max": 10.2,
        "sampling_rate": 22050,
        "filter_length": 1024,
        "hop_length": 256,
        "win_length": 1024,
        "n_mel_channels": 80,
        "mel_fmin": 0.0,
        "mel_fmax": 8000.0,
        "f0_min": 80.0,
        "f0_max": 640.0,
        "max_wav_value": 32768.0,
        "use_f0": true,
        "use_log_f0": 0,
        "use_energy_avg": true,
        "use_scaled_energy": true,
        "symbol_set": "radtts",
        "cleaner_names": ["radtts_cleaners"],
        "heteronyms_path": "tts_text_processing/heteronyms",
        "phoneme_dict_path": "tts_text_processing/cmudict-0.7b",
        "p_phoneme": 1.0,
        "handle_phoneme": "word",
        "handle_phoneme_ambiguous": "ignore",
        "include_speakers": null,
        "n_frames": -1,
        "betabinom_cache_path": "data_cache/",
        "lmdb_cache_path": "",
        "use_attn_prior_masking": true,
        "prepend_space_to_text": true,
        "append_space_to_text": true,
        "add_bos_eos_to_text": false,
        "betabinom_scaling_factor": 1.0,
        "distance_tx_unvoiced": false,
        "mel_noise_scale": 0.0
    },
    "dist_config": {
        "dist_backend": "nccl",
        "dist_url": "tcp://localhost:54321"
    },
    "model_config": {
        "n_speakers": 1,
        "n_speaker_dim": 16,
        "n_text": 185,
        "n_text_dim": 512,
        "n_flows": 8,
        "n_conv_layers_per_step": 4,
        "n_mel_channels": 80,
        "n_hidden": 1024,
        "mel_encoder_n_hidden": 512,
        "dummy_speaker_embedding": false,
        "n_early_size": 2,
        "n_early_every": 2,
        "n_group_size": 2,
        "affine_model": "wavenet",
        "include_modules": "decatndpmvpredapm",
        "scaling_fn": "tanh",
        "matrix_decomposition": "LUS",
        "learn_alignments": true,
        "use_speaker_emb_for_alignment": false,
        "attn_straight_through_estimator": true,
        "use_context_lstm": true,
        "context_lstm_norm": "spectral",
        "context_lstm_w_f0_and_energy": true,
        "text_encoder_lstm_norm": "spectral",
        "n_f0_dims": 1,
        "n_energy_avg_dims": 1,
        "use_first_order_features": false,
        "unvoiced_bias_activation": "relu",
        "decoder_use_partial_padding": true,
        "decoder_use_unvoiced_bias": true,
        "ap_pred_log_f0": true,
        "ap_use_unvoiced_bias": false,
        "ap_use_voiced_embeddings": true,
        "dur_model_config": {
            "name": "dap",
            "hparams": {
                "n_speaker_dim": 16,
                "bottleneck_hparams": {
                    "in_dim": 512,
                    "reduction_factor": 16,
                    "norm": "weightnorm",
                    "non_linearity": "relu"
                },
                "take_log_of_input": true,
                "arch_hparams": {
                    "out_dim": 1,
                    "n_layers": 2,
                    "n_channels": 256,
                    "kernel_size": 3,
                    "p_dropout": 0.25
                }
            }
        },
        "f0_model_config": {
            "name": "dap",
            "hparams": {
                "n_speaker_dim": 16,
                "bottleneck_hparams": {
                    "in_dim": 512,
                    "reduction_factor": 16,
                    "norm": "weightnorm",
                    "non_linearity": "relu"
                },
                "take_log_of_input": false,
                "use_transformer": false,
                "arch_hparams": {
                    "out_dim": 1,
                    "n_layers": 2,
                    "n_channels": 256,
                    "kernel_size": 11,
                    "p_dropout": 0.5
                }
            }
        },
        "energy_model_config": {
            "name": "dap",
            "hparams": {
                "n_speaker_dim": 16,
                "bottleneck_hparams": {
                    "in_dim": 512,
                    "reduction_factor": 16,
                    "norm": "weightnorm",
                    "non_linearity": "relu"
                },
                "take_log_of_input": false,
                "use_transformer": false,
                "arch_hparams": {
                    "out_dim": 1,
                    "n_layers": 2,
                    "n_channels": 256,
                    "kernel_size": 3,
                    "p_dropout": 0.25
                }
            }
        },
        "v_model_config": {
            "name": "dap",
            "hparams": {
                "n_speaker_dim": 16,
                "take_log_of_input": false,
                "bottleneck_hparams": {
                    "in_dim": 512,
                    "reduction_factor": 16,
                    "norm": "weightnorm",
                    "non_linearity": "relu"
                },
                "arch_hparams": {
                    "out_dim": 1,
                    "n_layers": 2,
                    "n_channels": 256,
                    "kernel_size": 3,
                    "p_dropout": 0.5,
                    "lstm_type": "",
                    "use_linear": 1
                }
            }
        }
    }
}

Lastly, for the sake of completeness, this is attempting to do inference without any extra training from the checkpoint so maybe that's the source of my problem. Are there other ancilliary files are made during the course of training which maybe is the source of my errors?

rafaelvalle commented 2 years ago

Did you pull?

On Thu, Jul 21, 2022 at 11:21 PM skyler14 @.***> wrote:

The problem is I get the same error even if I do use that exact file as my input file. Here is me running

python inference.py -c '/home/skyler/Documents/RADTTS/configs/config_ljs_dap.json' -r '/home/skyler/Documents/RADTTS/model_archives/radtts++ljs-dap.pt' -v '/home/skyler/Documents/RADTTS/model_archives/ hifigan_libritts100360_generator0p5.pt' -k '/home/skyler/Documents/RADTTS/model_archives/hifigan_22khz_config.json' -t '/home/skyler/Documents/RADTTS/data/vc_audiopath_txt_speaker_emotion_duration_filelist.txt' --speaker_attributes ljs --speaker_text ljs -o results/

using your exact same file with the content as the contents of vc_audiopath_txt_speaker_emotion_duration_filelist.txt

kamala_source.wav|I mentor a lot of people and I tell them, that there will be people who will say it's not your turn, it's not your time, no one like you has done it,|ljs|other|8.550

As you can see below it still gives the same error and speaker ID is indicated as '0' as your previous pre-edit post was talking about.

Applying spectral norm to text encoder LSTM /home/skyler/Documents/RADTTS/common.py:391: UserWarning: torch.qr is deprecated in favor of torch.linalg.qr and will be removed in a future PyTorch release. The boolean parameter 'some' has been replaced with a string parameter 'mode'. Q, R = torch.qr(A, some) should be replaced with Q, R = torch.linalg.qr(A, 'reduced' if some else 'complete') (Triggered internally at ../aten/src/ATen/native/BatchLinearAlgebra.cpp:2497.) W = torch.qr(torch.FloatTensor(c, c).normal_())[0] Applying spectral norm to context encoder LSTM Loaded checkpoint '/home/skyler/Documents/RADTTS/model_archives/radtts++ljs-dap.pt') Number of speakers: 1 Speaker IDS {'0': 0} Number of files 12442 Number of files after duration filtering 12442 Dataloader initialized with no augmentations Traceback (most recent call last): File "/home/skyler/Documents/RADTTS/inference.py", line 201, in infer(args.radtts_path, args.vocoder_path, args.config_vocoder, File "/home/skyler/Documents/RADTTS/inference.py", line 104, in infer speaker_id = trainset.get_speaker_id(speaker).cuda() File "/home/skyler/Documents/RADTTS/data.py", line 279, in get_speaker_id return torch.LongTensor([self.speaker_ids[speaker]]) KeyError: None

Additionally I was thinking there is no wavs folder that exists audio files but you guys actually never had a wav folder or mentioned it anywhere. The closest contender is the 22khz folder in data.

I've tried leaving kamala_source in the project root, data/22khz, data, and filelist (where my training and validation data is) with that audiodir set to wav, wav/ ,22khx, 22khx/

Here is the whole config file, which overall is pretty minimally altered from the initial file:

{ "train_config": { "output_directory": "/debug", "epochs": 10000000, "optim_algo": "RAdam", "learning_rate": 1e-3, "weight_decay": 1e-6, "sigma": 1.0, "iters_per_checkpoint": 1000, "batch_size": 32, "seed": null, "checkpoint_path": "", "ignore_layers": [], "ignore_layers_warmstart": [], "finetune_layers": [], "include_layers": [], "vocoder_config_path": "models/hifigan_22khz_config.json", "vocoder_checkpoint_path": "models/hifigan_libritts100360_generator0p5.pt", "log_attribute_samples": true, "log_decoder_samples": true, "warmstart_checkpoint_path": "path/to/pretrained/decoder", "use_amp": true, "grad_clip_val": 1.0, "loss_weights": { "blank_logprob": -1, "ctc_loss_weight": 0.1, "binarization_loss_weight": 1.0, "dur_loss_weight": 1.0, "f0_loss_weight": 1.0, "energy_loss_weight": 1.0, "vpred_loss_weight": 1.0 }, "binarization_start_iter": 0, "kl_loss_start_iter": 0, "unfreeze_modules": "durf0energyvpred" }, "data_config": { "training_files": { "LJS": { "basedir": "filelists/", "audiodir": "22khz", "filelist": "ljs_audiopaths_text_sid_train_filelist.txt", "lmdbpath": "" } }, "validation_files": { "LJS": { "basedir": "filelists/", "audiodir": "22khz", "filelist": "ljs_audiopaths_text_sid_val_filelist.txt", "lmdbpath": "" } }, "dur_min": 0.1, "dur_max": 10.2, "sampling_rate": 22050, "filter_length": 1024, "hop_length": 256, "win_length": 1024, "n_mel_channels": 80, "mel_fmin": 0.0, "mel_fmax": 8000.0, "f0_min": 80.0, "f0_max": 640.0, "max_wav_value": 32768.0, "use_f0": true, "use_log_f0": 0, "use_energy_avg": true, "use_scaled_energy": true, "symbol_set": "radtts", "cleaner_names": ["radtts_cleaners"], "heteronyms_path": "tts_text_processing/heteronyms", "phoneme_dict_path": "tts_text_processing/cmudict-0.7b", "p_phoneme": 1.0, "handle_phoneme": "word", "handle_phoneme_ambiguous": "ignore", "include_speakers": null, "n_frames": -1, "betabinom_cache_path": "data_cache/", "lmdb_cache_path": "", "use_attn_prior_masking": true, "prepend_space_to_text": true, "append_space_to_text": true, "add_bos_eos_to_text": false, "betabinom_scaling_factor": 1.0, "distance_tx_unvoiced": false, "mel_noise_scale": 0.0 }, "dist_config": { "dist_backend": "nccl", "dist_url": "tcp://localhost:54321" }, "model_config": { "n_speakers": 1, "n_speaker_dim": 16, "n_text": 185, "n_text_dim": 512, "n_flows": 8, "n_conv_layers_per_step": 4, "n_mel_channels": 80, "n_hidden": 1024, "mel_encoder_n_hidden": 512, "dummy_speaker_embedding": false, "n_early_size": 2, "n_early_every": 2, "n_group_size": 2, "affine_model": "wavenet", "include_modules": "decatndpmvpredapm", "scaling_fn": "tanh", "matrix_decomposition": "LUS", "learn_alignments": true, "use_speaker_emb_for_alignment": false, "attn_straight_through_estimator": true, "use_context_lstm": true, "context_lstm_norm": "spectral", "context_lstm_w_f0_and_energy": true, "text_encoder_lstm_norm": "spectral", "n_f0_dims": 1, "n_energy_avg_dims": 1, "use_first_order_features": false, "unvoiced_bias_activation": "relu", "decoder_use_partial_padding": true, "decoder_use_unvoiced_bias": true, "ap_pred_log_f0": true, "ap_use_unvoiced_bias": false, "ap_use_voiced_embeddings": true, "dur_model_config": { "name": "dap", "hparams": { "n_speaker_dim": 16, "bottleneck_hparams": { "in_dim": 512, "reduction_factor": 16, "norm": "weightnorm", "non_linearity": "relu" }, "take_log_of_input": true, "arch_hparams": { "out_dim": 1, "n_layers": 2, "n_channels": 256, "kernel_size": 3, "p_dropout": 0.25 } } }, "f0_model_config": { "name": "dap", "hparams": { "n_speaker_dim": 16, "bottleneck_hparams": { "in_dim": 512, "reduction_factor": 16, "norm": "weightnorm", "non_linearity": "relu" }, "take_log_of_input": false, "use_transformer": false, "arch_hparams": { "out_dim": 1, "n_layers": 2, "n_channels": 256, "kernel_size": 11, "p_dropout": 0.5 } } }, "energy_model_config": { "name": "dap", "hparams": { "n_speaker_dim": 16, "bottleneck_hparams": { "in_dim": 512, "reduction_factor": 16, "norm": "weightnorm", "non_linearity": "relu" }, "take_log_of_input": false, "use_transformer": false, "arch_hparams": { "out_dim": 1, "n_layers": 2, "n_channels": 256, "kernel_size": 3, "p_dropout": 0.25 } } }, "v_model_config": { "name": "dap", "hparams": { "n_speaker_dim": 16, "take_log_of_input": false, "bottleneck_hparams": { "in_dim": 512, "reduction_factor": 16, "norm": "weightnorm", "non_linearity": "relu" }, "arch_hparams": { "out_dim": 1, "n_layers": 2, "n_channels": 256, "kernel_size": 3, "p_dropout": 0.5, "lstm_type": "", "use_linear": 1 } } } } }

Lastly, for the sake of completeness, this is attempting to do inference without any extra training from the checkpoint so maybe that's the source of my problem. Are there other ancilliary files are made during the course of training which maybe is the source of my errors?

— Reply to this email directly, view it on GitHub https://github.com/NVIDIA/radtts/issues/6#issuecomment-1192226002, or unsubscribe https://github.com/notifications/unsubscribe-auth/AARSFD7BUBH27I6UA6JZR3LVVI4V3ANCNFSM54GYHTWA . You are receiving this because you commented.Message ID: @.***>

skyler14 commented 2 years ago

I was on the prior commit from from 2 days ago, 9e70ff862b756f7a8a6deedea11876df0f27407b

However even updating to 9f96ac4986cd15a2239f0db9b029fd316b04d4bd

I tried adding the force configs the update readme has in the other inference section just to be more explicit, so I'm running:

python inference.py -c '/home/skyler/Documents/RADTTS/configs/config_ljs_dap.json' -r '/home/skyler/Documents/RADTTS/model_archives/radtts++ljs-dap.pt' -v '/home/skyler/Documents/RADTTS/model_archives/hifigan_libritts100360_generator0p5.pt' -k '/home/skyler/Documents/RADTTS/model_archives/hifigan_22khz_config.json' -t '/home/skyler/Documents/RADTTS/data/vc_audiopath_txt_speaker_emotion_duration_filelist.txt' --speaker_attributes ljs --speaker_text ljs --output_dir results/ -p data_config.validation_files="{'Dummy': {'basedir': 'data/', 'audiodir':'22khz', 'filelist': 'vc_audiopath_txt_speaker_emotion_duration_filelist.txt'}}

Applying spectral norm to text encoder LSTM
/home/skyler/Documents/RADTTS/common.py:391: UserWarning: torch.qr is deprecated in favor of torch.linalg.qr and will be removed in a future PyTorch release.
The boolean parameter 'some' has been replaced with a string parameter 'mode'.
Q, R = torch.qr(A, some)
should be replaced with
Q, R = torch.linalg.qr(A, 'reduced' if some else 'complete') (Triggered internally at  ../aten/src/ATen/native/BatchLinearAlgebra.cpp:2497.)
  W = torch.qr(torch.FloatTensor(c, c).normal_())[0]
Applying spectral norm to context encoder LSTM
Loaded checkpoint '/home/skyler/Documents/RADTTS/model_archives/radtts++ljs-dap.pt')
Number of speakers: 1
Speaker IDS {'ljs': 0}
Number of files 12442
Number of files after duration filtering 12442
Dataloader initialized with no augmentations
Traceback (most recent call last):
  File "/home/skyler/Documents/RADTTS/inference.py", line 201, in <module>
    infer(args.radtts_path, args.vocoder_path, args.config_vocoder,
  File "/home/skyler/Documents/RADTTS/inference.py", line 104, in infer
    speaker_id = trainset.get_speaker_id(speaker).cuda()
  File "/home/skyler/Documents/RADTTS/data.py", line 279, in get_speaker_id
    return torch.LongTensor([self.speaker_ids[speaker]])
KeyError: None

log is almost same minus speaker IDS

rafaelvalle commented 2 years ago

Please try to run the commands in the README as they are first.

skyler14 commented 2 years ago

That was after I did try running the command as it is shown in inferences section and still had gotten the same error (with absolute paths for files that is):

python inference.py -c '/home/skyler/Documents/RADTTS/configs/config_ljs_dap.json' -r '/home/skyler/Documents/RADTTS/model_archives/radtts++ljs-dap.pt' -v '/home/skyler/Documents/RADTTS/model_archives/hifigan_libritts100360_generator0p5.pt' -k '/home/skyler/Documents/RADTTS/model_archives/hifigan_22khz_config.json' -t '/home/skyler/Documents/RADTTS/data/vc_audiopath_txt_speaker_emotion_duration_filelist.txt' --speaker_attributes ljs --speaker_text ljs -o results/

rafaelvalle commented 2 years ago

-t '/home/skyler/Documents/RADTTS/data/vc_audiopath_txt_speaker_emotion_duration_filelist.txt' -t is used in inference.py and takes in a list of sentences separated by new lines like this file https://github.com/NVIDIA/radtts/blob/main/sentences.txt

skyler14 commented 2 years ago

I'll switch over to using sentences.txt as my input however that still does not change my error at all.

Applying spectral norm to text encoder LSTM
/home/skyler/Documents/RADTTS/common.py:391: UserWarning: torch.qr is deprecated in favor of torch.linalg.qr and will be removed in a future PyTorch release.
The boolean parameter 'some' has been replaced with a string parameter 'mode'.
Q, R = torch.qr(A, some)
should be replaced with
Q, R = torch.linalg.qr(A, 'reduced' if some else 'complete') (Triggered internally at  ../aten/src/ATen/native/BatchLinearAlgebra.cpp:2497.)
  W = torch.qr(torch.FloatTensor(c, c).normal_())[0]
Applying spectral norm to context encoder LSTM
Loaded checkpoint '/home/skyler/Documents/RADTTS/model_archives/radtts++ljs-dap.pt')
Number of speakers: 1
Speaker IDS {'ljs': 0}
Number of files 12442
Number of files after duration filtering 12442
Dataloader initialized with no augmentations
Traceback (most recent call last):
  File "/home/skyler/Documents/RADTTS/inference.py", line 201, in <module>
    infer(args.radtts_path, args.vocoder_path, args.config_vocoder,
  File "/home/skyler/Documents/RADTTS/inference.py", line 104, in infer
    speaker_id = trainset.get_speaker_id(speaker).cuda()
  File "/home/skyler/Documents/RADTTS/data.py", line 279, in get_speaker_id
    return torch.LongTensor([self.speaker_ids[speaker]])
KeyError: None

Can you provide a hardcoded absolute path sample of how you run the inference code, similar to the examples I've been providing. Maybe that'll help me debug step by step what mistakes I might be making, as well as Python version this has been tested on. Current command looks like:

python inference.py -c '/home/skyler/Documents/RADTTS/configs/config_ljs_dap.json' -r '/home/skyler/Documents/RADTTS/model_archives/radtts++ljs-dap.pt' -v '/home/skyler/Documents/RADTTS/model_archives/hifigan_libritts100360_generator0p5.pt' -k '/home/skyler/Documents/RADTTS/model_archives/hifigan_22khz_config.json' -t '/home/skyler/Documents/RADTTS/sentences.txt' --speaker_attributes ljs --speaker_text ljs --output_dir results/

In your readme can you also maybe add a full docker build instruction set for your docker package (maybe also specify whether you guys are using nvidia docker). I think I should just try to spin this up in a docker image so I can control for everything and perfectly reproduce your results. Currently I think there's stuff missing from the docker file (you actually dont copy your project into the image for example) and when I try building an image I get Exited (2) status. I'm double checking my driver installation as per this nvidia-docker issue https://github.com/NVIDIA/nvidia-docker/issues/464 in the meantime but I suspect that Dockerfile is incomplete and not in a runnable state in this repo.

mepc36 commented 2 years ago

I'm commenting to second the request by @skyler14 for a hardcoded absolute path sample of how you run the inference code, thank you @rafaelvalle!

NVIDIA / radtts

size mismatch for context_lstm.weight_ih_l0 and context_lstm.weight_ih_l0_reverse #6