[Bug] Parameter Type Mismatch When Training Overflow

stanleyshly commented 1 year ago

Describe the bug

I'm running into a "parameter types mismatch" RuntimeError in the LSTM layer of the torch library.

To Reproduce

I'm currently running the recipe for Overflow in the ljspeech folder. I've downloaded the dataset, but when I run the script, I get the error shown under Logs below.

Expected behavior

No response

Logs

Traceback (most recent call last):
  File "/home/workstation/.local/lib/python3.10/site-packages/trainer/trainer.py", line 1591, in fit
    self._fit()
  File "/home/workstation/.local/lib/python3.10/site-packages/trainer/trainer.py", line 1544, in _fit
    self.train_epoch()
  File "/home/workstation/.local/lib/python3.10/site-packages/trainer/trainer.py", line 1309, in train_epoch
    _, _ = self.train_step(batch, batch_num_steps, cur_step, loader_start_time)
  File "/home/workstation/.local/lib/python3.10/site-packages/trainer/trainer.py", line 1141, in train_step
    outputs, loss_dict_new, step_time = self._optimize(
  File "/home/workstation/.local/lib/python3.10/site-packages/trainer/trainer.py", line 1025, in _optimize
    outputs, loss_dict = self._model_train_step(batch, model, criterion)
  File "/home/workstation/.local/lib/python3.10/site-packages/trainer/trainer.py", line 970, in _model_train_step
    return model.train_step(*input_args)
  File "/home/workstation/.local/lib/python3.10/site-packages/TTS/tts/models/overflow.py", line 174, in train_step
    outputs = self.forward(
  File "/home/workstation/.local/lib/python3.10/site-packages/TTS/tts/models/overflow.py", line 144, in forward
    encoder_outputs, encoder_output_len = self.encoder(text, text_len)
  File "/home/workstation/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/workstation/.local/lib/python3.10/site-packages/TTS/tts/layers/overflow/common_layers.py", line 64, in forward
    o, _ = self.lstm(o)
  File "/home/workstation/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/workstation/.local/lib/python3.10/site-packages/torch/nn/modules/rnn.py", line 832, in forward
    result = _VF.lstm(input, batch_sizes, hx, self._flat_weights, self.bias,
RuntimeError: parameter types mismatch

Environment

{
    "CUDA": {
        "GPU": [
            "AMD Radeon RX 6600 XT"
        ],
        "available": true,
        "version": null
    },
    "Packages": {
        "PyTorch_debug": false,
        "PyTorch_version": "2.1.0.dev20230409+rocm5.4.2",
        "TTS": "0.12.0",
        "numpy": "1.22.4"
    },
    "System": {
        "OS": "Linux",
        "architecture": [
            "64bit",
            "ELF"
        ],
        "processor": "x86_64",
        "python": "3.10.6",
        "version": "#39~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Fri Mar 17 21:16:15 UTC 2"
    }
}

Additional context

No response

shivammehta25 commented 1 year ago

Are you training from scratch or finetuning the checkpoint? Could you confirm that both the model and the data are on the same device?

stanleyshly commented 1 year ago

I am training from scratch using the recipe here: . How can I confirm that the model and data are on the same device? I've specified gpu=0 though in the recipe.

shivammehta25 commented 1 year ago

I tried the default recipe with the stable version of torch (torch==2.0.0+cu118) and could not replicate the issue. Could you share your config file and training script? Also, try downgrading to a stable torch version instead of a dev build.
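On the earlier question of how to confirm that the model and the data are on the same device, one possible check is to tally the device and dtype of every parameter and of every tensor in a batch. This is only a rough sketch: `summarize_placement` and `report_placement` are hypothetical helper names, not part of TTS or the trainer, and the batch is assumed to be a dict whose tensor values expose `.device` and `.dtype`.

```python
from collections import Counter


def summarize_placement(named_tensors):
    """Count (device, dtype) pairs over an iterable of (name, tensor) pairs."""
    return Counter((str(t.device), str(t.dtype)) for _, t in named_tensors)


def report_placement(model, batch):
    """Print where the model parameters and the batch tensors live.

    Assumption: `model` is a torch.nn.Module and `batch` is a dict; values
    without `.device` and `.dtype` (lists, strings, None) are skipped.
    """
    params = summarize_placement(model.named_parameters())
    inputs = summarize_placement(
        (k, v) for k, v in batch.items() if hasattr(v, "device") and hasattr(v, "dtype")
    )
    print("parameters:", dict(params))
    print("batch:     ", dict(inputs))
    return params, inputs
```

If the parameter summary lists more than one device or more than one floating-point dtype, that would be worth including in the report. Passing only the submodule from the traceback (for example `model.encoder`) instead of the full model should narrow it down further.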

stanleyshly commented 1 year ago

I will try that. It could also be an error in PyTorch ROCm.

Here is my training script:

import os

from trainer import Trainer, TrainerArgs

from TTS.config.shared_configs import BaseAudioConfig
from TTS.tts.configs.overflow_config import OverflowConfig
from TTS.tts.configs.shared_configs import BaseDatasetConfig
from TTS.tts.datasets import load_tts_samples
from TTS.tts.models.overflow import Overflow
from TTS.tts.utils.text.tokenizer import TTSTokenizer
from TTS.utils.audio import AudioProcessor

output_path = os.path.dirname(os.path.abspath(__file__))

# init configs
dataset_config = BaseDatasetConfig(
    formatter="ljspeech", meta_file_train="metadata.csv", path=os.path.join("data", "LJSpeech-1.1/")
)

audio_config = BaseAudioConfig(
    sample_rate=22050,
    do_trim_silence=True,
    trim_db=60.0,
    signal_norm=False,
    mel_fmin=0.0,
    mel_fmax=8000,
    spec_gain=1.0,
    log_func="np.log",
    ref_level_db=20,
    preemphasis=0.0,
)

config = OverflowConfig(  # This is the config that is saved for the future use
    run_name="overflow_ljspeech",
    audio=audio_config,
    batch_size=30,
    eval_batch_size=16,
    num_loader_workers=4,
    num_eval_loader_workers=4,
    run_eval=True,
    test_delay_epochs=-1,
    epochs=1000,
    text_cleaner="phoneme_cleaners",
    use_phonemes=True,
    phoneme_language="en-us",
    phoneme_cache_path=os.path.join(output_path, "phoneme_cache"),
    precompute_num_workers=8,
    mel_statistics_parameter_path=os.path.join(output_path, "lj_parameters.pt"),
    force_generate_statistics=False,
    print_step=1,
    print_eval=True,
    mixed_precision=True,
    output_path=output_path,
    datasets=[dataset_config],
)

# INITIALIZE THE AUDIO PROCESSOR
# Audio processor is used for feature extraction and audio I/O.
# It mainly serves to the dataloader and the training loggers.
ap = AudioProcessor.init_from_config(config)

# INITIALIZE THE TOKENIZER
# Tokenizer is used to convert text to sequences of token IDs.
# If characters are not defined in the config, default characters are passed to the config
tokenizer, config = TTSTokenizer.init_from_config(config)

# LOAD DATA SAMPLES
# Each sample is a list of ```[text, audio_file_path, speaker_name]```
# You can define your custom sample loader returning the list of samples.
# Or define your custom formatter and pass it to the `load_tts_samples`.
# Check `TTS.tts.datasets.load_tts_samples` for more details.
train_samples, eval_samples = load_tts_samples(
    dataset_config,
    eval_split=True,
    eval_split_max_size=config.eval_split_max_size,
    eval_split_size=config.eval_split_size,
)

# INITIALIZE THE MODEL
# Models take a config object and a speaker manager as input
# Config defines the details of the model like the number of layers, the size of the embedding, etc.
# Speaker manager is used by multi-speaker models.
model = Overflow(config, ap, tokenizer)

# init the trainer and 🚀
trainer = Trainer(
    TrainerArgs(),
    config,
    output_path,
    model=model,
    train_samples=train_samples,
    eval_samples=eval_samples,
    gpu=1,
)
trainer.fit()

shivammehta25 commented 1 year ago

Yeah! I am using the exact same recipe. Could you try running a basic LSTM network and see if the error persists?
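A basic LSTM test along those lines could look like the sketch below. It is not taken from the TTS code base: `lstm_smoke_test` is a hypothetical name and the layer sizes are arbitrary. The `use_autocast` switch is included only because the posted config sets `mixed_precision=True`; whether mixed precision is related to the error is an assumption to be tested, not a diagnosis.

```python
def lstm_smoke_test(device="cpu", use_autocast=False):
    """Run one forward and backward pass through a small bidirectional LSTM.

    Returns the output shape as a tuple so the result is easy to compare.
    """
    # Imported inside the function so the file can be loaded without PyTorch.
    import torch
    from torch import nn

    lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True, bidirectional=True)
    lstm = lstm.to(device)
    x = torch.randn(4, 10, 8, device=device)  # (batch, time, features)

    # torch.autocast expects the device type ("cuda" or "cpu"), not "cuda:0".
    device_type = torch.device(device).type
    with torch.autocast(device_type=device_type, enabled=use_autocast):
        out, _ = lstm(x)

    # Cast to float32 before reducing so the backward pass also runs under autocast.
    out.float().sum().backward()
    return tuple(out.shape)
```

Calling `lstm_smoke_test("cpu")` first, then `lstm_smoke_test("cuda")` and `lstm_smoke_test("cuda", use_autocast=True)`, should show whether the failure is tied to the GPU backend or to autocast. ROCm builds of PyTorch also use the `"cuda"` device string.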

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. You might also look our discussion channels.
