coqui-ai / TTS

🐸💬 - a deep learning toolkit for Text-to-Speech, battle-tested in research and production
http://coqui.ai
Mozilla Public License 2.0

[Bug] GlowTTS / Tacotron2 Training stuck and fail #3529

Closed gabrielelanzafamee closed 8 months ago

gabrielelanzafamee commented 9 months ago

Describe the bug

Hi, I'm trying to train GlowTTS and Tacotron2 on a dataset in the same format as LJSpeech.

I used the same dataset to train XTTS v2 and it worked, but when I try to train GlowTTS or Tacotron2, training appears to get stuck and then raises an exception.

This is the dataset: https://huggingface.co/datasets/xjabr/british_old_lady
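A quick way to confirm that a dataset really follows the LJSpeech layout is sketched below. It assumes the usual convention of a `metadata.csv` with three `|`-separated fields per line (file ID, raw text, normalized text) and audio stored as `wavs/<id>.wav`; the helper name `check_ljspeech_layout` is hypothetical and not part of 🐸TTS.

```python
from pathlib import Path


def check_ljspeech_layout(root, meta_file="metadata.csv"):
    """Return a list of human-readable problems found in an LJSpeech-style dataset.

    Assumed layout (not verified against this particular dataset):
      <root>/metadata.csv  with lines of the form  id|text|normalized_text
      <root>/wavs/<id>.wav
    """
    root = Path(root)
    problems = []
    with open(root / meta_file, encoding="utf-8") as handle:
        for line_no, line in enumerate(handle, start=1):
            cols = line.rstrip("\n").split("|")
            if len(cols) < 3:
                problems.append(
                    f"line {line_no}: expected 3 '|'-separated fields, got {len(cols)}"
                )
                continue
            wav_path = root / "wavs" / f"{cols[0]}.wav"
            if not wav_path.is_file():
                problems.append(f"line {line_no}: missing audio file {wav_path}")
    return problems
```

Calling `check_ljspeech_layout("data/old-lady")` should return an empty list if the metadata and audio files line up. It only checks the layout, not the audio content itself.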

To Reproduce

This is the code:

import os

# Trainer: Where the ✨️ happens.
# TrainerArgs: Defines the set of arguments of the Trainer.
from trainer import Trainer, TrainerArgs

# GlowTTSConfig: all model related values for training, validating and testing.
from TTS.tts.configs.glow_tts_config import GlowTTSConfig

# BaseDatasetConfig: defines name, formatter and path of the dataset.
from TTS.tts.configs.shared_configs import BaseDatasetConfig
from TTS.tts.datasets import load_tts_samples
from TTS.tts.models.glow_tts import GlowTTS
from TTS.tts.utils.text.tokenizer import TTSTokenizer
from TTS.utils.audio import AudioProcessor

# we use the same path as this script as our training folder.
output_path = os.path.join(os.path.dirname(os.path.abspath(__file__)), "run", "glow_tts")

# DEFINE DATASET CONFIG
# Set LJSpeech as our target dataset and define its path.
# You can also use a simple Dict to define the dataset and pass it to your custom formatter.
dataset_config = BaseDatasetConfig(
    formatter="ljspeech",
    dataset_name="old-lady",
    path="data/old-lady",
    meta_file_train="metadata.csv",
)

# INITIALIZE THE TRAINING CONFIGURATION
# Configure the model. Every config class inherits the BaseTTSConfig.
config = GlowTTSConfig(
    batch_size=16,
    eval_batch_size=4,
    num_loader_workers=1,
    num_eval_loader_workers=1,
    run_eval=True,
    test_delay_epochs=-1,
    epochs=10,
    text_cleaner="phoneme_cleaners",
    use_phonemes=False,
    phoneme_language="en-us",
    phoneme_cache_path=os.path.join(output_path, "phoneme_cache"),
    print_step=25,
    print_eval=False,
    mixed_precision=True,
    output_path=output_path,
    datasets=[dataset_config],
)

# INITIALIZE THE AUDIO PROCESSOR
# Audio processor is used for feature extraction and audio I/O.
# It mainly serves the dataloader and the training loggers.
ap = AudioProcessor.init_from_config(config)

# INITIALIZE THE TOKENIZER
# Tokenizer is used to convert text to sequences of token IDs.
# If characters are not defined in the config, default characters are passed to the config
tokenizer, config = TTSTokenizer.init_from_config(config)

# LOAD DATA SAMPLES
# Each sample is a list of ```[text, audio_file_path, speaker_name]```
# You can define your custom sample loader returning the list of samples.
# Or define your custom formatter and pass it to the `load_tts_samples`.
# Check `TTS.tts.datasets.load_tts_samples` for more details.
train_samples, eval_samples = load_tts_samples(
    dataset_config,
    eval_split=True,
    eval_split_max_size=config.eval_split_max_size,
    eval_split_size=config.eval_split_size,
)

# INITIALIZE THE MODEL
# Models take a config object and a speaker manager as input
# Config defines the details of the model like the number of layers, the size of the embedding, etc.
# Speaker manager is used by multi-speaker models.
model = GlowTTS(config, ap, tokenizer, speaker_manager=None)

# INITIALIZE THE TRAINER
# Trainer provides a generic API to train all the 🐸TTS models with all its perks like mixed-precision training,
# distributed training, etc.
trainer = Trainer(
    TrainerArgs(), config, output_path, model=model, train_samples=train_samples, eval_samples=eval_samples
)

# AND... 3,2,1... 🚀
trainer.fit()

Expected behavior

No response

Logs

root@3726f96612ad:/workspace# python train_glowtts.py 
 > Setting up Audio Processor...
 | > sample_rate:22050
 | > resample:False
 | > num_mels:80
 | > log_func:np.log10
 | > min_level_db:-100
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:20
 | > fft_size:1024
 | > power:1.5
 | > preemphasis:0.0
 | > griffin_lim_iters:60
 | > signal_norm:True
 | > symmetric_norm:True
 | > mel_fmin:0
 | > mel_fmax:None
 | > pitch_fmin:1.0
 | > pitch_fmax:640.0
 | > spec_gain:20.0
 | > stft_pad_mode:reflect
 | > max_norm:4.0
 | > clip_norm:True
 | > do_trim_silence:True
 | > trim_db:45
 | > do_sound_norm:False
 | > do_amp_to_db_linear:True
 | > do_amp_to_db_mel:True
 | > do_rms_norm:False
 | > db_level:None
 | > stats_path:None
 | > base:10
 | > hop_length:256
 | > win_length:1024
 | > Found 227 files in /workspace/data/old-lady
fatal: not a git repository (or any parent up to mount point /)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
fatal: not a git repository (or any parent up to mount point /)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
 > Training Environment:
 | > Backend: Torch
 | > Mixed precision: True
 | > Precision: fp16
 | > Current device: 0
 | > Num. of GPUs: 1
 | > Num. of CPUs: 64
 | > Num. of Torch Threads: 32
 | > Torch seed: 54321
 | > Torch CUDNN: True
 | > Torch CUDNN deterministic: False
 | > Torch CUDNN benchmark: False
 | > Torch TF32 MatMul: False
 > Start Tensorboard: tensorboard --logdir=/workspace/run/glow_tts/run-January-19-2024_10+00AM-0000000

 > Model has 28597969 parameters

 > EPOCH: 0/10
 --> /workspace/run/glow_tts/run-January-19-2024_10+00AM-0000000

> DataLoader initialization
| > Tokenizer:
        | > add_blank: False
        | > use_eos_bos: False
        | > use_phonemes: False
| > Number of instances : 225
 | > Preprocessing samples
 | > Max text length: 102
 | > Min text length: 15
 | > Avg text length: 56.111111111111114
 | 
 | > Max audio length: 496125
 | > Min audio length: 66150
 | > Avg audio length: 171205.46666666667
 | > Num. instances discarded samples: 0
 | > Batch group size: 0.

 > TRAINING (2024-01-19 10:00:31) 
/usr/local/lib/python3.10/dist-packages/librosa/core/spectrum.py:256: UserWarning: n_fft=1024 is too large for input signal of length=2
  warnings.warn(
 ! Run is removed from /workspace/run/glow_tts/run-January-19-2024_10+00AM-0000000
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/trainer/trainer.py", line 1833, in fit
    self._fit()
  File "/usr/local/lib/python3.10/dist-packages/trainer/trainer.py", line 1785, in _fit
    self.train_epoch()
  File "/usr/local/lib/python3.10/dist-packages/trainer/trainer.py", line 1503, in train_epoch
    for cur_step, batch in enumerate(self.train_loader):
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 630, in __next__
    data = self._next_data()
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 1345, in _next_data
    return self._process_data(data)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 1371, in _process_data
    data.reraise()
  File "/usr/local/lib/python3.10/dist-packages/torch/_utils.py", line 694, in reraise
    raise exception
AssertionError: Caught AssertionError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
    data = fetcher.fetch(index)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/fetch.py", line 54, in fetch
    return self.collate_fn(data)
  File "/usr/local/lib/python3.10/dist-packages/TTS/tts/datasets/dataset.py", line 475, in collate_fn
    mel = prepare_tensor(mel, self.outputs_per_step)
  File "/usr/local/lib/python3.10/dist-packages/TTS/tts/utils/data.py", line 29, in prepare_tensor
    return np.stack([_pad_tensor(x, pad_len) for x in inputs])
  File "/usr/local/lib/python3.10/dist-packages/TTS/tts/utils/data.py", line 29, in <listcomp>
    return np.stack([_pad_tensor(x, pad_len) for x in inputs])
  File "/usr/local/lib/python3.10/dist-packages/TTS/tts/utils/data.py", line 20, in _pad_tensor
    assert x.ndim == 2
AssertionError
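
The librosa warning (`input signal of length=2`) together with the failed `x.ndim == 2` check is consistent with some clips being loaded as multi-channel (for example stereo) audio, which would add an extra dimension to the mel spectrogram. This is one possible cause and is not confirmed in this thread. A minimal sketch for checking it, using only the Python standard library, follows; the helper names `find_multichannel_wavs` and `downmix_to_mono` are hypothetical, and the downmix assumes 16-bit PCM WAV files.

```python
import array
import sys
import wave
from pathlib import Path


def find_multichannel_wavs(root):
    """Return (path, n_channels) for every .wav under root that is not mono."""
    offenders = []
    for path in sorted(Path(root).rglob("*.wav")):
        with wave.open(str(path), "rb") as wav:
            if wav.getnchannels() != 1:
                offenders.append((path, wav.getnchannels()))
    return offenders


def downmix_to_mono(src, dst):
    """Average the channels of a 16-bit PCM WAV file and write a mono copy to dst."""
    with wave.open(str(src), "rb") as wav:
        n_channels = wav.getnchannels()
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        rate = wav.getframerate()
        samples = array.array("h", wav.readframes(wav.getnframes()))
    if sys.byteorder == "big":
        samples.byteswap()  # WAV sample data is little-endian
    # Frames are interleaved: one sample per channel, repeated for each frame.
    mono = array.array(
        "h",
        (
            sum(samples[i:i + n_channels]) // n_channels
            for i in range(0, len(samples), n_channels)
        ),
    )
    if sys.byteorder == "big":
        mono.byteswap()
    with wave.open(str(dst), "wb") as out:
        out.setnchannels(1)
        out.setsampwidth(2)
        out.setframerate(rate)
        out.writeframes(mono.tobytes())
```

Running `find_multichannel_wavs("data/old-lady")` lists any clips that are not mono. If it returns an empty list, the cause lies elsewhere. Note that the `wave` module raises `wave.Error` for non-PCM files (such as 32-bit float WAVs), which would need a different reader.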

Environment

{
    "CUDA": {
        "GPU": [
            "NVIDIA GeForce RTX 4090"
        ],
        "available": true,
        "version": "12.1"
    },
    "Packages": {
        "PyTorch_debug": false,
        "PyTorch_version": "2.1.2+cu121",
        "TTS": "0.22.0",
        "numpy": "1.22.0"
    },
    "System": {
        "OS": "Linux",
        "architecture": [
            "64bit",
            ""
        ],
        "processor": "x86_64",
        "python": "3.10.12",
        "version": "#187-Ubuntu SMP Thu Nov 23 14:52:28 UTC 2023"
    }
}

Additional context

No response

stale[bot] commented 8 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. You might also look at our discussion channels.