coqui-ai / TTS

🐸💬 - a deep learning toolkit for Text-to-Speech, battle-tested in research and production

[Bug] Continue Training After Adding New Speaker #2148

Closed: kin0303 closed this issue 1 year ago

kin0303 commented 2 years ago

Describe the bug

I get an error when continuing training as multi-speaker after adding a new speaker; before that I was training a single speaker.
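For context: adding a speaker changes the shape of the model's speaker embedding table, so a single-speaker checkpoint cannot fully populate that layer and the restore is only partial ("97 / 106 layers are restored" in the logs below). A minimal PyTorch sketch, with assumed sizes, of the lookup that fails in the traceback:

import torch

# Assumed sizes for illustration: 2 speakers, 512-dim embeddings.
speaker_embedding = torch.nn.Embedding(num_embeddings=2, embedding_dim=512)

speaker_ids = torch.tensor([0, 1])           # ids must be a LongTensor
print(speaker_embedding(speaker_ids).shape)  # torch.Size([2, 512])

# Passing None instead of a tensor reproduces the failure in the traceback:
# speaker_embedding(None) -> TypeError: embedding(): argument 'indices'
# (position 2) must be Tensor, not NoneType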

My train.py:

import os

from trainer import Trainer, TrainerArgs

from TTS.config.shared_configs import BaseAudioConfig
from TTS.tts.configs.shared_configs import BaseDatasetConfig
from TTS.tts.configs.tacotron2_config import Tacotron2Config
from TTS.tts.datasets import load_tts_samples
from TTS.tts.models.tacotron2 import Tacotron2
from TTS.tts.utils.speakers import SpeakerManager
from TTS.tts.utils.text.tokenizer import TTSTokenizer
from TTS.utils.audio import AudioProcessor

output_path = os.path.dirname(os.path.abspath(__file__))
dataset_config = BaseDatasetConfig(
    name="botika",
    meta_file_train="",
    # note: os.path.join discards output_path when the second argument is absolute
    path=os.path.join(output_path, "/media/botika/DATA-2/TTS/TTS_Coqui/BOTIKA/"),
)

audio_config = BaseAudioConfig(
    sample_rate=22050,
    resample=True,  # Resample to 22050 Hz. It slows down training. Use `TTS/bin/resample.py` to pre-resample and set this False for faster training.
    do_trim_silence=True,
    trim_db=23.0,
    signal_norm=False,
    mel_fmin=0.0,
    mel_fmax=8000,
    spec_gain=1.0,
    log_func="np.log",
    preemphasis=0.0,
)

config = Tacotron2Config(  # This is the config that is saved for future use
    audio=audio_config,
    run_name="BOTIKA_NEW",
    batch_size=4,
    eval_batch_size=4,
    num_loader_workers=4,
    num_eval_loader_workers=4,
    run_eval=True,
    test_delay_epochs=-1,
    r=6,
    gradual_training=[[0, 6, 4], [10000, 4, 4], [50000, 3, 4], [100000, 2, 4]],
    double_decoder_consistency=True,
    epochs=1000,
    text_cleaner="phoneme_cleaners",
    phoneme_language="en-us",
    phoneme_cache_path=os.path.join(output_path, "phoneme_cache"),
    precompute_num_workers=8,
    print_step=10,
    print_eval=True,
    mixed_precision=False,
    output_path=output_path,
    datasets=[dataset_config],
    use_speaker_embedding=True,  # set this to enable multi-speaker training
    lr=0.0001,
    use_phonemes=True,
    test_sentences=["Aku lagi males.", "Males banget untuk kerja.", "Apa yang kamu lakukan jika kamu memiliki banyak uang?"]
)

## INITIALIZE THE AUDIO PROCESSOR
# Audio processor is used for feature extraction and audio I/O.
# It mainly serves to the dataloader and the training loggers.
ap = AudioProcessor.init_from_config(config)

# INITIALIZE THE TOKENIZER
# Tokenizer is used to convert text to sequences of token IDs.
# If characters are not defined in the config, default characters are passed to the config
tokenizer, config = TTSTokenizer.init_from_config(config)

# LOAD DATA SAMPLES
# Each sample is a list of ```[text, audio_file_path, speaker_name]```
# You can define your custom sample loader returning the list of samples.
# Or define your custom formatter and pass it to the `load_tts_samples`.
# Check `TTS.tts.datasets.load_tts_samples` for more details.
train_samples, eval_samples = load_tts_samples(
    dataset_config,
    eval_split=True,
    eval_split_max_size=config.eval_split_max_size,
    eval_split_size=config.eval_split_size,
)

# init speaker manager for multi-speaker training
# it mainly handles speaker-id to speaker-name for the model and the data-loader
speaker_manager = SpeakerManager()
speaker_manager.set_ids_from_data(train_samples + eval_samples, parse_key="speaker_name")

# init model
model = Tacotron2(config, ap, tokenizer, speaker_manager)

# INITIALIZE THE TRAINER
# Trainer provides a generic API to train all the 🐸TTS models with all its perks like mixed-precision training,
# distributed training, etc.
trainer = Trainer(
    TrainerArgs(), config, output_path, model=model, train_samples=train_samples, eval_samples=eval_samples
)

# AND... 3,2,1... 🚀
trainer.fit()
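As a quick sanity check, not part of the original script, something like the following could be added before trainer.fit() to confirm both speakers were registered from the data (attribute names vary across TTS versions; name_to_id is assumed here):

# Hypothetical sanity check: confirm the speaker manager picked up both
# speakers before training starts.
print(speaker_manager.num_speakers)  # expect 2 after adding the new speaker
print(speaker_manager.name_to_id)    # e.g. {"Baby": 0, "Halwa": 1}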

To Reproduce

CUDA_VISIBLE_DEVICES=0 python train.py --continue_path /media/DATA-2/TTS/TTS_Coqui/TTS/NEW-October-25-2022_01+02PM-68cef28a
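Note that --continue_path reloads the config.json saved with the old run. If that run was single-speaker, one possible workaround (an assumption, not confirmed in this thread) is to start a fresh run that restores only the weights, so the new multi-speaker config in train.py takes effect; the checkpoint name is taken from the logs below:

CUDA_VISIBLE_DEVICES=0 python train.py --restore_path /media/DATA-2/TTS/TTS_Coqui/TTS/NEW-October-25-2022_01+02PM-68cef28a/checkpoint_1840000.pth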

Expected behavior

No error when continuing training.

Logs

The error looks like this:

 > Setting up Audio Processor...
 | > sample_rate:22050
 | > resample:True
 | > num_mels:80
 | > log_func:np.log
 | > min_level_db:-100
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:20
 | > fft_size:1024
 | > power:1.5
 | > preemphasis:0.0
 | > griffin_lim_iters:60
 | > signal_norm:False
 | > symmetric_norm:True
 | > mel_fmin:0
 | > mel_fmax:8000
 | > pitch_fmin:0.0
 | > pitch_fmax:640.0
 | > spec_gain:1.0
 | > stft_pad_mode:reflect
 | > max_norm:4.0
 | > clip_norm:True
 | > do_trim_silence:True
 | > trim_db:23.0
 | > do_sound_norm:False
 | > do_amp_to_db_linear:True
 | > do_amp_to_db_mel:True
 | > do_rms_norm:False
 | > db_level:None
 | > stats_path:None
 | > base:2.718281828459045
 | > hop_length:256
 | > win_length:1024
 | > /media/DATA-2/TTS/TTS_Coqui/Baby/metadata.csv
 | > /media/DATA-2/TTS/TTS_Coqui/Halwa/metadata.csv
 | > Found 21379 files in /media/DATA-2/TTS/TTS_Coqui
 > Init speaker_embedding layer.
 > Using CUDA: True
 > Number of GPUs: 1
 > `speakers.pth` is saved to /media/DATA-2/TTS/TTS_Coqui/TTS/NEW-October-25-2022_01+02PM-68cef28a/speakers.pth.
 > `speakers_file` is updated in the config.json.
 > Restoring from checkpoint_1840000.pth ...
 > Restoring Model...
 > Partial model initialization...
 | > 97 / 106 layers are restored.
 > Model restored from step 1840000

 > Model has 56681716 parameters
 > Restoring best loss from best_model_1839901.pth ...
 > Starting with loaded last best loss 1.190857

 > Number of output frames: 2

 > EPOCH: 0/1000
 --> /media/DATA-2/TTS/TTS_Coqui/TTS/NEW-October-25-2022_01+02PM-68cef28a

> DataLoader initialization
| > Tokenizer:
    | > add_blank: False
    | > use_eos_bos: False
    | > use_phonemes: True
    | > phonemizer:
        | > phoneme language: en-us
        | > phoneme backend: gruut
| > Number of instances : 21166
 | > Preprocessing samples
 | > Max text length: 273
 | > Min text length: 4
 | > Avg text length: 90.38788623263724
 | 
 | > Max audio length: 1145718.0
 | > Min audio length: 11868.0
 | > Avg audio length: 556292.4324860625
 | > Num. instances discarded samples: 0
 | > Batch group size: 0.

 > TRAINING (2022-11-14 13:22:17) 
 ! Run is kept in /media/DATA-2/TTS/TTS_Coqui/TTS/NEW-October-25-2022_01+02PM-68cef28a
Traceback (most recent call last):
  File "/media/DATA-2/TTS/TTS_Coqui/coqui_env/lib/python3.7/site-packages/trainer/trainer.py", line 1492, in fit
    self._fit()
  File "/media/DATA-2/TTS/TTS_Coqui/coqui_env/lib/python3.7/site-packages/trainer/trainer.py", line 1476, in _fit
    self.train_epoch()
  File "/media/DATA-2/TTS/TTS_Coqui/coqui_env/lib/python3.7/site-packages/trainer/trainer.py", line 1255, in train_epoch
    _, _ = self.train_step(batch, batch_num_steps, cur_step, loader_start_time)
  File "/media/DATA-2/TTS/TTS_Coqui/coqui_env/lib/python3.7/site-packages/trainer/trainer.py", line 1097, in train_step
    num_optimizers=len(self.optimizer) if isinstance(self.optimizer, list) else 1,
  File "/media/DATA-2/TTS/TTS_Coqui/coqui_env/lib/python3.7/site-packages/trainer/trainer.py", line 975, in _optimize
    outputs, loss_dict = self._model_train_step(batch, model, criterion)
  File "/media/DATA-2/TTS/TTS_Coqui/coqui_env/lib/python3.7/site-packages/trainer/trainer.py", line 931, in _model_train_step
    return model.train_step(*input_args)
  File "/media/DATA-2/TTS/TTS_Coqui/TTS/TTS/tts/models/tacotron2.py", line 327, in train_step
    outputs = self.forward(text_input, text_lengths, mel_input, mel_lengths, aux_input)
  File "/media/DATA-2/TTS/TTS_Coqui/TTS/TTS/tts/models/tacotron2.py", line 183, in forward
    embedded_speakers = self.speaker_embedding(aux_input["speaker_ids"])[:, None]
  File "/media/DATA-2/TTS/TTS_Coqui/coqui_env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/media/DATA-2/TTS/TTS_Coqui/coqui_env/lib/python3.7/site-packages/torch/nn/modules/sparse.py", line 160, in forward
    self.norm_type, self.scale_grad_by_freq, self.sparse)
  File "/media/DATA-2/TTS/TTS_Coqui/coqui_env/lib/python3.7/site-packages/torch/nn/functional.py", line 2183, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
TypeError: embedding(): argument 'indices' (position 2) must be Tensor, not NoneType
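The traceback shows aux_input["speaker_ids"] arriving as None at the embedding lookup, i.e. the data loader never attached speaker ids to the batch. One plausible check, assuming the continued run's config.json is the one being loaded, is whether that file actually enables multi-speaker mode:

import json

# Path taken from the --continue_path above; adjust as needed.
cfg_path = "/media/DATA-2/TTS/TTS_Coqui/TTS/NEW-October-25-2022_01+02PM-68cef28a/config.json"
with open(cfg_path) as f:
    cfg = json.load(f)

# If these values were saved by the earlier single-speaker run, the batch
# carries no speaker ids and the embedding lookup receives None.
print(cfg.get("use_speaker_embedding"))  # expect True for multi-speaker
print(cfg.get("speakers_file"))          # expect a path to speakers.pth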

Environment

{
    "CUDA": {
        "GPU": ["NVIDIA GeForce GTX 1660 Ti"],
        "available": true,
        "version": "10.2"
    },
    "Packages": {
        "PyTorch_debug": false,
        "PyTorch_version": "1.11.0+cu102",
        "TTS": "0.6.1",
        "numpy": "1.19.5"
    },
    "System": {
        "OS": "Linux",
        "architecture": ["64bit", "ELF"],
        "processor": "x86_64",
        "python": "3.8.0",
        "version": "#118~18.04.1-Ubuntu SMP Thu Mar 3 13:53:15 UTC 2022"
    }
}

Additional context

No response

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. You might also look at our discussion channels.