coqui-ai / TTS

🐸💬 - a deep learning toolkit for Text-to-Speech, battle-tested in research and production
http://coqui.ai
Mozilla Public License 2.0

[Bug] Armenian Language model training fail #3766

Closed ghevond20 closed 2 weeks ago

ghevond20 commented 4 weeks ago

Describe the bug

I created a custom Armenian dataset in LJSpeech format and started from the GlowTTS training example in train.py, changing only my dataset path and the phoneme language: `"phoneme_language": "hy"`.

To Reproduce

```python
import os

# Trainer: Where the ✨️ happens.
# TrainingArgs: Defines the set of arguments of the Trainer.
from trainer import Trainer, TrainerArgs

# GlowTTSConfig: all model related values for training, validating and testing.
from TTS.tts.configs.glow_tts_config import GlowTTSConfig

# BaseDatasetConfig: defines name, formatter and path of the dataset.
from TTS.tts.configs.shared_configs import BaseDatasetConfig
from TTS.tts.datasets import load_tts_samples
from TTS.tts.models.glow_tts import GlowTTS
from TTS.tts.utils.text.tokenizer import TTSTokenizer
from TTS.utils.audio import AudioProcessor
from TTS.tts.utils.text.armenian.phonemizer import ArmenianPhonemizer

# We use the same path as this script as our training folder.
output_path = os.path.dirname(os.path.abspath(__file__))

# DEFINE DATASET CONFIG
# Set LJSpeech as our target dataset and define its path.
# You can also use a simple Dict to define the dataset and pass it to your custom formatter.
dataset_config = BaseDatasetConfig(
    formatter="ljspeech",
    meta_file_train="metadata.csv",
    path=os.path.join(output_path, "/ArmenianGorcakatar"),
)

# INITIALIZE THE TRAINING CONFIGURATION
# Configure the model. Every config class inherits the BaseTTSConfig.
config = GlowTTSConfig(
    batch_size=8,
    eval_batch_size=16,
    num_loader_workers=14,
    num_eval_loader_workers=14,
    run_eval=True,
    test_delay_epochs=-1,
    epochs=1000,
    text_cleaner="phoneme_cleaners",
    use_phonemes=True,
    phoneme_language="hy",
    phoneme_cache_path=os.path.join(output_path, "phoneme_cache"),
    print_step=25,
    print_eval=False,
    mixed_precision=True,
    output_path=output_path,
    datasets=[dataset_config],
)

# INITIALIZE THE AUDIO PROCESSOR
# Audio processor is used for feature extraction and audio I/O.
# It mainly serves to the dataloader and the training loggers.
ap = AudioProcessor.init_from_config(config)

# INITIALIZE THE TOKENIZER
# Tokenizer is used to convert text to sequences of token IDs.
# If characters are not defined in the config, default characters are passed to the config.
phonemizer = ArmenianPhonemizer()
tokenizer, config = TTSTokenizer.init_from_config(config)

# LOAD DATA SAMPLES
# Each sample is a list of [text, audio_file_path, speaker_name].
# You can define your custom sample loader returning the list of samples,
# or define your custom formatter and pass it to load_tts_samples.
# Check TTS.tts.datasets.load_tts_samples for more details.
train_samples, eval_samples = load_tts_samples(
    dataset_config,
    eval_split=True,
    eval_split_max_size=config.eval_split_max_size,
    eval_split_size=config.eval_split_size,
)

# INITIALIZE THE MODEL
# Models take a config object and a speaker manager as input.
# Config defines the details of the model like the number of layers, the size of the embedding, etc.
# Speaker manager is used by multi-speaker models.
model = GlowTTS(config, ap, tokenizer, speaker_manager=None)

# INITIALIZE THE TRAINER
# Trainer provides a generic API to train all the 🐸TTS models with all its perks like
# mixed-precision training, distributed training, etc.
trainer = Trainer(
    TrainerArgs(), config, output_path,
    model=model, train_samples=train_samples, eval_samples=eval_samples,
)

# AND... 3,2,1... 🚀
trainer.fit()
```
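One detail in the script worth checking: when the second argument to `os.path.join` is an absolute path (starts with `/`), Python discards all preceding components, so `os.path.join(output_path, "/ArmenianGorcakatar")` resolves to `/ArmenianGorcakatar` no matter what `output_path` is. That may be intentional here, but it is easy to miss. A quick stdlib demonstration:

```python
import os

# On POSIX, an absolute second component makes os.path.join drop
# everything before it.
output_path = "/home/user/tts_training"  # hypothetical training folder
print(os.path.join(output_path, "/ArmenianGorcakatar"))
# -> /ArmenianGorcakatar  (output_path is ignored)

# Dropping the leading slash keeps the intended nesting.
print(os.path.join(output_path, "ArmenianGorcakatar"))
# -> /home/user/tts_training/ArmenianGorcakatar
```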

Expected behavior

No response

Logs

No response

Environment

$ python3.9 collect_env_info.py 
{
    "CUDA": {
        "GPU": [
            "NVIDIA GeForce RTX 4060 Ti"
        ],
        "available": true,
        "version": "12.1"
    },
    "Packages": {
        "PyTorch_debug": false,
        "PyTorch_version": "2.3.0+cu121",
        "TTS": "0.22.0",
        "numpy": "1.26.4"
    },
    "System": {
        "OS": "Linux",
        "architecture": [
            "64bit",
            "ELF"
        ],
        "processor": "x86_64",
        "python": "3.9.5",
        "version": "#117~20.04.1-Ubuntu SMP Tue Apr 30 10:35:57 UTC 2024"
    }
}

Additional context

After starting training I get this warning:

`[!] Character 'ʰ' not found in the vocabulary. Discarding it.`

But when I check my dataset directly with espeak-ng, there is no such character 'ʰ' in the output:

```
$ espeak-ng -vhyw -q -x "Թաղեմ Կրնամ ապակի ուտել և ինծի անհանգիստ չըներ։"
t#ar"'em g@rn'am ,abag'i ud'el j'ev indz'i ,anhank#'isd tS#@n'er
```

Please help me find where the character 'ʰ' is generated.

eginhard commented 4 weeks ago

It's there; you passed the wrong flag to espeak (`-x` prints espeak's internal phoneme mnemonics, while `--ipa` prints IPA symbols):

```
$ espeak-ng -v hy -q --ipa "Թաղեմ Կրնամ ապակի ուտել և ինծի անհանգիստ չըներ։"
tʰaʀˈem kərnˈam ˌapakˈi utˈel jˈev intsˈi ˌanhanɡˈist tʃʰənˈer
```
ghevond20 commented 4 weeks ago

> It's there, you passed the wrong flag to espeak:
>
> ```
> $ espeak-ng -v hy -q --ipa "Թաղեմ Կրնամ ապակի ուտել և ինծի անհանգիստ չըներ։"
> tʰaʀˈem kərnˈam ˌapakˈi utˈel jˈev intsˈi ˌanhanɡˈist tʃʰənˈer
> ```

Thanks for the answer. Many phonemes use 'ʰ', so the warning `[!] Character 'ʰ' not found in the vocabulary. Discarding it.` fires on a lot of my data. How can I resolve this?
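For context on why the character disappears: the tokenizer keeps only symbols present in its configured vocabulary and drops everything else with that warning. A simplified, stdlib-only sketch of the behavior (this is NOT coqui's actual `TTSTokenizer` code; the real vocabulary is built from the defaults in `TTS/tts/utils/text/characters.py`):

```python
# Minimal sketch of character-level tokenization with a vocabulary filter.
# The vocabulary here is a small illustrative subset that lacks 'ʰ'.
vocab = list("tʃaʀemkərnpiu ˈˌ")
char_to_id = {c: i for i, c in enumerate(vocab)}

def encode(phonemes: str) -> list:
    """Map each phoneme character to its ID, discarding unknown symbols."""
    ids = []
    for ch in phonemes:
        if ch not in char_to_id:
            print(f"[!] Character '{ch}' not found in the vocabulary. Discarding it.")
            continue
        ids.append(char_to_id[ch])
    return ids

ids = encode("tʰaʀˈem")                 # 'ʰ' triggers the warning and is dropped
decoded = "".join(vocab[i] for i in ids)
print(decoded)                           # -> taʀˈem  (the aspiration mark is gone)
```

So the fix is to make 'ʰ' part of the model's character set, not to change the phonemizer output.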

ghevond20 commented 3 weeks ago

Problem is resolved i add char 'ʰ' in /TTS/tts/utils/text/characters.py on line _pulmonic_consonants = "pbtdʈɖcɟkɡqɢʔɴŋɲɳnɱmʙrʀⱱɾɽɸβfvθðszʃʒʂʐçʝxɣχʁħʕhɦɬɮʋɹɻjɰlɭʎʟʰ" And learn nice ) Thanks for answer ))