coqui-ai / TTS

🐸💬 - a deep learning toolkit for Text-to-Speech, battle-tested in research and production
http://coqui.ai
Mozilla Public License 2.0

[Bug] Armenian Language model training fail #3766

Closed ghevond20 closed 2 weeks ago

ghevond20 commented 4 weeks ago

Describe the bug

I created a custom Armenian dataset in LJSpeech format and started from the GlowTTS training example in train.py, changing only my dataset path and the phoneme language: `"phoneme_language": "hy"`.

To Reproduce

```python
import os

# Trainer: Where the ✨️ happens.
# TrainingArgs: Defines the set of arguments of the Trainer.
from trainer import Trainer, TrainerArgs

# GlowTTSConfig: all model related values for training, validating and testing.
from TTS.tts.configs.glow_tts_config import GlowTTSConfig

# BaseDatasetConfig: defines name, formatter and path of the dataset.
from TTS.tts.configs.shared_configs import BaseDatasetConfig
from TTS.tts.datasets import load_tts_samples
from TTS.tts.models.glow_tts import GlowTTS
from TTS.tts.utils.text.tokenizer import TTSTokenizer
from TTS.utils.audio import AudioProcessor
from TTS.tts.utils.text.armenian.phonemizer import ArmenianPhonemizer

# We use the same path as this script as our training folder.
output_path = os.path.dirname(os.path.abspath(__file__))

# DEFINE DATASET CONFIG
# Set LJSpeech as our target dataset and define its path.
# You can also use a simple Dict to define the dataset and pass it to your custom formatter.
dataset_config = BaseDatasetConfig(
    formatter="ljspeech",
    meta_file_train="metadata.csv",
    path=os.path.join(output_path, "/ArmenianGorcakatar"),
)

# INITIALIZE THE TRAINING CONFIGURATION
# Configure the model. Every config class inherits the BaseTTSConfig.
config = GlowTTSConfig(
    batch_size=8,
    eval_batch_size=16,
    num_loader_workers=14,
    num_eval_loader_workers=14,
    run_eval=True,
    test_delay_epochs=-1,
    epochs=1000,
    text_cleaner="phoneme_cleaners",
    use_phonemes=True,
    phoneme_language="hy",
    phoneme_cache_path=os.path.join(output_path, "phoneme_cache"),
    print_step=25,
    print_eval=False,
    mixed_precision=True,
    output_path=output_path,
    datasets=[dataset_config],
)

# INITIALIZE THE AUDIO PROCESSOR
# Audio processor is used for feature extraction and audio I/O.
# It mainly serves to the dataloader and the training loggers.
ap = AudioProcessor.init_from_config(config)

# INITIALIZE THE TOKENIZER
# Tokenizer is used to convert text to sequences of token IDs.
# If characters are not defined in the config, default characters are passed to the config.
phonemizer = ArmenianPhonemizer()
tokenizer, config = TTSTokenizer.init_from_config(config)

# LOAD DATA SAMPLES
# Each sample is a list of [text, audio_file_path, speaker_name].
# You can define your custom sample loader returning the list of samples,
# or define your custom formatter and pass it to load_tts_samples.
# Check TTS.tts.datasets.load_tts_samples for more details.
train_samples, eval_samples = load_tts_samples(
    dataset_config,
    eval_split=True,
    eval_split_max_size=config.eval_split_max_size,
    eval_split_size=config.eval_split_size,
)

# INITIALIZE THE MODEL
# Models take a config object and a speaker manager as input.
# Config defines the details of the model like the number of layers, the size of the embedding, etc.
# Speaker manager is used by multi-speaker models.
model = GlowTTS(config, ap, tokenizer, speaker_manager=None)

# INITIALIZE THE TRAINER
# Trainer provides a generic API to train all the 🐸TTS models with all its perks like
# mixed-precision training, distributed training, etc.
trainer = Trainer(
    TrainerArgs(), config, output_path,
    model=model, train_samples=train_samples, eval_samples=eval_samples,
)

# AND... 3,2,1... 🚀
trainer.fit()
```
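One detail in the script worth checking: when the second argument to `os.path.join` is an absolute path (starts with `/`), Python discards all preceding components, so `os.path.join(output_path, "/ArmenianGorcakatar")` resolves to `/ArmenianGorcakatar` no matter what `output_path` is. That may be intentional here, but it is easy to miss. A quick stdlib demonstration:

```python
import os

# On POSIX, an absolute second component makes os.path.join drop
# everything before it.
output_path = "/home/user/tts_training"  # hypothetical training folder
print(os.path.join(output_path, "/ArmenianGorcakatar"))
# -> /ArmenianGorcakatar  (output_path is ignored)

# Dropping the leading slash keeps the intended nesting.
print(os.path.join(output_path, "ArmenianGorcakatar"))
# -> /home/user/tts_training/ArmenianGorcakatar
```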

Expected behavior

No response

Logs

No response

Environment

$ python3.9 collect_env_info.py 
{
    "CUDA": {
        "GPU": [
            "NVIDIA GeForce RTX 4060 Ti"
        ],
        "available": true,
        "version": "12.1"
    },
    "Packages": {
        "PyTorch_debug": false,
        "PyTorch_version": "2.3.0+cu121",
        "TTS": "0.22.0",
        "numpy": "1.26.4"
    },
    "System": {
        "OS": "Linux",
        "architecture": [
            "64bit",
            "ELF"
        ],
        "processor": "x86_64",
        "python": "3.9.5",
        "version": "#117~20.04.1-Ubuntu SMP Tue Apr 30 10:35:57 UTC 2024"
    }
}

Additional context

After starting training I get this warning:

`[!] Character 'ʰ' not found in the vocabulary. Discarding it.`

But when I check my dataset directly with espeak-ng, there is no such character 'ʰ' in the output:

```
$ espeak-ng -vhyw -q -x "Թաղեմ Կրնամ ապակի ուտել և ինծի անհանգիստ չըներ։"
t#ar"'em g@rn'am ,abag'i ud'el j'ev indz'i ,anhank#'isd tS#@n'er
```

Please help me find where the character 'ʰ' is generated.

eginhard commented 4 weeks ago

It's there; you passed the wrong flag to espeak (`-x` prints espeak's internal phoneme mnemonics, while `--ipa` prints IPA symbols):

```
$ espeak-ng -v hy -q --ipa "Թաղեմ Կրնամ ապակի ուտել և ինծի անհանգիստ չըներ։"
tʰaʀˈem kərnˈam ˌapakˈi utˈel jˈev intsˈi ˌanhanɡˈist tʃʰənˈer
```
ghevond20 commented 4 weeks ago

> It's there, you passed the wrong flag to espeak:
>
> ```
> $ espeak-ng -v hy -q --ipa "Թաղեմ Կրնամ ապակի ուտել և ինծի անհանգիստ չըներ։"
> tʰaʀˈem kərnˈam ˌapakˈi utˈel jˈev intsˈi ˌanhanɡˈist tʃʰənˈer
> ```

Thanks for the answer. Many phonemes use 'ʰ', so the warning `[!] Character 'ʰ' not found in the vocabulary. Discarding it.` fires on a lot of my data. How can I resolve this?
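For context on why the character disappears: the tokenizer keeps only symbols present in its configured vocabulary and drops everything else with that warning. A simplified, stdlib-only sketch of the behavior (this is NOT coqui's actual `TTSTokenizer` code; the real vocabulary is built from the defaults in `TTS/tts/utils/text/characters.py`):

```python
# Minimal sketch of character-level tokenization with a vocabulary filter.
# The vocabulary here is a small illustrative subset that lacks 'ʰ'.
vocab = list("tʃaʀemkərnpiu ˈˌ")
char_to_id = {c: i for i, c in enumerate(vocab)}

def encode(phonemes: str) -> list:
    """Map each phoneme character to its ID, discarding unknown symbols."""
    ids = []
    for ch in phonemes:
        if ch not in char_to_id:
            print(f"[!] Character '{ch}' not found in the vocabulary. Discarding it.")
            continue
        ids.append(char_to_id[ch])
    return ids

ids = encode("tʰaʀˈem")                 # 'ʰ' triggers the warning and is dropped
decoded = "".join(vocab[i] for i in ids)
print(decoded)                           # -> taʀˈem  (the aspiration mark is gone)
```

So the fix is to make 'ʰ' part of the model's character set, not to change the phonemizer output.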

ghevond20 commented 3 weeks ago

Problem is resolved i add char 'ʰ' in /TTS/tts/utils/text/characters.py on line _pulmonic_consonants = "pbtdʈɖcɟkɡqɢʔɴŋɲɳnɱmʙrʀⱱɾɽɸβfvθðszʃʒʂʐçʝxɣχʁħʕhɦɬɮʋɹɻjɰlɭʎʟʰ" And learn nice ) Thanks for answer ))