coqui-ai / TTS

🐸💬 - a deep learning toolkit for Text-to-Speech, battle-tested in research and production
http://coqui.ai
Mozilla Public License 2.0
33.54k stars 4.08k forks

[Bug] Error while Fine-Tuning TTS for Japanese Language #3586

Closed mahimairaja closed 2 months ago

mahimairaja commented 7 months ago

Describe the bug

There seems to be a hidden issue in the dataset preparation step when fine-tuning TTS on the Japanese language.

To Reproduce

  1. Clone the repo and install the packages.
> git clone --branch xtts_demo -q https://github.com/coqui-ai/TTS.git

> pip install --use-deprecated=legacy-resolver -q -e TTS

> pip install --use-deprecated=legacy-resolver -q -r TTS/TTS/demos/xtts_ft_demo/requirements.txt

> pip install -q typing_extensions==4.8 numpy==1.26.2
  2. Launch the Fine-Tuning GUI.
>  python TTS/TTS/demos/xtts_ft_demo/xtts_demo.py
  3. Add a few Japanese speech audio samples in the dataset processing tab and click Create Dataset.

  4. Move to the fine-tuning tab and run the training.

And the error message pops up:

The training was interrupted due to an error!! Please check the console for the full error message! Error summary:

Traceback (most recent call last):
  File "/content/TTS/TTS/demos/xtts_ft_demo/xtts_demo.py", line 284, in train_model
    config_path, original_xtts_checkpoint, vocab_file, exp_path, speaker_wav = train_gpt(language, num_epochs, batch_size, grad_acumm, train_csv, eval_csv, output_path=output_path, max_audio_length=max_audio_length)
  File "/content/TTS/TTS/demos/xtts_ft_demo/utils/gpt_train.py", line 138, in train_gpt
    train_samples, eval_samples = load_tts_samples(
  File "/content/TTS/TTS/tts/datasets/__init__.py", line 121, in load_tts_samples
    assert len(meta_data_train) > 0, f" [!] No training samples found in {root_path}/{meta_file_train}"
AssertionError:  [!] No training samples found in /tmp/xtts_ft/dataset//tmp/xtts_ft/dataset/metadata_train.csv

Expected behavior

The fine-tuning process should run without interruption.

Logs

>> DVAE weights restored from: /tmp/xtts_ft/run/training/XTTS_v2.0_original_model_files/dvae.pth
Traceback (most recent call last):
  File "/content/TTS/TTS/demos/xtts_ft_demo/xtts_demo.py", line 284, in train_model
    config_path, original_xtts_checkpoint, vocab_file, exp_path, speaker_wav = train_gpt(language, num_epochs, batch_size, grad_acumm, train_csv, eval_csv, output_path=output_path, max_audio_length=max_audio_length)
  File "/content/TTS/TTS/demos/xtts_ft_demo/utils/gpt_train.py", line 138, in train_gpt
    train_samples, eval_samples = load_tts_samples(
  File "/content/TTS/TTS/tts/datasets/__init__.py", line 121, in load_tts_samples
    assert len(meta_data_train) > 0, f" [!] No training samples found in {root_path}/{meta_file_train}"
AssertionError:  [!] No training samples found in /tmp/xtts_ft/dataset//tmp/xtts_ft/dataset/metadata_train.csv
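Two things are going on in that traceback. The assertion fires because no samples were parsed from the metadata file, and the doubled path in the message (`/tmp/xtts_ft/dataset//tmp/xtts_ft/dataset/metadata_train.csv`) is an artifact of the assertion string naively concatenating `root_path` and `meta_file_train`, which suggests the metadata file is being passed as an already-absolute path. A minimal sketch reproducing the string behavior, assuming the values shown in the traceback (the defensive normalization at the end is a hypothetical workaround, not the demo's actual code):

```python
import os

# Values as they appear in the traceback; the second one is the
# suspicious part, an already-absolute metadata path.
root_path = "/tmp/xtts_ft/dataset"
meta_file_train = "/tmp/xtts_ft/dataset/metadata_train.csv"

# The assertion message concatenates the two with a literal "/",
# which is why the path appears doubled in the error text:
doubled = f"{root_path}/{meta_file_train}"
print(doubled)
# -> /tmp/xtts_ft/dataset//tmp/xtts_ft/dataset/metadata_train.csv

# A defensive check before handing the file name to a loader that
# joins it with the root: make it relative to the dataset root.
if os.path.isabs(meta_file_train):
    meta_file_train = os.path.relpath(meta_file_train, root_path)
print(os.path.join(root_path, meta_file_train))
# -> /tmp/xtts_ft/dataset/metadata_train.csv
```

Note that fixing the path display alone would not make training run: the assertion checks `len(meta_data_train) > 0`, so the metadata CSV genuinely contains no usable rows.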

Environment

{
    "CUDA": {
        "GPU": [
            "Tesla T4"
        ],
        "available": true,
        "version": "12.1"
    },
    "Packages": {
        "PyTorch_debug": false,
        "PyTorch_version": "2.1.0+cu121",
        "TTS": "0.20.6",
        "numpy": "1.26.2"
    },
    "System": {
        "OS": "Linux",
        "architecture": [
            "64bit",
            "ELF"
        ],
        "processor": "x86_64",
        "python": "3.10.12",
        "version": "#1 SMP PREEMPT_DYNAMIC Sat Nov 18 15:31:17 UTC 2023"
    }
}

Additional context

No response

jianchang512 commented 7 months ago

The same error occurs for Chinese; the data preprocessing function doesn't seem to work with CJK characters.
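If CJK transcription is the culprit, the quickest way to confirm is to inspect the metadata CSV the demo wrote and count rows that actually carry text. A hedged sketch, assuming the pipe-delimited Coqui metadata layout (`audio_file|text|speaker_name`); adjust the delimiter and column index if your generated file differs:

```python
import csv

def count_usable_rows(metadata_csv: str) -> int:
    """Count metadata rows that carry a non-empty transcription.

    Assumes a pipe-delimited layout (audio_file|text|speaker_name).
    Rows with a missing or whitespace-only text column are the symptom
    reported here for CJK input: clips exist, transcriptions do not.
    """
    usable = 0
    with open(metadata_csv, newline="", encoding="utf-8") as f:
        for row in csv.reader(f, delimiter="|"):
            if len(row) >= 2 and row[1].strip():
                usable += 1
    return usable
```

If this returns 0 for `/tmp/xtts_ft/dataset/metadata_train.csv`, the failure is in dataset creation (transcription/segmentation), not in the training tab.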

mahimairaja commented 7 months ago

Alright, is anyone already working on this issue?

rose07 commented 7 months ago

This website is also owned by Microsoft. You can give it a try

https://tts.byylook.com/ai/text-to-speech

zaher-m commented 5 months ago

This error message, AssertionError: [!] No training samples found in /tmp/xtts_ft/dataset//tmp/xtts_ft/dataset/metadata_train.csv, happens because the dataset processing step didn't generate any dataset, and the fine-tuning step (next tab) relies on it.
Your dataset directory should have the following structure after the dataset processing is done:

[screenshot: dataset directory structure]

where the wavs directory contains the dataset audio divided into clips, and metadata_train.csv and metadata_eval.csv map these clips to their corresponding transcriptions. See below, where Arabic voices were used:

[screenshot: metadata CSV contents with Arabic transcriptions]
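The layout described above can be verified with a short script before switching to the fine-tuning tab. A minimal sketch, assuming the `/tmp/xtts_ft/dataset` layout from the error (`wavs/` plus the two metadata CSVs); the function name is hypothetical:

```python
from pathlib import Path

def check_dataset(root: str) -> list[str]:
    """Report what is missing from the expected dataset layout:
    a wavs/ directory with clips plus metadata_train.csv and
    metadata_eval.csv. Returns an empty list when all is present."""
    root_dir = Path(root)
    problems = []
    wavs = root_dir / "wavs"
    if not wavs.is_dir():
        problems.append("missing wavs/ directory")
    elif not any(wavs.glob("*.wav")):
        problems.append("wavs/ contains no .wav clips")
    for name in ("metadata_train.csv", "metadata_eval.csv"):
        meta = root_dir / name
        if not meta.is_file():
            problems.append(f"missing {name}")
        elif meta.stat().st_size == 0:
            problems.append(f"{name} is empty")
    return problems

# Example: print any problems found in the demo's default output dir.
for problem in check_dataset("/tmp/xtts_ft/dataset"):
    print("[!]", problem)
```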

stale[bot] commented 3 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. You might also look at our discussion channels.

Utk-bot commented 4 days ago

@jianchang512 @rose07 @zaher-m Can you please provide code to fine-tune XTTS-v2?