coqui-ai / TTS

🐸💬 - a deep learning toolkit for Text-to-Speech, battle-tested in research and production
http://coqui.ai
Mozilla Public License 2.0

[Bug] In XTTSv1 and XTTSv2 I get the error ❗ len(DataLoader) returns 0 with non-English data #3229

Closed lpscr closed 1 year ago

lpscr commented 1 year ago

Describe the bug

Hi, first I want to thank you for all your amazing work!

Please check this if you have a little time; I also have a notebook that makes the problem easy to debug.

I followed the steps in the XTTSv1 and XTTSv2 recipes:

https://github.com/coqui-ai/TTS/tree/dev/recipes/ljspeech/xtts_v1 https://github.com/coqui-ai/TTS/tree/dev/recipes/ljspeech/xtts_v2

AssertionError: ❗ len(DataLoader) returns 0. Make sure your dataset is not empty or len(dataset) > 0.

If I test with VITS or GlowTTS everything works fine without any problem and I can complete the training: https://github.com/coqui-ai/TTS/tree/dev/recipes/ljspeech/vits_tts

To Reproduce

I have created a simple notebook that makes the problem easy to see; I have also fixed the dataset I use from Kaggle.

testTTSV1_2.zip

Expected behavior

No response

Logs

> Training Environment:
 | > Backend: Torch
 | > Mixed precision: False
 | > Precision: float32
 | > Num. of CPUs: 2
 | > Num. of Torch Threads: 1
 | > Torch seed: 1
 | > Torch CUDNN: True
 | > Torch CUDNN deterministic: False
 | > Torch CUDNN benchmark: False
 | > Torch TF32 MatMul: False
 > Start Tensorboard: tensorboard --logdir=/content/GPT_XTTS_LJSpeech_FT_new-November-15-2023_03+10PM-0000000

 > Model has 518128803 parameters

>> DVAE weights restored from: /content/XTTS_v1.1_original_model_files/dvae.pth
 | > Found 1844 files in /content/el

 > EPOCH: 0/1000
 --> /content/GPT_XTTS_LJSpeech_FT_new-November-15-2023_03+10PM-0000000
 ! Run is removed from /content/GPT_XTTS_LJSpeech_FT_new-November-15-2023_03+10PM-0000000

 > Filtering invalid eval samples!!
 > Total eval samples after filtering: 0
Internal Python error in the inspect module.
Below is the traceback from this internal error.

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/trainer/trainer.py", line 1808, in fit
    self._fit()
  File "/usr/local/lib/python3.10/dist-packages/trainer/trainer.py", line 1762, in _fit
    self.eval_epoch()
  File "/usr/local/lib/python3.10/dist-packages/trainer/trainer.py", line 1610, in eval_epoch
    self.get_eval_dataloader(
  File "/usr/local/lib/python3.10/dist-packages/trainer/trainer.py", line 976, in get_eval_dataloader
    return self._get_loader(
  File "/usr/local/lib/python3.10/dist-packages/trainer/trainer.py", line 900, in _get_loader
    len(loader) > 0
AssertionError:  ❗ len(DataLoader) returns 0. Make sure your dataset is not empty or len(dataset) > 0.

Environment

{
    "CUDA": {
        "GPU": [],
        "available": false,
        "version": "11.8"
    },
    "Packages": {
        "PyTorch_debug": false,
        "PyTorch_version": "2.1.0+cu118",
        "TTS": "0.20.5",
        "numpy": "1.22.0"
    },
    "System": {
        "OS": "Linux",
        "architecture": [
            "64bit",
            "ELF"
        ],
        "processor": "x86_64",
        "python": "3.10.12",
        "version": "#1 SMP Wed Aug 30 11:19:59 UTC 2023"
    }
}

Additional context

No response

erogol commented 1 year ago

How many samples do you have?

stlohrey commented 1 year ago

I would guess the problem is that the tokenizer does not support Greek, which causes errors when loading your data, resulting in an empty data loader.

lpscr commented 1 year ago

> How many samples do you have?

Thanks for your quick reply.

First test (as used in the notebook):

wav format
Channels: 1
Sample Width: 4
Frame Rate: 22050

files 
Minimum Duration: 1.237 seconds
Maximum Duration: 26.4 seconds
Total Duration: 04:03:40 

Total Files: 1820

Second test:

wav format
Channels: 1
Sample Width: 2
Frame Rate: 22050

files 
Minimum Duration: 1.0 seconds
Maximum Duration: 18.443 seconds
Total Duration: 09:20:49 

Total Files: 8526

I use this data with the VITS model and it trains fine without any problems.

The problem appears only when I use it with XTTSv1 or XTTSv2; I tested both and got the same error.

I also tried running a phonemizer or unidecode over metadata.csv, as @stlohrey suggested, so there are no Greek characters; same problem, it doesn't help :(

from this

Paramythi_horis_onoma_0065|- Τι είναι πάλι οι φωνές;|- Τι είναι πάλι οι φωνές;
to
Paramythi_horis_onoma_0065|tˈi ˈeːnaɪ pˈali ˈoɪ fɔːnˈes?|tˈi ˈeːnaɪ pˈali ˈoɪ fɔːnˈes?

I also tried unidecode from Greek to English, so that there are only English characters, and I get the same problem again:

from this
Paramythi_horis_onoma_0065|- Τι είναι πάλι οι φωνές;|- Τι είναι πάλι οι φωνές;
to
Paramythi_horis_onoma_0065|- Ti einai pali oi phones;|- Ti einai pali oi phones;

Thank you for your time.

Let me know if you want me to test something else.

AIFSH commented 1 year ago

https://github.com/coqui-ai/TTS/issues/3206#issue-1990021754

Same as my issue.

stlohrey commented 1 year ago

I just reproduced your error by changing the language code of my working dataset to "grc" (as set in your notebook). So try changing the language code back to "en" together with the English characters; it should work.

lpscr commented 1 year ago

@stlohrey thank you! It looks like the problem is the language code together with the Greek characters. As you say, it works with English characters, but then I still need a phonemizer; I don't think training will be good if I only convert the text with unidecode.

@erogol can you please check whether this can be fixed? If I use the "grc" or "el" language code, as @stlohrey says, it doesn't work, but with VITS it works fine. Could you check this, please?

The problem is here, as you say:

config_dataset = BaseDatasetConfig(
    formatter="ljspeech",
    dataset_name="ljspeech",
    path="/home/lpc/appPython/ttsNew/lora",
    meta_file_train="metadata_fix.csv",
    language="en", # <<<  if i use grc code or el dont working problem
) 

Is it somehow possible to use a phonemizer for better training, like I do with VITS?

If I use it like this, it doesn't work; same problem again:

        text_cleaner="phoneme_cleaners",
        use_phonemes=True,
        phoneme_language="grc",
        phoneme_cache_path="phoneme_cache",    

The unidecode method passes:

Paramythi_horis_onoma_0065|- Τι είναι πάλι οι φωνές;|- Τι είναι πάλι οι φωνές;
to
Paramythi_horis_onoma_0065|- Ti einai pali oi phones;|- Ti einai pali oi phones;

The phonemizer method gives the len(DataLoader) error. I use it like this with VITS; is it possible to use it with XTTSv1/XTTSv2 as well?

Paramythi_horis_onoma_0065|- Τι είναι πάλι οι φωνές;|- Τι είναι πάλι οι φωνές;
to
Paramythi_horis_onoma_0065|tˈi ˈeːnaɪ pˈali ˈoɪ fɔːnˈes?|tˈi ˈeːnaɪ pˈali ˈoɪ fɔːnˈes?

thank you

stlohrey commented 1 year ago

I think for adding a new language you would have to work on the tokenizer script to introduce text cleaners for the new language, but also on the vocab, because the language code is tokenized together with the text, which means you would need to train the GPT model on the new tokens. I also don't think IPA phonemes are represented in the provided vocab.json file.

lpscr commented 1 year ago

How can I train the GPT model with the new tokens? Can you please give me more info about this?

stlohrey commented 1 year ago

You would need to change the tokenizer vocab and the model config, maybe add a text cleaner for your language, and then run the trainer. I don't know whether transfer learning from the pretrained checkpoint with new or modified input tokens works or makes sense. I also don't know whether you would have to fine-tune the quantizer and the HiFi decoder with audio in the new language. Maybe @erogol can give some hints on that topic.

lpscr commented 1 year ago

@stlohrey thank you very much for all your help :) I understand I need to wait for some help from the team.

@erogol I know you're working hard, and I really appreciate it. Any hints on how to add a new language or fine-tune in Greek with the power of the new XTTS model would be great. I know Greek is already supported in VITS; is there any chance Greek could be supported in XTTS in the next release? Thanks again for your amazing work.

AIFSH commented 1 year ago

config_dataset = BaseDatasetConfig(
    formatter="ljspeech",
    dataset_name="ljspeech",
    path="/home/lpc/appPython/ttsNew/lora",
    meta_file_train="metadata_fix.csv",
    language="en",  # <<< if I use the "grc" or "el" code here it doesn't work
)

I assigned meta_file_val and could run with a Chinese dataset; maybe you can try that.
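
Roughly what I mean, as a sketch (paths and file names here are placeholders, not the exact ones from my setup):

from TTS.config.shared_configs import BaseDatasetConfig

# Pass an explicit eval CSV via meta_file_val so the trainer does not
# auto-select (and then filter away) all the eval samples.
config_dataset = BaseDatasetConfig(
    formatter="ljspeech",
    dataset_name="ljspeech",
    path="/path/to/dataset",              # placeholder path
    meta_file_train="metadata_train.csv",
    meta_file_val="metadata_eval.csv",    # explicit eval split
    language="en",
)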

lpscr commented 1 year ago

@AIFSH thank you, I tested it like you say and it doesn't work; same problem again.

brambox commented 1 year ago

You can train over whichever existing language ID is closest to yours; over time the accent mostly clears up. I was more or less able to add Bulgarian [bg] by editing [TTS/tts/configs/xtts_config.py], then [TTS/tts/layers/xtts/tokenizer.py], and adding it in vocab.json. The problem is that it still seems to reuse some of the existing language and keeps a slight accent. We need an ID with as clean an accent as possible; maybe the developers can tell us if there is one? I'm not sure we will ever get a train-from-scratch model.
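
The config part is the small edit; instead of editing the file you can also extend the list at runtime. A rough sketch, assuming the `languages` list attribute of XttsConfig (that is what I edited in xtts_config.py):

from TTS.tts.configs.xtts_config import XttsConfig

# XttsConfig keeps a list of supported language codes; adding a new code here
# only makes the config accept it. The tokenizer and vocab.json still need the
# other edits described above.
config = XttsConfig()
if "bg" not in config.languages:
    config.languages.append("bg")
print(config.languages)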

lpscr commented 1 year ago

@brambox hi, thank you for the info and your help.

1. I went to [TTS/tts/configs/xtts_config.py] and added the language to the dictionary there.
2. I went to [TTS/tts/layers/xtts/tokenizer.py] and changed a lot of things here; everything looks OK.
3. I went to vocab.json in the downloaded v2 model folder, but when I try to change anything there I get a length error, because changing it alters the dictionary size; so I can't change anything there and I am stuck.

What I found and am trying now works without changing any of the original script files:

I just reuse symbols that are already registered inside vocab.json,

 "a": 14,
 "b": 15,
 "c": 16,
etc...

and I make a simple dictionary map to replace the characters I need in my text:

"α" : "a"
"β" : "v" 
"γ" : "g"
etc...
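
As a rough sketch of that replacement step (the mapping below is only a partial example, not a full Greek transliteration table, and the metadata file names are just the ones from my dataset):

# Map Greek letters to Latin symbols that already exist in vocab.json and
# rewrite the metadata file so training only sees known characters.
greek_to_latin = {
    "α": "a",
    "β": "v",
    "γ": "g",
    # ... extend with the rest of the alphabet, accented vowels, etc.
}

def transliterate(text):
    return "".join(greek_to_latin.get(ch, ch) for ch in text)

with open("metadata.csv", encoding="utf-8") as src, \
     open("metadata_fix.csv", "w", encoding="utf-8") as dst:
    for line in src:
        file_id, text, norm_text = line.rstrip("\n").split("|")
        dst.write(f"{file_id}|{transliterate(text)}|{transliterate(norm_text)}\n")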

I wonder how you changed vocab.json, because when I try to add or change anything I get an error; let me know.

And yes, it would be great if one of the developers could give more help.

brambox commented 1 year ago

Try just adding the language without anything more, put some random ID, start training, and see the result that way.

lpscr commented 1 year ago

@brambox you mean go inside vocab.json and add the language? Can you write down exactly what you added and where (the line index), so I understand better? If I change or add anything I get an error. I use XTTSv2:

size mismatch for gpt.text_embedding.weight: copying a param with shape torch.Size([6153, 1024]) from checkpoint, the shape in current model is torch.Size([6154, 1024]).
        size mismatch for gpt.text_head.weight: copying a param with shape torch.Size([6153, 1024]) from checkpoint, the shape in current model is torch.Size([6154, 1024]).
        size mismatch for gpt.text_head.bias: copying a param with shape torch.Size([6153]) from checkpoint, the shape in current model is torch.Size([6154]).

brambox commented 1 year ago

It seems we are only working around it to use a new language; it still has to train over an existing one. So if, for example, you change { "id": 6152, "special": true, "content": "[ko]", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false } and then go to "[ko]": 6152 and change both to "gr", training will start. You can use any of the other language IDs the same way, so pick the one closest to your language's accent. Maybe you can also add custom symbols, but I haven't tested that. Also, if you remove "[ko]": 6152, you can then use any ID of 6152 or higher in { "id": 6152, "special": true, ... }, but it will still use the language you removed even though the ID is different. We need the developers to actually support adding a completely new language, or at least give us some accent-free IDs we can train over.
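
Roughly like this as a script, assuming the vocab.json layout quoted above (an "added_tokens" list plus the "model"/"vocab" mapping) and using "[el]" as the new code; back up the file first, since the exact layout may differ between model releases:

import json

VOCAB_PATH = "/path/to/XTTS_v2_model_files/vocab.json"  # placeholder path

with open(VOCAB_PATH, encoding="utf-8") as f:
    vocab = json.load(f)

# Rename the special language token in added_tokens; the id (6152) stays the
# same, so the checkpoint's embedding size still matches.
for tok in vocab.get("added_tokens", []):
    if tok.get("content") == "[ko]":
        tok["content"] = "[el]"

# Rename the same entry in the tokenizer vocabulary, keeping its id.
model_vocab = vocab["model"]["vocab"]
if "[ko]" in model_vocab:
    model_vocab["[el]"] = model_vocab.pop("[ko]")

with open(VOCAB_PATH, "w", encoding="utf-8") as f:
    json.dump(vocab, f, ensure_ascii=False, indent=2)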

lpscr commented 1 year ago

OK, this works: you can replace the language like you say and training starts. But you also need to replace the characters in metadata.csv with ones already registered in vocab.json, otherwise you get the len(DataLoader) returns 0 error, because there are no Greek characters in vocab.json. And if you try to add new symbols or replace existing ones, for example "a": 14 with "α": 14, you get an error. So I guess I don't need to change anything in the scripts; in my case the only thing that works is to replace the text in metadata.csv with English characters only, and then training starts with language "en", like in my previous messages with unidecode.

I wonder how the results are in your case with this method: is the accent clear, how many hours of data did you use, and can you compare it with VITS to see whether this is working?

I guess we need to wait for someone to help with this, like you say.

brambox commented 1 year ago

Overall it's pretty decent. The accent mostly clears up over time; only a little remains.
The biggest problem I find for Bulgarian is that word stress is sometimes applied incorrectly, even if the word is in the dataset. Sometimes it says it right, sometimes wrong, and sometimes it's always wrong, especially for words not in the dataset. I wish there were some way to hard-force the stress position. We also have words that are written the same but mean different things depending on stress, so some kind of easier control for it would be a big help. It is definitely much more natural than VITS. I tested different dataset sizes but settled on around 2.5 to 3 hours, and I noticed that longer datasets do not always result in a better model.

lpscr commented 1 year ago

Yes, same here: more hours make things worse. I tried with 4 hours and then 12 hours, and I often get the same phrases repeated, sometimes an English accent, sometimes very slow or fast speech, and noise. So as I see it, this doesn't work for my language, and from what you say it looks similar for yours. And yes, when it does work it sounds more natural than the VITS model. I don't know whether it's possible to somehow make this work in this XTTS version or whether we need to wait for the next release; I hope someone helps more with this.

I have some questions, to check that I am doing things correctly:

1. How many hours are needed for fine-tuning? (I tried 4 to 12 hours.)
2. What minimum and maximum audio durations can be used? (I use min 1 s, max 11 s, 22050 Hz mono, with clear speech and no noise.)
3. Roughly how many steps are needed? (About 300k, roughly 1 day of training in my case.)
4. Does using multiple speakers help training, or is a single speaker better? (I tried both single and multiple speakers.)
5. Is it OK to replace the characters with English symbols, like unidecode does? (I use the unidecode method and also replace some characters so they work correctly in my language.)
6. Which model is best to use, XTTS v1 or v2? (I use only the latest model, v2.)
7. Can I train from scratch without using any checkpoint, like in VITS?

thank you for your time

78Alpha commented 1 year ago

Actually ran into this issue with an English dataset. It always filters down to 0 samples.

arbianqx commented 1 year ago

I ran into the same problem as well. I tried the manual fixes suggested above, with no luck.

Edresson commented 1 year ago

Hi @lpscr, hi @arbianqx,

This message "> Total eval samples after filtering: 0" indicates that you don't have any eval samples that meet the training requirements. It can be caused by three reasons:

  1. The eval CSV that you provided is empty;
  2. The samples in the eval CSV that you provided are longer than the max_wav_len and max_text_len defined in the recipe (https://github.com/coqui-ai/TTS/blob/dev/recipes/ljspeech/xtts_v2/train_gpt_xtts.py#L86C1-L87C29). Note that we do not recommend changing these values for fine-tuning;
  3. You did not provide an eval CSV and all of the automatically selected samples are longer than max_wav_length and max_text_length.

In all these scenarios, you need to change (or create) your eval CSV to meet the requirements for training.

In your case it looks like the issue is 3: you didn't provide an eval CSV and all of the automatically selected samples are too long, so every evaluation sample gets filtered out. I would recommend creating an eval CSV from part of your data (around 15%), making sure that the samples respect max_wav_length (11 seconds) and max_text_length (200 tokens).
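
For example, a rough sketch of such a split (the paths, the ljspeech-style wavs/ layout, and measuring text length in characters are assumptions; adjust them to your dataset):

import random
import wave

DATASET_PATH = "/path/to/dataset"  # placeholder
MAX_WAV_SECONDS = 11.0             # max_wav_length from the recipe
MAX_TEXT_LEN = 200                 # max_text_length from the recipe (approximated in characters here)

def wav_duration(path):
    # Assumes PCM WAV files readable by the standard wave module.
    with wave.open(path, "rb") as w:
        return w.getnframes() / w.getframerate()

# Keep only samples that respect the length limits, then hold out ~15% for eval.
rows = []
with open(f"{DATASET_PATH}/metadata.csv", encoding="utf-8") as f:
    for line in f:
        file_id, text, norm_text = line.rstrip("\n").split("|")
        wav_path = f"{DATASET_PATH}/wavs/{file_id}.wav"
        if wav_duration(wav_path) <= MAX_WAV_SECONDS and len(text) <= MAX_TEXT_LEN:
            rows.append((file_id, text, norm_text))

random.shuffle(rows)
split = int(len(rows) * 0.15)
splits = {"metadata_eval.csv": rows[:split], "metadata_train.csv": rows[split:]}
for name, subset in splits.items():
    with open(f"{DATASET_PATH}/{name}", "w", encoding="utf-8") as f:
        for file_id, text, norm_text in subset:
            f.write(f"{file_id}|{text}|{norm_text}\n")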

Alternatively, the PR https://github.com/coqui-ai/TTS/pull/3296 implements a Gradio demo for data processing plus training and inference for the XTTS model. The PR also includes a Google Colab, and soon we will post a video showing how to use the demo.

erogol commented 1 year ago

Feel free to reopen if the comment above doesn't help.

barghavanii commented 9 months ago

@erogol @Edresson Hi, I intend to fine-tune xtts_v2 for the Persian language. According to the comment above, 2-4 hours of data is OK for fine-tuning the model, but you didn't answer whether that is fine or not. Can you advise me ASAP? This is very urgent for me. https://github.com/coqui-ai/TTS/issues/3229#issuecomment-1817953283