How many samples do you have? I would guess the problem is that the tokenizer does not support Greek, which causes an error when loading your data, resulting in an empty data loader.
Thanks for your quick reply.
First test (the one used in the notebook):
WAV format:
Channels: 1
Sample Width: 4
Frame Rate: 22050
Files:
Minimum Duration: 1.237 seconds
Maximum Duration: 26.4 seconds
Total Duration: 04:03:40
Total Files: 1820
Second test:
WAV format:
Channels: 1
Sample Width: 2
Frame Rate: 22050
Files:
Minimum Duration: 1.0 seconds
Maximum Duration: 18.443 seconds
Total Duration: 09:20:49
Total Files: 8526
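For reference, a minimal sketch of how stats like the above can be gathered (assumptions: pydub is installed and the clips sit in a wavs/ folder):

from pathlib import Path
from pydub import AudioSegment

durations = []
for path in sorted(Path("wavs").glob("*.wav")):
    seg = AudioSegment.from_wav(path)
    durations.append(seg.duration_seconds)

# seg holds the last clip read; the format fields are uniform across the dataset here
print("Channels:", seg.channels)
print("Sample Width:", seg.sample_width)
print("Frame Rate:", seg.frame_rate)
print(f"Minimum Duration: {min(durations):.3f} seconds")
print(f"Maximum Duration: {max(durations):.3f} seconds")
total = int(sum(durations))
print("Total Duration: %02d:%02d:%02d" % (total // 3600, total % 3600 // 60, total % 60))
print("Total Files:", len(durations))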
I use this data with the VITS model and it works fine, without any problem.
The problem appears when I use it with XTTS v1 or XTTS v2; I tested both, same problem.
I also tried the phonemizer and unidecode on metadata.csv, like @stlohrey says, so it's not the Greek characters; same problem, it didn't help :(
From this:
Paramythi_horis_onoma_0065|- Τι είναι πάλι οι φωνές;|- Τι είναι πάλι οι φωνές;
to this:
Paramythi_horis_onoma_0065|tˈi ˈeːnaɪ pˈali ˈoɪ fɔːnˈes?|tˈi ˈeːnaɪ pˈali ˈoɪ fɔːnˈes?
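For context, this is roughly how such a phonemized column can be produced with the phonemizer package and its espeak backend (a sketch; it assumes espeak-ng is installed, and the exact flags used for the original conversion are not stated in this thread):

from phonemizer import phonemize

text = "Τι είναι πάλι οι φωνές;"
# "el" is the espeak code for Modern Greek; with_stress keeps the ˈ stress marks
ipa = phonemize(text, language="el", backend="espeak", strip=True, with_stress=True)
print(ipa)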
I also tried unidecode from Greek to English, so that I have only English characters, and got the same problem again.
From this:
Paramythi_horis_onoma_0065|- Τι είναι πάλι οι φωνές;|- Τι είναι πάλι οι φωνές;
to this:
Paramythi_horis_onoma_0065|- Ti einai pali oi phones;|- Ti einai pali oi phones;
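The unidecode conversion above can be reproduced with something like this (a sketch; the unidecode package maps Greek letters to Latin approximations):

from unidecode import unidecode

line = "Paramythi_horis_onoma_0065|- Τι είναι πάλι οι φωνές;|- Τι είναι πάλι οι φωνές;"
file_id, text, norm = line.split("|")
# Transliterate both text columns, leave the file id untouched
print("|".join([file_id, unidecode(text), unidecode(norm)]))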
Thank you for your time.
Let me know if you want me to test anything else.
Same as my issue: https://github.com/coqui-ai/TTS/issues/3206#issue-1990021754
I just reproduced your error by changing the language code on my working dataset to "grc" (as set in your notebook). So try changing the language code together with the English characters; it should work.
@stlohrey thank you! It looks like the problem is the language code and the Greek characters, like you say. With that change it works, but the problem is that I need the phonemizer; I don't think the training will be as good if I only convert with unidecode.
@erogol can you please check this problem and fix it if possible? If I use the "grc" or "el" language code, like @stlohrey says, it doesn't work, but in VITS it works fine. Can you check this please?
The problem is here, like you say:
config_dataset = BaseDatasetConfig(
    formatter="ljspeech",
    dataset_name="ljspeech",
    path="/home/lpc/appPython/ttsNew/lora",
    meta_file_train="metadata_fix.csv",
    language="en",  # <<< if I use the "grc" or "el" code it doesn't work
)
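For reference, the supported codes can be checked directly (a sketch; I assume the XttsConfig class exposes its language list, as the xtts_config.py edits mentioned later in this thread suggest):

from TTS.tts.configs.xtts_config import XttsConfig

config = XttsConfig()
# The stock config ships with a fixed set of language codes;
# "en" is in it, while "el" and "grc" are not, which matches the failure above
print("en" in config.languages, "el" in config.languages, "grc" in config.languages)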
Is it somehow possible to use the phonemizer for better training, like I do with VITS?
If I use it like this, it doesn't work either; same problem again (see also the aside after this snippet):
text_cleaner="phoneme_cleaners",
use_phonemes=True,
phoneme_language="grc",
phoneme_cache_path="phoneme_cache",
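One aside on the code itself (based on the ISO 639 tables, so an assumption worth checking): "grc" is Ancient Greek, while Modern Greek is "el". The backend can be queried to see what it actually supports:

from phonemizer.backend import EspeakBackend

langs = EspeakBackend.supported_languages()
# "el" is Modern Greek; "grc" is Ancient Greek in espeak-ng's tables
print(langs.get("el"), langs.get("grc"))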
The method with unidecode passes:
Paramythi_horis_onoma_0065|- Τι είναι πάλι οι φωνές;|- Τι είναι πάλι οι φωνές;
to this:
Paramythi_horis_onoma_0065|- Ti einai pali oi phones;|- Ti einai pali oi phones;
The method with the phonemizer gives the len error. I also use this with VITS; is it possible to use it with XTTS v1/v2 as well?
Paramythi_horis_onoma_0065|- Τι είναι πάλι οι φωνές;|- Τι είναι πάλι οι φωνές;
to this:
Paramythi_horis_onoma_0065|tˈi ˈeːnaɪ pˈali ˈoɪ fɔːnˈes?|tˈi ˈeːnaɪ pˈali ˈoɪ fɔːnˈes?
Thank you.
I think for adding a new language you would have to work on the tokenizer script to introduce text cleaners for the new language, but also on the vocab, because the language code is tokenized together with the text, and that means you would need to train the GPT model on the new tokens. I also don't think IPA phonemes are represented in the vocab.json file provided.
How can I train the GPT model with the new tokens? Can you please give me more info about this?
You would need to change the tokenizer vocab and the model config, maybe add a text cleaner for your language, and then run the trainer. I don't know if transfer learning from the pretrained checkpoint with new or modified input tokens works or makes sense. I also don't know if you would have to fine-tune the quantizer and the HiFi decoder with the new language audio. Maybe @erogol can give some hints on that topic.
@stlohrey thank you very much for all your help :) I understand I need to wait for some help from the developers.
@erogol I know you're working hard, and I really appreciate it. Any hints would be great on how to add a new language or fine-tune in Greek with the power of the new XTTS model. I know Greek is already supported in VITS; I just need to know: any chance Greek could be supported in XTTS in the next release? Thanks again for your amazing work.
config_dataset = BaseDatasetConfig(
    formatter="ljspeech",
    dataset_name="ljspeech",
    path="/home/lpc/appPython/ttsNew/lora",
    meta_file_train="metadata_fix.csv",
    language="en",  # <<< if I use the "grc" or "el" code it doesn't work
)
I assigned meta_file_val and could run with a Chinese dataset; maybe you can try that.
@AIFSH thank you, I tested it like you say and it doesn't work; same problem again.
You can train over any other language that is closest to yours; over time it mostly clears the accent. I was kind of able to add Bulgarian [bg] by editing TTS/tts/configs/xtts_config.py, then TTS/tts/layers/xtts/tokenizer.py, and adding it in vocab.json. The problem is that it seems to still use some of the existing language and keeps a slight accent. We need a language ID with as clean an accent as possible. Maybe there is one, if the developers can tell us? I'm not sure if we will ever get a train-from-zero model.
@brambox hi, thank you for the info and your help.
1. I go to TTS/tts/configs/xtts_config.py and add the language in the dictionary there (see the sketch after this list).
2. I go to TTS/tts/layers/xtts/tokenizer.py; I change a lot of stuff here and everything looks OK.
3. I go to vocab.json in the downloaded v2 model folder, and when I try to change something there I get the len error, because I changed the dictionary. So I can't change anything there, and I'm stuck here.
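A minimal sketch of what step 1 amounts to at runtime (hypothetical; it mutates the in-memory config rather than editing the file, and assumes the languages field exists on XttsConfig):

from TTS.tts.configs.xtts_config import XttsConfig

config = XttsConfig()
if "el" not in config.languages:    # assumed field, per the file edit above
    config.languages.append("el")   # register Greek ("el") as a known code
print(config.languages)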
Now, here is what I found and tried to do, without changing any file of the original script.
I just use the symbols already registered inside vocab.json:
"a": 14,
"b": 15,
"c": 16,
etc...
and I make a simple dictionary map to replace the characters I need in the text I use:
"α" : "a"
"β" : "v"
"γ" : "g"
etc...
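Put together, the replacement idea looks something like this (a minimal sketch; the mapping shown is illustrative, not a complete transliteration table):

# Illustrative character map; extend it to cover the full Greek alphabet
GREEK_TO_LATIN = {
    "α": "a",
    "β": "v",
    "γ": "g",
}

def to_vocab_symbols(text: str) -> str:
    # Replace each Greek character with a symbol already present in vocab.json,
    # leaving everything else (punctuation, Latin letters) untouched
    return "".join(GREEK_TO_LATIN.get(ch, ch) for ch in text)

print(to_vocab_symbols("αβγ!"))  # -> "avg!"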
I wonder how you changed vocab.json, because when I try to add or change anything I get an error; let me know.
And yes, it would be great if one of the developers gave more help.
Try just adding the language without anything more; put some random ID, start training, and see the result that way.
@brambox you mean to go inside vocab.json and add the language? Can you just write me what you added and where (which line), so I understand better? If I change or add something I get an error. I use XTTS v2:
size mismatch for gpt.text_embedding.weight: copying a param with shape torch.Size([6153, 1024]) from checkpoint, the shape in current model is torch.Size([6154, 1024]).
size mismatch for gpt.text_head.weight: copying a param with shape torch.Size([6153, 1024]) from checkpoint, the shape in current model is torch.Size([6154, 1024]).
size mismatch for gpt.text_head.bias: copying a param with shape torch.Size([6153]) from checkpoint, the shape in current model is torch.Size([6154]).
It seems we are kind of doing a workaround to use a new language, but it will still need to train over an existing one. So if, for example, you make a change at { "id": 6152, "special": true, "content": "[ko]", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false }, then go to "[ko]": 6152 and change both to "gr", the training will start. Now you can use any of the other language IDs the same way, so just use the one closest to your language's accent. Maybe you can also add custom symbols, but I haven't tested it.
Also, if you remove "[ko]": 6152, for example, you can use any ID 6152+ here: { "id": 6152, "special": true, ... }, but it will still use the language you removed, even if the ID is different.
We need the developers to actually support adding a completely new language, or at least give us some accent-free IDs we can train over.
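A sketch of the [ko] to [gr] swap described above (hedged: it assumes vocab.json follows the layout quoted in this message, with an added_tokens list and a model.vocab mapping):

import json

with open("vocab.json", encoding="utf-8") as f:
    vocab = json.load(f)

# Rename the special token in the added-tokens list, keeping id 6152
for tok in vocab.get("added_tokens", []):
    if tok.get("content") == "[ko]":
        tok["content"] = "[gr]"

# Mirror the rename in the main vocabulary so both places agree
model_vocab = vocab["model"]["vocab"]
if "[ko]" in model_vocab:
    model_vocab["[gr]"] = model_vocab.pop("[ko]")

with open("vocab.json", "w", encoding="utf-8") as f:
    json.dump(vocab, f, ensure_ascii=False, indent=2)

Keeping the ID unchanged is what avoids the size-mismatch errors shown above, since the embedding matrix keeps the same number of rows.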
OK, this works: you can replace the language like you say, and training starts. But you also need to replace the characters in metadata.csv with ones already registered in vocab.json; otherwise you get the error len(DataLoader) returns 0, because there are no Greek characters in vocab.json. And if you try to add or replace existing symbols, for example "a": 14 with "α": 14, you get an error. So I guess there is no need to change anything in the script; in my case it doesn't work, so the only thing I can do is replace the characters in metadata.csv with English characters, and training starts with lang "en", like in my previous messages where I used unidecode.
I wonder, with the method you tried, how are the results in your case? Is the language accent clear, how many hours of dataset did you use, and can you compare with VITS to see if this works?
I guess we need to wait for someone to help with this, like you say.
Overall it's pretty decent. The accent mostly clears over time, just a little bit remains.
The biggest problem I find for Bulgarian is that word stress is sometimes applied wrong, even if the word is in the dataset. Sometimes it says it right, sometimes wrong, and sometimes it's always wrong, especially if the word is not in the dataset.
I wish there were some way to hard-force the stress position. We also have words that are written the same but can mean different things based on stress, so some kind of easier control for it would be a big help.
It is definitely much more natural than VITS.
I tested different sizes but I stuck to around 2.5 to 3 hours. And I noticed that longer datasets do not always result in a better model.
Yes, same here: more hours make things worse. I tried with 4 hours and then 12 hours, and I get a lot of repetition of the same phrases, sometimes an English accent, sometimes very slow or fast speech, and noise. So as I see it, this doesn't work for my language, and it looks like the same in yours, as you say. And yes, when it works, it sounds more natural than the VITS model. I don't know if it's somehow possible to make this work in this version of XTTS or if we need to wait for the next release; I hope someone helps more with this.
I have some questions, to know if I'm doing things correctly:
1. How many hours are needed to fine-tune? (I tried 4-12 hours.)
2. What minimum and maximum audio durations can be used? (I use min 1 s, max 11 s, 22050 Hz mono, with clear speech and no noise.)
3. About how many steps are needed? (About 300k, roughly 1 day of training in my case.)
4. Does using multiple speakers help the training, or is it better to use only a single speaker? What is best to use? (I tried both single and multi-speaker.)
5. Can I replace the symbols with English characters, like unidecode does? (I use the unidecode method and also replace some characters to work correctly in my language.)
6. Which model is best to use, XTTS v1 or v2? (I use only the latest model, v2.)
7. Can I train from the beginning without using any checkpoint, like in VITS?
Thank you for your time.
Actually, I ran into this issue with an English dataset. It always filters down to 0.
I ran into the same problem as well. I tried the manual fixes above, with no luck.
Hi @lpscr,
Hi @arbianqx,
This message "> Total eval samples after filtering: 0" indicates that you don't have any eval samples that meet the training requeriments. It can be caused by three reasons:
max_wav_len
and max_text_len
defined on the recipe (https://github.com/coqui-ai/TTS/blob/dev/recipes/ljspeech/xtts_v2/train_gpt_xtts.py#L86C1-L87C29). Note that you do not recommend the changes of these values for fine-tuning;max_wav_length
and max_text_length
. In all these scenarios, you need to change (or create) your eval CSV to meet the requirements for training.
In your case it looks like the issue is 3: you didn't provide an eval CSV, and all the samples automatically selected are too long, so all the evaluation samples get filtered out. I would recommend you create an eval CSV file using part of your data (like 15%), making sure that the samples respect the max_wav_length (11 seconds) and max_text_length (200 tokens) limits.
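A minimal sketch of building such an eval CSV (assumptions: pipe-separated LJSpeech-style metadata, clips under wavs/, a plain character count as a rough stand-in for the 200-token text limit, and the stdlib wave module for durations):

import csv
import random
import wave

def wav_seconds(path):
    with wave.open(path, "rb") as w:
        return w.getnframes() / w.getframerate()

rows = []
with open("metadata.csv", encoding="utf-8") as f:
    for row in csv.reader(f, delimiter="|"):
        # Keep only samples within the wav-length and (approximate) text-length limits
        if wav_seconds(f"wavs/{row[0]}.wav") <= 11 and len(row[-1]) <= 200:
            rows.append(row)

random.shuffle(rows)
split = int(len(rows) * 0.85)  # keep ~15% of the data for evaluation

for name, subset in (("metadata_train.csv", rows[:split]),
                     ("metadata_eval.csv", rows[split:])):
    with open(name, "w", encoding="utf-8", newline="") as f:
        csv.writer(f, delimiter="|").writerows(subset)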
Alternatively, the PR https://github.com/coqui-ai/TTS/pull/3296 implements a Gradio demo for data processing plus training and inference for the XTTS model. On the PR we also have a Google Colab, and soon we will publish a video showing how to use the demo.
Feel free to reopen if the comment above doesn't help.
@erogol @Edresson Hi, I intend to fine-tune xtts_v2 for the Persian language. According to the comment above, 2-4 hours of data is OK to fine-tune the model, but you didn't answer whether it is fine or not. Can you advise me ASAP? This is very urgent for me: https://github.com/coqui-ai/TTS/issues/3229#issuecomment-1817953283
Describe the bug
Hi, first I want to thank you for all your amazing work!
Please, if you have a little time, check this; I also have a notebook that makes the problem easy to debug.
I followed the steps of the recipes for XTTS v1 and XTTS v2:
https://github.com/coqui-ai/TTS/tree/dev/recipes/ljspeech/xtts_v1
https://github.com/coqui-ai/TTS/tree/dev/recipes/ljspeech/xtts_v2
AssertionError: ❗ len(DataLoader) returns 0. Make sure your dataset is not empty or len(dataset) > 0.
If I test with VITS or Glow-TTS, it works just fine without any problem and I can complete the training:
https://github.com/coqui-ai/TTS/tree/dev/recipes/ljspeech/vits_tts
To Reproduce
I have created a simple notebook that makes it easy to see the problem; I have also fixed the dataset I use, from Kaggle:
testTTSV1_2.zip
Expected behavior
No response
Logs
Environment
Additional context
No response