KdaiP / StableTTS

Next-generation TTS model using flow-matching and DiT, inspired by Stable Diffusion 3
MIT License

Adding New Language and IPA Symbols in Model Training #22

Open lpscr opened 1 month ago

lpscr commented 1 month ago

Hi, thank you so much for the amazing repo—it's really very cool!

I'm trying to add a new language, but I encountered an issue with IPA symbols. Specifically, 5 letters are missing. I checked symbols.py and found some symbols were unused. Here’s what I used:

IPA_letters = 'NQabdefghijklmnopstuvwxyzɑæʃʑçɯɪɔɛɹðəɫɥɸʊɾʒθβŋɦ⁼ʰ^#*=ˈˌ→↓↑ '

I used phonemizer to generate phonemes for my language and, in my case, replaced the missing symbols as follows:

"r" : "ɾ",
"ɣ" : "g",
"ɲ" : "h",
"c" : "ɔ",
"ɡ" : "g",
"ʎ" : "ɦ"

Is this the correct method? If not, how can I add more symbols, such as the 5 I need? When I tried to add them, I got an error.
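As a minimal sketch of the workaround described above (not code from the StableTTS repo), the following checks which phonemes fall outside the symbol inventory and applies the substitution table from this post. The function names `missing_symbols` and `apply_replacements` are hypothetical. Note that substitutions of this kind merge distinct sounds onto existing symbols, so they may affect pronunciation.

```python
# Symbol inventory and substitution table copied from the post above.
IPA_LETTERS = 'NQabdefghijklmnopstuvwxyzɑæʃʑçɯɪɔɛɹðəɫɥɸʊɾʒθβŋɦ⁼ʰ^#*=ˈˌ→↓↑ '

REPLACEMENTS = {
    "r": "ɾ",
    "ɣ": "g",
    "ɲ": "h",
    "c": "ɔ",
    "ɡ": "g",  # IPA script g (U+0261) mapped to ASCII g
    "ʎ": "ɦ",
}


def missing_symbols(phonemes, inventory=IPA_LETTERS):
    """Return the set of characters that are not in the symbol inventory."""
    return {ch for ch in phonemes if ch not in inventory}


def apply_replacements(phonemes, table=REPLACEMENTS):
    """Replace each unsupported character with its stand-in symbol."""
    return phonemes.translate(str.maketrans(table))


if __name__ == "__main__":
    text = "ɣrɲa"  # made-up phoneme string for illustration
    print(missing_symbols(text))                      # symbols the inventory lacks
    print(missing_symbols(apply_replacements(text)))  # empty set once all are covered
```

Running the check over the whole phonemized dataset before training shows whether any symbol is still unaccounted for.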

After ensuring all symbols were accounted for, I trained the model using the pre-trained checkpoint_0.pt model for fine-tuning over 40 hours. The model can produce speech, so I guess the symbol replacement worked. However, the timing is off: the speech sounds bad, with incorrect word speed, though the sound quality is okay, with no noise or robotic artifacts.

I used the pre-trained model for fine-tuning by copying it to the checkpoint folder and starting the training. I haven't trained the model from scratch yet, as I think it would take too long.

Do I need to change something in the config, such as the learning rate?

Here are the training results after about 4 hours of training:

Rank 0, Epoch 24, Loss 2.3834068775177

[image]

Do I need more steps to fix the timing issue?

I would really appreciate any help with this!

KdaiP commented 1 month ago

Hi, thanks for trying StableTTS! Could you upload some problematic audio samples? From the screenshots you provided, the losses seem to be relatively normal. Generally speaking, a dur_loss of 0.34 should lead to fairly good duration prediction results.

lpscr commented 1 month ago

Hey, thank you for the quick reply. I noticed the same problem in the original pre-trained model as well. For example, if you check the demo, you'll see that the speech has unnatural pauses within words, and the speed isn't correct.

This issue also occurs in my language, where the speech has similar unnatural pauses within words. You can reproduce it in the demo by selecting 'wav5' and entering some text, for example: "Hey, now you mentioned that actually makes sense. We must have faith that he will return when he is ready."

You'll notice that the speech has unnatural pauses within words, and the speed isn't correct.

I've also attached a sample file for your reference:

test.zip

I have tested a lot of other TTS repos, and I can say this one is the best and my favourite. I ran some tests, and I must say you did a great job! The preparation and training process is very easy to use. This is one of the best sources I've seen: clear, user-friendly, and impressively fast. The only problem I've encountered is the unnatural pauses in the speech. Thank you very much for all your hard work. How can I fix these unnatural pauses?

KdaiP commented 1 month ago

Hi, thank you for sharing your feedback. After listening to the sample you provided, as well as running my own tests, I can confirm that there are indeed unnatural pauses in the speech, especially when using IPA phones. The model needs further updates and enhancements.

Fortunately, this isn't the worst case. In the Stable TTS 1.0 version, there was an occasional issue with reference audio leakage, which resulted in the generated audio being completely unintelligible. This issue was addressed and fixed in version 1.1.

I suspect that the current issue with unnatural pauses may be due to insufficient training data using IPA, especially when compared to languages like Chinese. Over the next few weeks, I'll try to train a new version using English and Japanese IPA symbols to see if this improves the situation. (It is also possible that there are still some bugs in the code.)

Additionally, adjusting inference parameters might help alleviate the problem. For example, set the Temperature to 0.667 and CFG to 1.7.
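To illustrate what these two knobs typically control in flow-matching and diffusion samplers, here is a minimal, framework-free sketch. It assumes that temperature scales the initial Gaussian noise and that CFG uses one common classifier-free guidance formula; the exact formulation in StableTTS may differ, and the function names are hypothetical.

```python
import random


def sample_initial_noise(n, temperature=0.667, seed=0):
    """Draw n Gaussian samples and scale them by the temperature.

    Assumption: a lower temperature shrinks the starting noise,
    which usually makes the output less varied.
    """
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) * temperature for _ in range(n)]


def apply_cfg(cond, uncond, cfg_scale=1.7):
    """One common classifier-free guidance form (an assumption here):
    start from the unconditional prediction and move toward the
    conditional one, overshooting when cfg_scale > 1.
    """
    return [u + cfg_scale * (c - u) for c, u in zip(cond, uncond)]


if __name__ == "__main__":
    noise = sample_initial_noise(4)
    guided = apply_cfg(cond=[1.0, 2.0], uncond=[0.5, 1.0])
    print(noise, guided)
```

With `cfg_scale=1.0` this form returns the conditional prediction unchanged, so values above 1 push the result further toward the conditioning signal.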

Thanks again for your feedback, and I'll keep you updated on any progress!

lpscr commented 1 month ago

Thank you very much! I'm really impressed by how quickly the model learned and how fast it generates speech. I think it's the fastest TTS I've ever seen, so cool! I've been running a lot of tests, and right now I'm training the model from scratch. I'll keep testing and let you know how it goes. I'll also try out the values you gave me.

I'm looking forward to the new model when it's ready. If you can fix this problem and achieve smooth and natural speech, it will be the best TTS!

Thanks again for this great repo—it's amazing!

lpscr commented 1 month ago

Hi @KdaiP, I did some testing, but not much, because I was working on the Gradio UI, which I have here for review, and other things. Might changing the model configuration improve it? What do you think I should change if I want to train the model from scratch? Also, let me know whether about 100 hours of multi-speaker data is enough.

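from dataclasses import dataclass

@dataclass
class ModelConfig:
    hidden_channels: int = 256
    filter_channels: int = 1024
    n_heads: int = 4
    n_enc_layers: int = 3
    n_dec_layers: int = 6
    kernel_size: int = 3
    p_dropout: float = 0.1  # dropout probability, so annotated as float rather than int
    gin_channels: int = 256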

thank you

KdaiP commented 1 month ago

Hi, for a 100-hour dataset, the default parameters should be sufficient. However, if you're planning to scale up the dataset, say to around 2000 hours, you could consider increasing the hidden_channels, filter_channels, and n_dec_layers. Generally, models with around 70-80M parameters tend to perform well.
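As a hedged sketch of that suggestion, the snippet below derives a larger configuration from the defaults quoted earlier in the thread while leaving `gin_channels` untouched. The enlarged values (384, 1536, 9) are purely illustrative and were not recommended by the maintainer; the helper `scaled_up` is hypothetical, and the divisibility check assumes standard multi-head attention, where the hidden size is split evenly across heads.

```python
from dataclasses import dataclass, replace


@dataclass
class ModelConfig:
    # Defaults as quoted in the thread.
    hidden_channels: int = 256
    filter_channels: int = 1024
    n_heads: int = 4
    n_enc_layers: int = 3
    n_dec_layers: int = 6
    kernel_size: int = 3
    p_dropout: float = 0.1
    gin_channels: int = 256


def scaled_up(base, hidden_channels, filter_channels, n_dec_layers):
    """Return a copy of `base` with a wider and deeper model; other fields are kept."""
    if hidden_channels % base.n_heads != 0:
        raise ValueError("hidden_channels should be divisible by n_heads")
    return replace(
        base,
        hidden_channels=hidden_channels,
        filter_channels=filter_channels,
        n_dec_layers=n_dec_layers,
    )


if __name__ == "__main__":
    # Illustrative values only; tune against the actual parameter count.
    large = scaled_up(ModelConfig(), hidden_channels=384,
                      filter_channels=1536, n_dec_layers=9)
    print(large)
```

Counting the parameters of the instantiated model is the reliable way to check whether a given setting lands near the 70-80M range mentioned above.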

I wouldn't recommend increasing the gin_channels, as larger values might cause reference audio leakage, leading to issues with pronunciation and prosody.

Hope this helps!

lpscr commented 1 month ago

I've finished setting up the Gradio web UI. I think everything looks good after my last update, and now I'm focusing on gin_channels.

Do I need to resample my audio? In the preprocess.py file, resampling is set to false for the dataset, but what if the audio is not 44100 Hz? Does it need to be resampled? I also need to know so that I can remove it from the web UI.

I use a sample rate of 24,000 Hz, a minimum length of 1 second, and a maximum length of 10 seconds. There are about 54,000 samples.

Does using more 10-second samples improve training? I mean, should I use only files that are 7-10 seconds long instead of shorter files?
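Whether StableTTS needs the audio resampled is for the maintainer to confirm, but the dataset facts in this comment can be checked up front. The sketch below uses only Python's standard `wave` module (so it handles PCM WAV files only) to report which files differ from a target sample rate, which fall outside the 1-10 second range, and the total duration in hours. The function names and the default target of 44100 Hz are assumptions taken from the question, not from the repo.

```python
import wave


def wav_info(path):
    """Return (sample_rate, duration_in_seconds) for a PCM WAV file."""
    with wave.open(path, "rb") as wf:
        sample_rate = wf.getframerate()
        duration = wf.getnframes() / sample_rate
    return sample_rate, duration


def scan_dataset(paths, target_sr=44100, min_s=1.0, max_s=10.0):
    """Summarize sample-rate mismatches, out-of-range clips, and total hours."""
    needs_resample, out_of_range, total_seconds = [], [], 0.0
    for path in paths:
        sample_rate, duration = wav_info(path)
        total_seconds += duration
        if sample_rate != target_sr:
            needs_resample.append(path)
        if not (min_s <= duration <= max_s):
            out_of_range.append(path)
    return {
        "needs_resample": needs_resample,
        "out_of_range": out_of_range,
        "hours": total_seconds / 3600,
    }
```

For scale, about 54,000 clips with an average length near 6.7 seconds would come to roughly 100 hours, which matches the dataset size mentioned earlier in the thread.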