SWivid / F5-TTS

Official code for "F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching"
https://arxiv.org/abs/2410.06885
MIT License

Attempting to train a Cyrillic-based language (Mongolian) and findings #464

Open amarbayar opened 16 hours ago

amarbayar commented 16 hours ago

Checks

Question details

After reading about some successful cases and promising results (like the French model released in https://github.com/SWivid/F5-TTS/issues/434 by @RASPIAUDIO, and the YouTube videos posted by @JarodMica), I decided to try training a model for the Mongolian language. Mongolian uses the Cyrillic script, so its alphabet is similar to Russian but with two additional letters; the pronunciation differs, though. Coming from a software engineering background without much experience building ML models, this has posed its fair share of challenges. I am creating this issue to describe what I have tried and to ask for guidance from folks who might be able to point me in the right direction, as I am not seeing satisfactory results so far.

Hardware specs:

Dataset:

Gradio App

I followed the Installation steps, downloaded and installed the dependencies/packages/libraries, and ran the Gradio app for fine-tuning.

The following are various scenarios I have tried.

Results so far

I have been testing the checkpoints at various intervals and haven't gotten satisfactory results. The voice does sound very similar to the speaker provided in the dataset. But:

  1. The words are gibberish.
  2. The speech is too fast (with no breaks, pauses, etc.), as if it were reciting a tongue-twister.

Loss

I think the results I am seeing align well with the loss graph on TensorBoard. Below is from my latest training attempt: batch size of 19,200 per GPU with batch size type "frame", 32 max samples, 50 epochs, 300 warmup updates, save per updates of 400, last per steps of 200, fp16, the F5TTS_Base model, the char tokenizer, and a learning rate of 0.00001.
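Purely as a readability aid, here are those settings grouped into a plain Python dict; the keys are informal labels for the values above, not necessarily the exact argument names used by the finetuning script.

finetune_settings = {
    "model": "F5TTS_Base",
    "tokenizer": "char",
    "learning_rate": 1e-5,
    "batch_size_type": "frame",
    "batch_size_per_gpu": 19200,   # measured in frames, given the "frame" batch size type
    "max_samples": 32,
    "epochs": 50,
    "warmup_updates": 300,
    "save_per_updates": 400,
    "last_per_steps": 200,
    "precision": "fp16",
}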

Loss:

[TensorBoard screenshot: training loss curve]

Learning rate decay:

[TensorBoard screenshot: learning rate schedule]

Reference audio: https://whyp.it/tracks/225389/af143?token=s8DdA
Checkpoint inference: https://whyp.it/tracks/225391/af143-ckpt?token=i4bi0

Thoughts / Questions

I have read in a few issues that, as @SWivid for example mentions, we might get intelligible audio at 200K steps or beyond. But I have also seen examples, such as @RASPIAUDIO's YouTube tutorial, where roughly 200-300 samples produced a sample checkpoint that was not too bad, showing enough improvement to justify collecting more data and spending more time training. In my case, I am not seeing much improvement even with nearly 4,000 samples (I know it is small) and 6 hours of a single speaker in studio quality. If I at least heard some words pronounced correctly or close to it, with proper gaps and pauses, I could then look at gathering more data, etc.

So, could it be that:

  1. There is an inherent issue with the language itself?
  2. There is an issue with Cyrillic specifically? (I do see that vocab.txt already contains Cyrillic symbols.)
  3. I should not expect to see the loss decrease by the 7,000-step mark and should just aim for at least 200K steps?
  4. I should try different settings? (I did try the Auto Settings as well.)
  5. I should keep adding more and more data and iterating?

Any of these could be a shot in the dark, so I am trying to understand whether there is a systematic way to approach this that lets me rule hypotheses in or out early on.

SWivid commented 14 hours ago

200K steps or beyond

That figure is for training from scratch, i.e. pretraining; finetuning should let you hear something after fewer steps. But since Cyrillic is somewhat far from the pretrained distribution (French and English, for example, are close to each other in some ways), it will take more time to learn the pronunciation.

Possible improvements:

  1. grapheme-to-phoneme tokenization
  2. change the tokenizer so the letters of a word stay together, e.g. ["Г", "э", "р", "э", "л", " ", "с", ...] (see the sketch after this list)
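A minimal sketch of what option 2 could look like: plain character-level tokenization that emits one token per character and only keeps the spaces actually present in the text, rather than injecting one between every letter. The function name and the example word ending (everything after "с") are made up for illustration.

def tokenize_chars(text: str) -> list[str]:
    # one token per character; no artificial spaces inserted between letters
    return list(text)

print(tokenize_chars("Гэрэл сайхан"))
# ['Г', 'э', 'р', 'э', 'л', ' ', 'с', 'а', 'й', 'х', 'а', 'н']
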
JarodMica commented 13 hours ago
  • Tokenizer example:

[' ', 'н', ' ', 'э', ' ', 'г', ' ', 'и', ' ', 'й', ' ', 'н', ' ', 'х', ' ', ' ', 'н', ' ', 'ь', ' ', ' ', 'н', ' ', 'э', ' ', 'р', ' ', ' ', 'а', ' ', 'д', ' ', 'а', ',', ' ', ' ', 'н', ' ', 'ө', ' ', 'г', ' ', 'ө', ' ', 'ө', ' ', 'г', ' ', 'и', ' ', 'й', ' ', 'н', ' ', 'х', ' ', ' ', 'н', ' ', 'ь', ' ', ' ', 'н', ' ', 'э', ' ', 'р', ' ', ' ', 'з', ' ', 'и', ' ', 'л', ' ', 'л', ' ', 'а', '.']

This is probably 99% the issue right here. You'll want to modify convert_char_to_pinyin so it does not add spaces between Cyrillic characters. I saw the same issue with jumbled words in Japanese, because the function wants to add the extra space:

Sample modification:

import jieba                              # Chinese word segmentation
from pypinyin import lazy_pinyin, Style   # Chinese character -> pinyin conversion


def convert_char_to_pinyin(text_list, polyphone=True):
    final_text_list = []
    # normalize curly quotes and semicolons
    zh_quote_trans = str.maketrans({'“': '"', '”': '"', '‘': "'", '’': "'"})
    custom_trans = str.maketrans({';': ','})

    def is_japanese(c):
        return (
            '\u3040' <= c <= '\u309F' or  # Hiragana
            '\u30A0' <= c <= '\u30FF' or  # Katakana
            '\uFF66' <= c <= '\uFF9F'     # Half-width Katakana
        )

    for text in text_list:
        char_list = []
        text = text.translate(zh_quote_trans)
        text = text.translate(custom_trans)
        for seg in jieba.cut(text):
            seg_byte_len = len(seg.encode('utf-8'))
            if seg_byte_len == len(seg):  # pure ASCII segment: keep as-is, word-separated
                if char_list and seg_byte_len > 1 and char_list[-1] not in " :'\"":
                    char_list.append(" ")
                char_list.extend(seg)
            elif polyphone and seg_byte_len == 3 * len(seg):  # all 3-byte chars (CJK/kana): run through lazy_pinyin
                seg_pinyin = lazy_pinyin(seg, style=Style.TONE3, tone_sandhi=True)
                for p in seg_pinyin:
                    if p not in "。,、;:?!《》【】—…":
                        if not char_list or not is_japanese(char_list[-1]):
                            char_list.append(" ")
                    char_list.append(p)
            else:  # mixed or other scripts: handle character by character
                for c in seg:
                    if ord(c) < 256:      # ASCII / Latin-1: keep as-is
                        char_list.append(c)
                    elif is_japanese(c):  # kana: keep as-is, no space inserted
                        char_list.append(c)
                    else:                 # everything else (incl. Cyrillic) falls through here
                        if c not in "。,、;:?!《》【】—…":
                            if not char_list or not is_japanese(char_list[-1]):
                                char_list.append(" ")  # this is the space inserted before each character
                            pinyin = lazy_pinyin(c, style=Style.TONE3, tone_sandhi=True)
                            char_list.extend(pinyin)
                        else:
                            char_list.append(c)
        final_text_list.append(char_list)
    return final_text_list

This only affects inference, so the model should have trained fine; once you adapt it to Cyrillic, I'd guesstimate the output will improve.
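A hedged sketch of what that Cyrillic adaptation might look like, mirroring the is_japanese helper above: treat Cyrillic characters like kana, i.e. append them directly instead of falling into the branch that inserts a space and calls lazy_pinyin. The helper name and exact handling are assumptions, not code from the repo.

def is_cyrillic(c):
    return (
        '\u0400' <= c <= '\u04FF' or  # Cyrillic block (Russian plus Mongolian Өө/Үү)
        '\u0500' <= c <= '\u052F'     # Cyrillic Supplement
    )

# Inside the final `for c in seg:` loop, the per-character handling would then
# become something like:
#     elif is_japanese(c) or is_cyrillic(c):
#         char_list.append(c)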

amarbayar commented 11 hours ago

Okay. Thanks to both of you. Will try in the following order and see what happens. If I run into any issues or make good progress, I will share my findings here.

  1. Remove the logic that adds an extra space between letters/symbols when the text is not Chinese, so that the tokenized text is represented like:

     ["Г", "э", "р", "э", "л", " ", "с", ...]
  2. If this doesn't yield good results, I will try the phoneme approach. I have found https://github.com/xinjli/transphone. The rough idea is to use it to phonemize my dataset text, extract the unique symbols, and add them to the pretrained vocab by extending it. I will also need to implement logic that converts each Cyrillic character to its phoneme so that it can be looked up in vocab.txt. But I will get to that after I experiment with the first/faster approach; a rough sketch of that route is below.
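A rough sketch of what that phoneme route could look like, assuming transphone's read_tokenizer / tokenize API as shown in its README; the language code ('khk' for Khalkha Mongolian), the metadata file names, and the pipe-separated "wav|text" layout are assumptions for illustration, not tested values.

from transphone import read_tokenizer

g2p = read_tokenizer('khk')  # assumed ISO 639-3 code; might need 'mon' or another variant

phoneme_vocab = set()
with open('metadata_phonemized.csv', 'w', encoding='utf-8') as out:
    for line in open('metadata.csv', encoding='utf-8'):
        wav_path, text = line.rstrip('\n').split('|', maxsplit=1)
        phonemes = g2p.tokenize(text)           # list of phoneme symbols for the sentence
        phoneme_vocab.update(phonemes)
        out.write(f"{wav_path}|{' '.join(phonemes)}\n")

# Any symbols not already present in the pretrained vocab.txt would then be
# appended to it (and the new embedding rows initialized) before finetuning.
print(sorted(phoneme_vocab))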