SWivid / F5-TTS

Official code for "F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching"
https://arxiv.org/abs/2410.06885
MIT License

Attempting to train a Cyrillic-based language (Mongolian) and findings #464

Open amarbayar opened 16 hours ago

amarbayar commented 16 hours ago

Checks

Question details

After reading about some successful cases and promising results (like the French model released in https://github.com/SWivid/F5-TTS/issues/434 by @RASPIAUDIO, and the YouTube videos posted by @JarodMica), I decided to try training a model for the Mongolian language. Mongolian uses the Cyrillic script, so its alphabet is similar to Russian but with two additional letters; the pronunciation differs, though. Coming from a software engineering background without much experience building ML models, this has posed its fair share of challenges. I am creating this issue to describe what I have tried and to ask for guidance from folks who might be able to point me in the right direction, as I am not seeing satisfactory results so far.

Hardware specs:

Dataset:

Gradio App

I followed the Installation steps, downloaded and installed the dependencies/packages/libraries, and ran the Gradio app for fine-tuning.

The following are various scenarios I have tried.

Results so far

I have been testing the checkpoints at various intervals and haven't gotten satisfactory results. The voice does sound very similar to the speaker provided in the dataset. But:

  1. The words are gibberish.
  2. The speech is too fast (with no breaks, pauses, etc.), as if it were reciting a tongue-twister.

Loss

I think the results I am seeing align well with the loss graph on TensorBoard. Below is from my latest training attempt: batch size of 19,200 per GPU with batch size type "frame", 32 max samples, 50 epochs, 300 warmup updates, save per updates of 400, last per steps of 200, fp16, the F5TTS_Base model, the char tokenizer, and a learning rate of 0.00001.
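Purely as a readability aid, here are those settings grouped into a plain Python dict; the keys are informal labels for the values above, not necessarily the exact argument names used by the finetuning script.

finetune_settings = {
    "model": "F5TTS_Base",
    "tokenizer": "char",
    "learning_rate": 1e-5,
    "batch_size_type": "frame",
    "batch_size_per_gpu": 19200,   # measured in frames, given the "frame" batch size type
    "max_samples": 32,
    "epochs": 50,
    "warmup_updates": 300,
    "save_per_updates": 400,
    "last_per_steps": 200,
    "precision": "fp16",
}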

Loss:

[TensorBoard screenshot: training loss curve]

Learning rate decay:

[TensorBoard screenshot: learning rate schedule]

Reference audio: https://whyp.it/tracks/225389/af143?token=s8DdA
Checkpoint inference: https://whyp.it/tracks/225391/af143-ckpt?token=i4bi0

Thoughts / Questions

I have read in a few issues that, as @SWivid for example mentions, we might get intelligible audio at 200K steps or beyond. But I have also seen examples, such as @RASPIAUDIO's YouTube tutorial, where roughly 200-300 samples produced a sample checkpoint that was not too bad, showing enough improvement to justify collecting more data and spending more time training. In my case, I am not seeing much improvement even with nearly 4,000 samples (I know it is small) and 6 hours of a single speaker in studio quality. If I at least heard some words pronounced correctly or close to it, with proper gaps and pauses, I could then look at gathering more data, etc.

So, could it be that:

  1. There is an inherent issue with the language itself?
  2. There is an issue with Cyrillic specifically? (I do see that vocab.txt already contains Cyrillic symbols.)
  3. I should not expect to see the loss decrease by the 7,000-step mark and should just aim for at least 200K steps?
  4. I should try different settings? (I did try the Auto Settings as well.)
  5. I should keep adding more and more data and iterating?

Any of these could be a shot in the dark, so I am trying to understand whether there is a systematic way to approach this that lets me rule hypotheses in or out early on.

SWivid commented 14 hours ago

200K steps or beyond

That figure is for training from scratch, i.e. pretraining; finetuning should let you hear something after fewer steps. But since Cyrillic is somewhat far from the pretrained distribution (French and English, for example, are close to each other in some ways), it will take more time to learn the pronunciation.

Possible improvements:

  1. grapheme-to-phoneme tokenization
  2. change the tokenizer so the letters of a word stay together, e.g. ["Г", "э", "р", "э", "л", " ", "с", ...] (see the sketch after this list)
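A minimal sketch of what option 2 could look like: plain character-level tokenization that emits one token per character and only keeps the spaces actually present in the text, rather than injecting one between every letter. The function name and the example word ending (everything after "с") are made up for illustration.

def tokenize_chars(text: str) -> list[str]:
    # one token per character; no artificial spaces inserted between letters
    return list(text)

print(tokenize_chars("Гэрэл сайхан"))
# ['Г', 'э', 'р', 'э', 'л', ' ', 'с', 'а', 'й', 'х', 'а', 'н']
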
JarodMica commented 13 hours ago
  • Tokenizer example:

[' ', 'н', ' ', 'э', ' ', 'г', ' ', 'и', ' ', 'й', ' ', 'н', ' ', 'х', ' ', ' ', 'н', ' ', 'ь', ' ', ' ', 'н', ' ', 'э', ' ', 'р', ' ', ' ', 'а', ' ', 'д', ' ', 'а', ',', ' ', ' ', 'н', ' ', 'ө', ' ', 'г', ' ', 'ө', ' ', 'ө', ' ', 'г', ' ', 'и', ' ', 'й', ' ', 'н', ' ', 'х', ' ', ' ', 'н', ' ', 'ь', ' ', ' ', 'н', ' ', 'э', ' ', 'р', ' ', ' ', 'з', ' ', 'и', ' ', 'л', ' ', 'л', ' ', 'а', '.']

This is probably 99% the issue right here. You'll want to modify convert_char_to_pinyin so it does not add spaces between Cyrillic characters. I saw the same issue with jumbled words in Japanese, because the function wants to add the extra space:

Sample modification:

import jieba                              # Chinese word segmentation
from pypinyin import lazy_pinyin, Style   # Chinese character -> pinyin conversion


def convert_char_to_pinyin(text_list, polyphone=True):
    final_text_list = []
    # normalize curly quotes and semicolons
    zh_quote_trans = str.maketrans({'“': '"', '”': '"', '‘': "'", '’': "'"})
    custom_trans = str.maketrans({';': ','})

    def is_japanese(c):
        return (
            '\u3040' <= c <= '\u309F' or  # Hiragana
            '\u30A0' <= c <= '\u30FF' or  # Katakana
            '\uFF66' <= c <= '\uFF9F'     # Half-width Katakana
        )

    for text in text_list:
        char_list = []
        text = text.translate(zh_quote_trans)
        text = text.translate(custom_trans)
        for seg in jieba.cut(text):
            seg_byte_len = len(seg.encode('utf-8'))
            if seg_byte_len == len(seg):  # pure ASCII segment: keep as-is, word-separated
                if char_list and seg_byte_len > 1 and char_list[-1] not in " :'\"":
                    char_list.append(" ")
                char_list.extend(seg)
            elif polyphone and seg_byte_len == 3 * len(seg):  # all 3-byte chars (CJK/kana): run through lazy_pinyin
                seg_pinyin = lazy_pinyin(seg, style=Style.TONE3, tone_sandhi=True)
                for p in seg_pinyin:
                    if p not in "。,、;:?!《》【】—…":
                        if not char_list or not is_japanese(char_list[-1]):
                            char_list.append(" ")
                    char_list.append(p)
            else:  # mixed or other scripts: handle character by character
                for c in seg:
                    if ord(c) < 256:      # ASCII / Latin-1: keep as-is
                        char_list.append(c)
                    elif is_japanese(c):  # kana: keep as-is, no space inserted
                        char_list.append(c)
                    else:                 # everything else (incl. Cyrillic) falls through here
                        if c not in "。,、;:?!《》【】—…":
                            if not char_list or not is_japanese(char_list[-1]):
                                char_list.append(" ")  # this is the space inserted before each character
                            pinyin = lazy_pinyin(c, style=Style.TONE3, tone_sandhi=True)
                            char_list.extend(pinyin)
                        else:
                            char_list.append(c)
        final_text_list.append(char_list)
    return final_text_list

This only affects inference, so the model should have trained fine; once you adapt it to Cyrillic, I'd guesstimate the output will improve.
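A hedged sketch of what that Cyrillic adaptation might look like, mirroring the is_japanese helper above: treat Cyrillic characters like kana, i.e. append them directly instead of falling into the branch that inserts a space and calls lazy_pinyin. The helper name and exact handling are assumptions, not code from the repo.

def is_cyrillic(c):
    return (
        '\u0400' <= c <= '\u04FF' or  # Cyrillic block (Russian plus Mongolian Өө/Үү)
        '\u0500' <= c <= '\u052F'     # Cyrillic Supplement
    )

# Inside the final `for c in seg:` loop, the per-character handling would then
# become something like:
#     elif is_japanese(c) or is_cyrillic(c):
#         char_list.append(c)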

amarbayar commented 11 hours ago

Okay. Thanks to both of you. Will try in the following order and see what happens. If I run into any issues or make good progress, I will share my findings here.

  1. Remove the logic that adds an extra space between letters/symbols when the text is not Chinese, so that the tokenized text is represented like:

     ["Г", "э", "р", "э", "л", " ", "с", ...]
  2. If this doesn't yield good results, I will try the phoneme approach. I have found https://github.com/xinjli/transphone. The rough idea is to use it to phonemize my dataset text, extract the unique symbols, and add them to the pretrained vocab by extending it. I will also need to implement logic that converts each Cyrillic character to its phoneme so that it can be looked up in vocab.txt. But I will get to that after I experiment with the first/faster approach; a rough sketch of that route is below.
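A rough sketch of what that phoneme route could look like, assuming transphone's read_tokenizer / tokenize API as shown in its README; the language code ('khk' for Khalkha Mongolian), the metadata file names, and the pipe-separated "wav|text" layout are assumptions for illustration, not tested values.

from transphone import read_tokenizer

g2p = read_tokenizer('khk')  # assumed ISO 639-3 code; might need 'mon' or another variant

phoneme_vocab = set()
with open('metadata_phonemized.csv', 'w', encoding='utf-8') as out:
    for line in open('metadata.csv', encoding='utf-8'):
        wav_path, text = line.rstrip('\n').split('|', maxsplit=1)
        phonemes = g2p.tokenize(text)           # list of phoneme symbols for the sentence
        phoneme_vocab.update(phonemes)
        out.write(f"{wav_path}|{' '.join(phonemes)}\n")

# Any symbols not already present in the pretrained vocab.txt would then be
# appended to it (and the new embedding rows initialized) before finetuning.
print(sorted(phoneme_vocab))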