amarbayar opened 16 hours ago
> 200K steps or beyond

That figure is for training from scratch, i.e. pretraining; fine-tuning needs fewer steps before you hear something. Since Cyrillic is quite different from the pretrained distribution (whereas French and English, for example, are close to each other in some way), it will take more time to learn the pronunciation.
Possible improvement:

`["Г", "э", "р", "э", "л", " ", "с", ...]`

i.e. keeping words together.
- Tokenizer example:
```
[' ', 'н', ' ', 'э', ' ', 'г', ' ', 'и', ' ', 'й', ' ', 'н', ' ', 'х', ' ', ' ', 'н', ' ', 'ь', ' ', ' ', 'н', ' ', 'э', ' ', 'р', ' ', ' ', 'а', ' ', 'д', ' ', 'а', ',', ' ', ' ', 'н', ' ', 'ө', ' ', 'г', ' ', 'ө', ' ', 'ө', ' ', 'г', ' ', 'и', ' ', 'й', ' ', 'н', ' ', 'х', ' ', ' ', 'н', ' ', 'ь', ' ', ' ', 'н', ' ', 'э', ' ', 'р', ' ', ' ', 'з', ' ', 'и', ' ', 'л', ' ', 'л', ' ', 'а', '.']
```
This is probably 99% the issue right here. You'll want to modify `convert_char_to_pinyin` so it does not add spaces between Cyrillic characters. I saw the same jumbled-words issue with Japanese, because the function wants to add the extra space there too.
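You can confirm it quickly on your own text; a minimal check, assuming the current `f5_tts.model.utils` import path (it may differ in older checkouts):

```python
# Quick repro of the spacing issue -- import path is an assumption.
from f5_tts.model.utils import convert_char_to_pinyin

tokens = convert_char_to_pinyin(["нэгийнх нь нэр ада, нөгөөгийнх нь нэр зилла."])
print(tokens[0][:8])
# Buggy output: [' ', 'н', ' ', 'э', ' ', 'г', ' ', 'и']
# -- a space is injected before every Cyrillic character.
```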
Sample modification:
```python
import jieba
from pypinyin import Style, lazy_pinyin


def convert_char_to_pinyin(text_list, polyphone=True):
    final_text_list = []
    # Normalize fullwidth quotes and map ';' to ','.
    zh_quote_trans = str.maketrans({'“': '"', '”': '"', '‘': "'", '’': "'"})
    custom_trans = str.maketrans({';': ','})

    def is_japanese(c):
        return (
            '\u3040' <= c <= '\u309F'     # Hiragana
            or '\u30A0' <= c <= '\u30FF'  # Katakana
            or '\uFF66' <= c <= '\uFF9F'  # Half-width Katakana
        )

    for text in text_list:
        char_list = []
        text = text.translate(zh_quote_trans)
        text = text.translate(custom_trans)
        for seg in jieba.cut(text):
            seg_byte_len = len(seg.encode('utf-8'))
            if seg_byte_len == len(seg):  # pure ASCII segment
                if char_list and seg_byte_len > 1 and char_list[-1] not in " :'\"":
                    char_list.append(" ")
                char_list.extend(seg)
            elif polyphone and seg_byte_len == 3 * len(seg):  # pure CJK segment
                seg_pinyin = lazy_pinyin(seg, style=Style.TONE3, tone_sandhi=True)
                for p in seg_pinyin:
                    if p not in "。,、;:?!《》【】—…":
                        if not char_list or not is_japanese(char_list[-1]):
                            char_list.append(" ")
                        char_list.append(p)
            else:  # mixed segment, handle character by character
                for c in seg:
                    if ord(c) < 256:
                        char_list.append(c)
                    elif is_japanese(c):
                        # Keep Japanese characters contiguous -- no extra space.
                        char_list.append(c)
                    else:
                        if c not in "。,、;:?!《》【】—…":
                            if not char_list or not is_japanese(char_list[-1]):
                                char_list.append(" ")
                            pinyin = lazy_pinyin(c, style=Style.TONE3, tone_sandhi=True)
                            char_list.extend(pinyin)
                        else:
                            char_list.append(c)
        final_text_list.append(char_list)
    return final_text_list
```
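The Cyrillic analogue of that Japanese branch would be a range check on the Cyrillic Unicode block, routed through the same no-space path; a sketch of my own, not part of the patch above:

```python
# Sketch only: is_cyrillic mirrors is_japanese above.
def is_cyrillic(c):
    # Basic Cyrillic block; includes Mongolian ө (U+04E9) and ү (U+04AF).
    return '\u0400' <= c <= '\u04FF'

# Inside the per-character loop, append Cyrillic directly:
#     elif is_japanese(c) or is_cyrillic(c):
#         char_list.append(c)
# and relax the space-insertion guard the same way:
#     if not char_list or not (is_japanese(char_list[-1]) or is_cyrillic(char_list[-1])):
```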
This only affects inference, so the model should have trained fine; once you adapt the function to handle Cyrillic, I'd guesstimate the output will improve.
Okay. Thanks to both of you. I will try the following, in order, and see what happens. If I run into any issues or make good progress, I will share my findings here.

1. Remove the logic that adds an additional space between letters/symbols when the text is not Chinese, so that the tokenized text is represented like:

   `["Г", "э", "р", "э", "л", " ", "с", ...]`

2. If this doesn't yield good results, try the phoneme approach. I have found https://github.com/xinjli/transphone. The rough idea is to use it to phonemize my dataset text, extract the unique symbols, and extend the pretrained vocab with them. I will also need to implement logic that converts each Cyrillic character to its phoneme so that it can do a lookup from vocab.txt (rough sketch below). But I will get to that after I experiment with the first/faster approach.
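A rough sketch of that phonemization step, assuming transphone's `read_tokenizer` API as shown in its README; the Mongolian language code (`khk`, Halh Mongolian) is a guess I still need to verify against its supported list:

```python
# Sketch: phonemize transcripts with transphone, collect the phoneme
# inventory, and append any new symbols to the pretrained vocab.
# 'dataset_text.txt' and 'vocab.txt' are illustrative paths.
from transphone import read_tokenizer

tokenizer = read_tokenizer("khk")  # language code is an assumption

phoneme_set = set()
with open("dataset_text.txt", encoding="utf-8") as f:
    for line in f:
        phoneme_set.update(tokenizer.tokenize(line.strip()))

# Append only the phonemes the pretrained vocab doesn't already have.
existing = set(open("vocab.txt", encoding="utf-8").read().splitlines())
with open("vocab.txt", "a", encoding="utf-8") as f:
    for sym in sorted(phoneme_set - existing):
        f.write(sym + "\n")
```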
Question details
After reading some successful cases and promising results (like the French model released in https://github.com/SWivid/F5-TTS/issues/434 by @RASPIAUDIO, and the YouTube videos posted by @JarodMica), I decided to try training a model for the Mongolian language. Mongolian is written in Cyrillic, so it has an alphabet similar to Russian but with 2 additional letters; the pronunciation differs, though. Coming from a software engineering background, without much experience building ML models, this has posed its fair share of challenges. I am creating this issue to describe what I have tried and to ask for guidance from folks who might be able to point me in the right direction, as I am not seeing satisfactory results so far.
Hardware specs:
Dataset:
Gradio App
Followed the Installation steps, downloaded and installed the dependencies/packages/libraries, and ran the Gradio app for fine-tuning.
The pretrained vocab was missing the two additional Mongolian letters `ө, ү`, so I extended the vocab with them (roughly as sketched below).
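For reference, the extension itself can be as simple as appending the missing symbols to `vocab.txt` (one token per line); the path below is illustrative, not a fixed location:

```python
# Sketch: append missing Mongolian letters to the tokenizer vocab.
vocab_path = "data/my_dataset_char/vocab.txt"  # illustrative path

existing = set(open(vocab_path, encoding="utf-8").read().splitlines())
with open(vocab_path, "a", encoding="utf-8") as f:
    for sym in ["ө", "ү", "Ө", "Ү"]:  # uppercase variants may be needed too
        if sym not in existing:
            f.write(sym + "\n")
```

Note that if the pretrained checkpoint's text embedding is sized to the old vocab, it also has to be expanded to match the new vocab size.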
The following are the various scenarios I have tried. I trained with the `pinyin` tokenizer and also tried `char`. I used `Auto Settings` as well; I think the batch size per GPU came out as `1745`, which is too small for my GPU? And the epoch count came out to be `781`.

Results so far
I have been testing the checkpoints at various intervals and haven't gotten satisfactory results. The voice does sound very similar to the original speaker provided in the dataset. But:
Loss
I think the results I am seeing align well with the loss graph I see on TensorBoard. Below is from my latest training attempt with a batch size per GPU of `19200`, `32` max samples, `50` epochs, `300` warmup updates, `400` save-per-updates, `200` last-per-steps, `fp16`, the `F5TTS_Base` model, the `char` tokenizer, a learning rate of `0.00001`, and batch size type `frame`.

Loss Rate:
Learning Rate Decay:
Reference audio: https://whyp.it/tracks/225389/af143?token=s8DdA
Checkpoint inference: https://whyp.it/tracks/225391/af143-ckpt?token=i4bi0
Thoughts / Questions
I have read a few issues where @SWivid, for example, mentions that at 200K steps or beyond we might get some intelligible audio. But I have also seen examples where @RASPIAUDIO, in his YouTube video tutorial, had roughly 200-300 samples and the sample checkpoint was not too bad; there was an improvement, which could then be a trigger to collect more data and spend more time training. In my case, I am not seeing much improvement even with nearly 4000 samples (I know it is small) and 6 hours of a single speaker in studio quality. If I could at least hear some words pronounced correctly or close enough, with proper gaps and pauses, then I could look at more datasets, etc.
So, could it be the tokenizer, the dataset size, or the training settings (I used `Auto Settings` as well)? Any of these could be like a shot in the dark, and I am trying to understand whether there is a systematic way to approach this so that I can rule hypotheses in or out early on.