Open rishikksh20 opened 7 months ago
I just haven't found a good phonemizer compatible with the specific IPA subset that the main network is trained on. I tried espeak, but its phonemes are different from the Montreal Forced Aligner ones. I decided to just train an encoder-decoder style GPT to convert text to phonemes and durations in a single task.
I trained MFA on espeak phonemes. If you want, I can share an English MFA model trained on espeak IPA.
@ex3ndr have you tried Gaussian upsampling for the length regulator, using the sampled durations?
I did not. Do you think it would improve something? I never need to fit tokens to a specific time frame in my setup.
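For context, here is a minimal sketch of what Gaussian upsampling as a length regulator usually means: instead of repeating each token vector `d` times, every output frame is a softmax-weighted mix of all token vectors, with weights given by a Gaussian centered on each token's midpoint. This is not code from this repo. The function name `gaussian_upsample`, the fixed `sigma`, and the plain-list inputs are assumptions made for illustration; published variants typically predict a per-token `sigma` and operate on batched tensors.

```python
import math


def gaussian_upsample(hidden, durations, sigma=1.0):
    """Upsample per-token vectors to per-frame vectors with Gaussian weights.

    hidden:    list of token vectors (each a list of floats)
    durations: list of token durations in frames (floats allowed)
    sigma:     width of each token's Gaussian (fixed here; an assumption)
    """
    # Token centers: cumulative end position minus half the duration.
    centers = []
    total = 0.0
    for d in durations:
        centers.append(total + d / 2.0)
        total += d

    n_frames = int(round(total))
    dim = len(hidden[0])
    frames = []
    for t in range(n_frames):
        pos = t + 0.5  # evaluate at the middle of the frame
        # Unnormalized log-weights of a Gaussian around each token center.
        logits = [-((pos - c) ** 2) / (2.0 * sigma ** 2) for c in centers]
        # Numerically stable softmax over tokens.
        m = max(logits)
        weights = [math.exp(l - m) for l in logits]
        z = sum(weights)
        weights = [w / z for w in weights]
        # Frame vector is the weighted sum of token vectors.
        frames.append(
            [sum(w * h[k] for w, h in zip(weights, hidden)) for k in range(dim)]
        )
    return frames


# Two tokens with durations of 2 and 3 frames give 5 output frames.
frames = gaussian_upsample([[1.0, 0.0], [0.0, 1.0]], [2.0, 3.0], sigma=1.0)
```

Because the weights are smooth in the durations, this form is differentiable with respect to them, which is the usual argument for it over hard repetition. It matters mostly when frames must be aligned to a fixed-length target.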
Hi @ex3ndr, have you tested this model as an auto-regressive duration predictor only? That is, I give it plain text as input and it predicts phonemes along with durations, but not pitch. In principle it should simply work, but I am skeptical: pitch determines prosody and duration also affects prosody, so if I remove pitch it might not work as accurately. What are your thoughts?
Yes, my first version was duration-only. It worked well; I added coarse pitch later to improve prosody.
ok thanks @ex3ndr
Hi @ex3ndr, I checked out your code here: https://github.com/ex3ndr/supervoice-gpt/blob/master/train_tokenizer.py I saw that you tried two training setups, one with text and the other with phonemes. Is there any specific reason you ultimately went with text rather than phoneme tokenization?