Difference between Phoneme and Text tokenizer

ex3ndr / supervoice-gpt

GPT-style network for phonemization with durations of text


Difference between Phoneme and Text tokenizer #2

Open rishikksh20 opened 7 months ago

rishikksh20 commented 7 months ago

Hi @ex3ndr, I checked out your code here: https://github.com/ex3ndr/supervoice-gpt/blob/master/train_tokenizer.py I saw that you tried two trainings, one with text and the other with phonemes. Is there any specific reason you ultimately went with text rather than phoneme tokenization?

ex3ndr commented 7 months ago

I just haven't found a good phonemizer compatible with the specific IPA subset that the main network is trained on. I tried to use espeak, but its phonemes are different from the Montreal Forced Aligner ones. I decided to just train an encoder-decoder style GPT to convert text to phonemes and durations in a single task.
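The mismatch described above can be checked mechanically before choosing a phonemizer. The following is only a minimal sketch: the function name `out_of_inventory` and both symbol lists are hypothetical illustrations, not the actual espeak or Montreal Forced Aligner inventories.

```python
def out_of_inventory(phonemes, inventory):
    """Return the symbols in `phonemes` that are missing from `inventory`,
    in first-seen order and without duplicates."""
    missing = []
    for symbol in phonemes:
        if symbol not in inventory and symbol not in missing:
            missing.append(symbol)
    return missing


# Hypothetical data for illustration only: a toy target inventory and a toy
# phonemizer output that uses a different symbol for one diphthong.
target_inventory = {"h", "ə", "l", "oʊ"}
phonemizer_output = ["h", "ə", "l", "əʊ"]

print(out_of_inventory(phonemizer_output, target_inventory))  # ['əʊ']
```

Any symbol reported here would have to be remapped or retrained on before the phonemizer output could feed a network trained on the target inventory.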

rishikksh20 commented 7 months ago

I trained MFA on espeak phonemes. If you'd like, I can share my English MFA model trained on espeak IPA.

rishikksh20 commented 7 months ago

@ex3ndr have you tried Gaussian upsampling as the length regulator for the sampled durations?

ex3ndr commented 7 months ago

I did not. Do you think this would improve something? I never need to fit tokens to a specific timeframe in my setup.
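For readers unfamiliar with the term, Gaussian upsampling replaces the hard "repeat each token for its duration" length regulator with a soft one: every output frame is a weighted average of all token vectors, with weights given by a Gaussian centered on each token's midpoint in time. The sketch below is a plain-Python illustration, not code from this repository. The function name `gaussian_upsample` and the single fixed `sigma` are assumptions; published variants typically predict a width per token.

```python
import math


def gaussian_upsample(features, durations, sigma=1.0):
    """Upsample per-token vectors to per-frame vectors.

    features:  list of per-token vectors (lists of floats, equal length)
    durations: list of per-token durations in frames (positive numbers)
    sigma:     Gaussian width in frames (assumed fixed here for simplicity)
    """
    if len(features) != len(durations):
        raise ValueError("features and durations must have the same length")
    if not features:
        return []

    # Center of each token on the frame axis: its end position minus half
    # of its duration.
    centers = []
    end = 0.0
    for d in durations:
        end += d
        centers.append(end - d / 2.0)

    total_frames = int(round(end))
    dim = len(features[0])
    frames = []
    for t in range(total_frames):
        pos = t + 0.5  # midpoint of frame t
        # Unnormalized log-weights of every token for this frame.
        logits = [-((pos - c) ** 2) / (2.0 * sigma ** 2) for c in centers]
        peak = max(logits)  # subtract the maximum for numerical stability
        weights = [math.exp(l - peak) for l in logits]
        norm = sum(weights)
        # Weighted average of the token vectors.
        frames.append([
            sum(w * f[k] for w, f in zip(weights, features)) / norm
            for k in range(dim)
        ])
    return frames


# Two tokens of two frames each: the output moves smoothly from the first
# token's value toward the second instead of jumping at the boundary.
frames = gaussian_upsample([[0.0], [1.0]], [2, 2], sigma=1.0)
print(len(frames))  # 4
```

A smaller `sigma` makes the result approach hard repetition, while a larger one blends neighboring tokens more strongly.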

rishikksh20 commented 5 months ago

Hi @ex3ndr, have you tested this model as an auto-regressive duration predictor only? That is, I give plain text as input and predict phonemes along with durations, but not pitch. Logically it should work, but I am skeptical: pitch determines prosody and duration also affects prosody, so if I remove pitch it might not work accurately. What are your thoughts?

ex3ndr commented 5 months ago

Yes, my first version was duration-only. This worked well; I added coarse pitch later to improve prosody.

rishikksh20 commented 5 months ago

ok thanks @ex3ndr