as-ideas / DeepPhonemizer

Grapheme to phoneme conversion with deep learning.

MIT License

343 stars, 37 forks

Question about grapheme set #14

Open kkp15 opened 2 years ago

kkp15 commented 2 years ago

Hello. Thank you for this amazing repository! I have a question though. What’s the easiest way to get a unique grapheme set for a specific language? How did you get that list when training a multilingual model?

cschaefer26 commented 2 years ago

Hi, you can just extract it from the training data: collect the set of characters that occur in it and paste the result into the config. That's basically how I proceeded for the trained models (I filtered out some graphemes, though).
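A minimal sketch of that extraction, assuming the training data is a list of (language, word, phonemes) tuples. The toy samples and the helper name `collect_symbols` are made up for illustration and are not part of the library:

```python
from collections import Counter

# Hypothetical toy samples in (language, word, phonemes) form; replace with real data.
# The phoneme string is written with escapes: t + U+0361 (tie bar) + U+0283 (esh) + a + U+02D0 + j.
train_data = [
    ("hi", "चाय", "t\u0361\u0283a\u02d0j"),
    ("en_us", "young", "j\u028c\u014b"),
]


def collect_symbols(data, lang):
    """Count every grapheme and phoneme symbol that occurs for one language."""
    graphemes, phonemes = Counter(), Counter()
    for sample_lang, text, phons in data:
        if sample_lang != lang:
            continue
        graphemes.update(text)   # a string is counted code point by code point
        phonemes.update(phons)   # works for a string or a list of phoneme symbols
    return graphemes, phonemes


graphemes, phonemes = collect_symbols(train_data, "hi")

# Sorted lists that could be pasted into the config after manual filtering.
text_symbols = sorted(graphemes)
phoneme_symbols = sorted(phonemes)
print(text_symbols)
print(phoneme_symbols)

# The counts help with filtering: very rare symbols are often noise in the data.
print(phonemes.most_common())
```

Keeping the counts instead of a plain set makes it easier to decide which graphemes to filter out.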

skanda1005 commented 2 years ago

Hi @cschaefer26, I want to train the model for Hindi, but I'm not sure how to set up the config file, especially the input and output symbols, because I'm getting an index out of range error. Thanks!

cschaefer26 commented 2 years ago

Hi, you can use the standard config file, but you will have to adjust the language as well as text_symbols and phoneme_symbols according to the symbols that occur in your dataset!
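One way to check the two lists against a dataset is sketched below. It only compares symbols in plain Python; the helper name `find_missing_symbols` and the sample values are hypothetical, and whether a missing symbol is what triggers the index error above is an assumption, not something confirmed in this thread:

```python
def find_missing_symbols(data, text_symbols, phoneme_symbols):
    """Return the symbols that occur in the data but not in the config lists."""
    text_set, phoneme_set = set(text_symbols), set(phoneme_symbols)
    missing_text, missing_phonemes = set(), set()
    for _lang, text, phons in data:
        missing_text |= set(text) - text_set
        # phons may be a string (one symbol per character) or a list of symbols
        missing_phonemes |= set(phons) - phoneme_set
    return missing_text, missing_phonemes


# Hypothetical config values and samples, for illustration only.
text_symbols = list("abc")
phoneme_symbols = ["a", "b"]
data = [("xx", "abd", "abk"), ("xx", "cab", ["b", "a", "sh"])]

missing_text, missing_phonemes = find_missing_symbols(data, text_symbols, phoneme_symbols)
print(sorted(missing_text))       # graphemes to add to text_symbols
print(sorted(missing_phonemes))   # phonemes to add to phoneme_symbols
```

Both results should be empty before training if every symbol in the dataset is meant to be covered by the config.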

skanda1005 commented 2 years ago

Got it working, thanks!

cschaefer26 commented 2 years ago

Nice, let me know if you run into issues.

skanda1005 commented 2 years ago

Hi, so I realized that phonemes in my phoneme set that consist of multiple characters don't get parsed: those multi-char phones are either removed or replaced after preprocessing. Any solutions to this issue?

cschaefer26 commented 2 years ago

Hi, multiple characters shouldn't be a problem, the cmudict model has multi-char phonemes: https://github.com/as-ideas/DeepPhonemizer#:~:text=en_us_cmudict_forward

You can pass each sample as a tuple of [str, str, list], e.g. ('en', 'word', ['p', 'h', 'o', 'neme'])

skanda1005 commented 2 years ago

So, I am training it in Hindi and phones like t͡ʃ and ẽː don't get parsed. I used these as inputs for the tokenizer and there is no output, meaning they don't get tokenized. PS: t͡ʃ is actually 3 chars, not 2. Would that cause a problem?

cschaefer26 commented 2 years ago

No, that should be fine. Actually, your example looks more like there should be three phoneme chars as output instead of a single phoneme instance incorporating all three chars (t͡ʃ). Just make sure the symbols are present in the config (phoneme_symbols, e.g. '͡').

skanda1005 commented 2 years ago

Oh, so should I separate the chars of that phone as 3 different elements in the list? E.g. ['t', '͡', 'ʃ']

cschaefer26 commented 2 years ago

Yes, that's also how the standard config is set. You can then simply provide the phonemized words as strings.
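To see which single characters such a phoneme is made of, here is a small sketch using only the Python standard library. It assumes each Unicode code point is treated as one symbol, as described above, and writes the phonemes with escapes so the code points are unambiguous:

```python
import unicodedata

affricate = "t\u0361\u0283"  # renders as t͡ʃ: t + tie bar + esh

# Iterating over a string yields one code point at a time.
for ch in affricate:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))

# The same split as an explicit list of phoneme symbols.
phoneme_list = list(affricate)
print(len(phoneme_list))

# A nasalized long vowel can be encoded in two ways that look identical on screen.
composed = "\u1ebd\u02d0"      # precomposed e with tilde + length mark
decomposed = "e\u0303\u02d0"   # e + combining tilde + length mark
print(len(composed), len(decomposed))

# Normalizing data and config to one form keeps their symbol sets consistent.
normalized = unicodedata.normalize("NFC", decomposed)
print(normalized == composed)
```

Because ẽː can be two or three code points depending on the encoding, it is worth normalizing the dataset before collecting the symbols for phoneme_symbols.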