cmusphinx / g2p-seq2seq

G2P with Tensorflow

Generate dict from numbers #125

Closed gitmikoy closed 6 years ago

gitmikoy commented 6 years ago

How do I generate dictionary entries for numbers? For example, I run g2p-seq2seq --interactive --model_dir my/model and enter:

1 30

The output is: Invalid Symbol.

nurtas-m commented 6 years ago

@gitmikoy The simplest way is to train a new model on data that includes numbers and their pronunciations in the training dictionary. In this case you have to add as many examples of numbers and their pronunciations as possible to your training dictionary:

    1 W AH N
    2 T UW
    35 TH ER T IY F AY V
    796 S EH V AH N HH AH N D R AH D AH N D N AY N T IY S IH K S
    ...

Keep in mind, however, that there is a limit on the decoding sequence length (by default, max_length=30). The longer the maximum sequence length, the worse the decoding performance.
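To get a feel for how quickly number pronunciations approach that limit, here is a minimal sketch that counts the phonemes in dictionary lines. It assumes the limit applies to the number of phoneme tokens, which this thread does not confirm; `MAX_LENGTH`, `split_entry` and `too_long` are hypothetical names for illustration, not part of g2p-seq2seq.

```python
# Default quoted in this thread. Treating it as a count of phoneme tokens
# is an assumption made for this sketch.
MAX_LENGTH = 30


def split_entry(line):
    """Split a dictionary line 'WORD PH1 PH2 ...' into (word, [phonemes])."""
    word, *phonemes = line.split()
    return word, phonemes


def too_long(line, limit=MAX_LENGTH):
    """Return True if the pronunciation has more phonemes than the limit."""
    _, phonemes = split_entry(line)
    return len(phonemes) > limit


entries = [
    "1 W AH N",
    "35 TH ER T IY F AY V",
    "796 S EH V AH N HH AH N D R AH D AH N D N AY N T IY S IH K S",
]

for entry in entries:
    word, phonemes = split_entry(entry)
    status = "too long" if too_long(entry) else "ok"
    print(word, len(phonemes), status)
```

A three-digit number already uses 24 phonemes here, so under this assumption numbers in the thousands could exceed the default.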

A somewhat more complicated, but more correct, way to solve this problem is to pre-process numbers before passing them to inference. You need to implement a module that converts all numbers (integers, ordinals, fractional numbers) into their spelled-out form in your language. In this case, your model will be much more reliable and accurate. Also, you don't need to add every possible integer to your dictionary, only the number words:

    one W AH N
    two T UW
    thirty TH ER T IY
    hundred HH AH N D R AH D
    thousand TH AW Z AH N D
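A pre-processing step along these lines might look like the sketch below. It is only an illustration under stated assumptions: it handles English non-negative integers below one million, ignores ordinals and fractions, and the function names (`below_thousand`, `number_to_words`, `spell_out_numbers`) are hypothetical, not part of g2p-seq2seq.

```python
import re

ONES = [
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen",
]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]


def below_thousand(n):
    """Spell out an integer in the range 0..999 as a list of words."""
    words = []
    if n >= 100:
        words += [ONES[n // 100], "hundred"]
        n %= 100
        if n:
            words.append("and")  # "seven hundred and ninety six"
    if n >= 20:
        words.append(TENS[n // 10])
        n %= 10
        if n:
            words.append(ONES[n])
    elif n or not words:
        # Covers 1..19, and 0 when nothing has been emitted yet.
        words.append(ONES[n])
    return words


def number_to_words(n):
    """Spell out an integer in the range 0..999999 as a list of words."""
    if not 0 <= n < 1_000_000:
        raise ValueError("only 0..999999 is supported in this sketch")
    thousands, rest = divmod(n, 1000)
    if thousands == 0:
        return below_thousand(rest)
    words = below_thousand(thousands) + ["thousand"]
    if rest:
        words += below_thousand(rest)
    return words


def spell_out_numbers(text):
    """Replace every run of digits in text with its spelled-out form."""
    return re.sub(
        r"\d+",
        lambda m: " ".join(number_to_words(int(m.group()))),
        text,
    )


print(spell_out_numbers("1 30"))  # one thirty
```

Each resulting word can then be passed to the model, or looked up in the dictionary, one at a time. Digit runs of seven or more digits raise `ValueError` in this sketch; a real module would need to decide how to read them.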

With the first solution, the trained model will be less reliable, because numbers with completely different pronunciations may differ by only one character:

    25 T W EH N T IY F AY V
    15 F IH F T IY N