Data used for pretrained g2p-seq2seq-cmudict model

vijay120 commented 6 years ago

I want to reproduce the results of the g2p-seq2seq-cmudict model. The README.md mentions the following:

A pretrained model 2-layer LSTM with 512 hidden units is available for download on cmusphinx website. Unpack the model after download. The model is trained on CMU English dictionary

However, https://github.com/cmusphinx/cmudict is a versioned dataset. As of now, it contains data with stress information, which the g2p-seq2seq-cmudict model does not output. I think it was trained on an older version of the dataset without the stress information.

Can you provide a commit hash or link that contains the snapshot of the dataset that was trained to get to the g2p-seq2seq-cmudict model. Can you also publish all the hyperparameter values needed to train the model for g2p-seq2seq-cmudict.

nshmyrev commented 6 years ago

The link is in README

https://sourceforge.net/projects/cmusphinx/files/G2P%20Models/phonetisaurus-cmudict-split.tar.gz

vijay120 commented 6 years ago

Thanks, so I ran the hyperparameters mentioned in the README.txt file in the folder:

python /home/ubuntu/g2p-seq2seq/g2p_seq2seq/g2p.py --train cmudict.dic.train --test cmudict.dic.test --num_layers 2 --size 512 --model model --max_steps 0

Then, when I ran the model, g2p-seq2seq --interactive --model model/, I do not get the same results as the pretrained model. First, the model only accepts uppercase words, whereas the pretrained model accepts both upper and lower case. Second, the phonetic results are different:

word: VENKATACHALAPATHI
pretrained output: V EH K AH T AA K AA L AH P AA TH IY
trained output: V EH N K AH L AA P AA TH AH DH IY

word: LAXMINARAYANA
pretrained output: L AE K S M AH N ER AY AE N AH
trained output: L AE K S M AH N ER AE N AH

Is there a different set of hyperparameters that gives the desired pretrained results?

krishnagovindu commented 6 years ago

Hi can u please help me how to do this g2p seq2seq for cent os

nurtas-m commented 6 years ago

Hello, krishnagovindu! First of all, you need to derive a dictionary for the language for which you want to train g2p_seq2seq model. Then, you need to split dictionary into two files: for training and testing purposes. Each line in the files should contain word (sequence of graphemes) and it's pronunciation (sequence of phonemes): HELLO HH EH L OW BYE B AY ...

The more examples your dictionary contain, the better g2p_seq2seq model you receive after training. For example, for training English model, we apply CMUDICT-PRONALSYL dictionary with more than 100.000 words in training set and more than 10.000 words in test set.

After deriving and preparing datasets, you may start training your own model for your language. It doesn't matter which distributive of Linux you use. Just clone g2p_seq2seq and use it for your need. More info about training and decoding of the trained model, you may find in README.

krishnagovindu commented 6 years ago

can u please share if do u have any video for better understanding. its very much need full to me pls pls

On Wed, Apr 4, 2018 at 6:55 PM, nurtas-m notifications@github.com wrote:

Hello, krishnagovindu! First of all, you need to derive a dictionary for the language for which you want to train g2p_seq2seq model. Then, you need to split dictionary into two files: for training and testing purposes. Each line in the files should contain word (sequence of graphemes) and it's pronunciation (sequence of phonemes): HELLO HH EH L OW BYE B AY ...

The more examples your dictionary contain, the better g2p_seq2seq model you receive after training. For example, for training English model, we apply CMUDICT-PRONALSYL dictionary with more than 100.000 words in training set and more than 10.000 words in test set.

After deriving and preparing datasets, you may start training your own model for your language. It doesn't matter which distributive of Linux you use. Just clone g2p_seq2seq and use it for your need. More info about training and decoding of the trained model, you may find in README.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/cmusphinx/g2p-seq2seq/issues/93#issuecomment-378598607, or mute the thread https://github.com/notifications/unsubscribe-auth/AkRYtnjlmxWPB7phEtwwv0-Gbc1Qjcc2ks5tlMm0gaJpZM4RdGCB .

krishnagovindu commented 6 years ago

Hi, i want learn how to build language model and acoustic model. where i can learn... is there any training centers in online. if yes please tell me.

Regards, krishna

On Tue, Apr 10, 2018 at 10:39 AM, Krishna Govindu < krishnagovindu.joga@gmail.com> wrote:

can u please share if do u have any video for better understanding. its very much need full to me pls pls

On Wed, Apr 4, 2018 at 6:55 PM, nurtas-m notifications@github.com wrote:

Hello, krishnagovindu! First of all, you need to derive a dictionary for the language for which you want to train g2p_seq2seq model. Then, you need to split dictionary into two files: for training and testing purposes. Each line in the files should contain word (sequence of graphemes) and it's pronunciation (sequence of phonemes): HELLO HH EH L OW BYE B AY ...

The more examples your dictionary contain, the better g2p_seq2seq model you receive after training. For example, for training English model, we apply CMUDICT-PRONALSYL dictionary with more than 100.000 words in training set and more than 10.000 words in test set.

After deriving and preparing datasets, you may start training your own model for your language. It doesn't matter which distributive of Linux you use. Just clone g2p_seq2seq and use it for your need. More info about training and decoding of the trained model, you may find in README.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/cmusphinx/g2p-seq2seq/issues/93#issuecomment-378598607, or mute the thread https://github.com/notifications/unsubscribe-auth/AkRYtnjlmxWPB7phEtwwv0-Gbc1Qjcc2ks5tlMm0gaJpZM4RdGCB .

AshishDagar commented 6 years ago

hello @krishnagovindu

Please follow below links for training acoustic model and language model

For acoustic model : https://github.com/cmusphinx/g2p-seq2seq

For language model : https://cmusphinx.github.io/wiki/tutoriallm/

Anusha-G-Rao commented 6 years ago

Thanks, so I ran the hyperparameters mentioned in the README.txt file in the folder:

python /home/ubuntu/g2p-seq2seq/g2p_seq2seq/g2p.py --train cmudict.dic.train --test cmudict.dic.test --num_layers 2 --size 512 --model model --max_steps 0

Then, when I ran the model, g2p-seq2seq --interactive --model model/, I do not get the same results as the pretrained model. First, the model only accepts uppercase words, whereas the pretrained model accepts both upper and lower case. Second, the phonetic results are different:
word: VENKATACHALAPATHI
pretrained output: V EH K AH T AA K AA L AH P AA TH IY
trained output: V EH N K AH L AA P AA TH AH DH IY

word: LAXMINARAYANA
pretrained output: L AE K S M AH N ER AY AE N AH
trained output: L AE K S M AH N ER AE N AH
Is there a different set of hyperparameters that gives the desired pretrained results?

Hi Vijay,

I see you have used g2p-seq2seq-cmudict for creating pronunciation of indian names!!

My specific case is to recognise names. Eg: List all the papers by Iryna So my first step would be to create dictionary for named entities, then train the language model.

could you tell me, using g2p-seq2seq how did your WER improve? Was it able to recognise the name accurately? What do i need to do in case i want it to recognise names which are not in the dictionary?

cmusphinx / g2p-seq2seq

Data used for pretrained g2p-seq2seq-cmudict model #93