Wondering if Phonetisaurus needs dev. dataset during training or not?

GoogleCodeExporter commented 9 years ago

What steps will reproduce the problem?
1. Train the model using an aligned corpus.

What is the expected output? What do you see instead?

In general, machine learning applications such as sequitur or DirecTl-p require 
both training and dev. datasets when training the model. However, Phonetisaurus 
does not seem to require a development dataset at all. This makes me wonder 
whether the training process includes a sub-task that (randomly) selects dev. 
data from the training data.
- Could you please confirm whether this is the case?
- If the training process includes such a task, where can I find it?

What version of the product are you using? On what operating system?

The latest version of Phonetisaurus

Please provide any additional information below.

Original issue reported on code.google.com by kheangs...@gmail.com on 16 Mar 2015 at 4:36

GoogleCodeExporter commented 9 years ago

you can take a look at the download here:

https://www.dropbox.com/s/154q9yt3xenj2gr/phonetisaurus-0.8a.tgz

this has a full experimental setup for the CMU dict based on a standard 
test/train split.

there is no dev set for the alignment [EM over the full training set].  

for the model training process you can use any LM training toolkit / smoothing 
method you like.  some of these, like the [non-fixed variant of] modified 
Kneser-Ney smoothing, might require or support tuning with a dev set.  some do 
not.  if you wish to use one of these methods you would hold out some fraction 
of the aligned corpus and then use it to tune your LM.  similar story if you 
use the RnnLM extension [more details are described on the associated page].
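
a minimal sketch of that hold-out step, in case it helps.  this is not part of Phonetisaurus; the function name `split_aligned_corpus`, the 5% default fraction, and the toy entries are all assumptions made for illustration.  it only assumes the aligned corpus is a text file with one aligned entry per line.

```python
import random


def split_aligned_corpus(lines, dev_fraction=0.05, seed=0):
    """Hold out a random fraction of aligned entries as a dev set.

    Hypothetical helper, not part of Phonetisaurus. `lines` is any iterable
    of aligned-corpus lines (one entry per line). Returns (train, dev).
    """
    # Drop blank lines and trailing newlines so they never count as entries.
    entries = [ln.rstrip("\n") for ln in lines if ln.strip()]
    # A seeded generator keeps the split reproducible between runs.
    rng = random.Random(seed)
    rng.shuffle(entries)
    # Hold out at least one entry whenever the corpus is non-empty.
    n_dev = max(1, int(len(entries) * dev_fraction)) if entries else 0
    return entries[n_dev:], entries[:n_dev]


if __name__ == "__main__":
    # Toy stand-ins for aligned entries; the real file format may differ.
    toy = ["entry-%d" % i for i in range(20)]
    train, dev = split_aligned_corpus(toy, dev_fraction=0.25, seed=1)
    print(len(train), "training entries,", len(dev), "dev entries")
```

you would train the LM on the training part and tune the smoothing parameters on the dev part.  the split is made after alignment, so the alignment itself still runs EM over the full training set.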

of course the test set is held out from both the alignment and model training 
phases.

Original comment by Josef.Ro...@gmail.com on 16 Mar 2015 at 4:47

GoogleCodeExporter commented 9 years ago

Thank you so much for your quick response!
I understand it now...

Original comment by kheangs...@gmail.com on 16 Mar 2015 at 5:32

GoogleCodeExporter commented 9 years ago

by the way, if you are trying to select something based on evaluation you 
should definitely throw slearp into the mix: 
http://en.sourceforge.jp/projects/slearp/ 
i'm pretty sure it is #1 in terms of accuracy at the moment, especially for 
smaller datasets.

Original comment by Josef.Ro...@gmail.com on 19 Mar 2015 at 8:40
