danpovey / pocolm

Small language toolkit for creation, interpolation and pruning of ARPA language models

default metaparameters, training phone-based trigram on single corpus, #90

Open KarelVesely84 opened 7 years ago

KarelVesely84 commented 7 years ago

Hi, I'd like to ask: what is the actual meaning of the 'metaparameter' floats that can be given to the training script? Is there a guideline for how to manually choose reasonable default 'metaparameters', say when training a phone-based trigram LM from a single corpus? (This can be handy when the automatic method fails.) Thanks! K.

danpovey commented 7 years ago

Hm. They're various discounting-related parameters that appear in a formula, and at this second I can't recall the exact meaning of each one. I assume the amount of data you have is pretty small; in that case it may be easier to just write a script to estimate a Kneser-Ney LM. Maybe we could have someone here help with that; it would be a good exercise for some of the students. It's a shame that pocolm doesn't handle these corner cases very well. Are you avoiding SRILM because of license reasons?
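For the small-data case Dan describes, the "write a script to estimate a Kneser-Ney LM" suggestion can be sketched in plain Python. This is a minimal interpolated Kneser-Ney *bigram* estimator (a trigram version adds one more backoff level in the same pattern); the function name and the default discount of 0.75 are illustrative choices, not anything from pocolm:

```python
from collections import Counter

def kneser_ney_bigram(tokens, discount=0.75):
    """Estimate an interpolated Kneser-Ney bigram model from a token list.
    Returns a function prob(w, prev) giving P(w | prev)."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    contexts = Counter(tokens[:-1])               # c(u): count of each context
    # continuation count: in how many distinct contexts does w appear?
    continuations = Counter(w for (_, w) in bigrams)
    # follower count: how many distinct words follow context u?
    followers = Counter(u for (u, _) in bigrams)
    total_bigram_types = len(bigrams)

    def prob(w, prev):
        p_cont = continuations[w] / total_bigram_types
        c_u = contexts[prev]
        if c_u == 0:
            return p_cont                          # unseen context: pure backoff
        lam = discount * followers[prev] / c_u     # interpolation weight
        return max(bigrams[(prev, w)] - discount, 0.0) / c_u + lam * p_cont

    return prob
```

Because the discounted mass is redistributed via the continuation distribution, the probabilities over the vocabulary sum to one for any seen context.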


wantee commented 7 years ago

Are you talking about the --bypass-metaparameter-optimization option provided by train_lm.py? If so, I don't think you can choose any default values manually. To get appropriate numbers for them, one has to run train_lm.py once without that option and find the numbers in the log. I think this option is just a way to speed up training when someone else wants to reproduce a model on the same dataset. If your dataset is small, you can ignore this option and run the full training; it won't take much time.
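The "find the numbers in the log" step could itself be scripted. This sketch just pulls decimal floats out of a log excerpt and formats them for the flag; the sample log line used in testing is invented, so check the actual output of your train_lm.py run before relying on a pattern like this:

```python
import re

def bypass_flag_from_log(log_text):
    """Extract optimized metaparameter floats from a train_lm.py log excerpt
    and format them as a --bypass-metaparameter-optimization argument.
    ASSUMPTION: the values appear as plain decimals (e.g. 0.829) on one line
    of the log; the real pocolm log format may differ."""
    floats = re.findall(r'\d+\.\d+', log_text)
    return '--bypass-metaparameter-optimization=' + ','.join(floats)
```

On a reproduction run you would paste the resulting flag into the train_lm.py command line to skip the optimization step.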