nan output when using -op devset

eddieantonio / mitlm

Automatically exported from code.google.com/p/mitlm

BSD 3-Clause "New" or "Revised" License

1)
set option = "-i LI -opt-alg LBFGSB "
interpolate-ngram "$trainingdata,$adaptdata" $option -wl $trigram -op $devset -eval-perp $testset

2)
set option = "-i CM -opt-alg LBFGSB "
interpolate-ngram -c "$count21,$count22" $option -if "log:sumhist:$effcount21;log:sumhist:$effcount22" -wl $trigram -op $devset -eval-perp $testset

Both of the above commands produce many nan backoff weights in the output LM.
However, their perplexities seem OK.
If -op $devset is not used, no nan values are produced, but the perplexities of
"CM" and "GLI" are more than double that of "LI".
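
To quantify how many entries are affected, a small script can scan the written LM for nan backoff weights. The following is a minimal sketch, not part of mitlm; it assumes the file written by -wl is ARPA-format text, where each entry is a log-probability, the n-gram words, and an optional trailing backoff weight. The function name count_nan_backoffs is hypothetical.

```python
import math
from collections import Counter


def count_nan_backoffs(lines):
    """Count entries per n-gram order whose trailing backoff weight is NaN.

    Assumes ARPA-style text: "\\N-grams:" section headers followed by
    whitespace-separated entries "logprob w1 ... wN [backoff]".
    """
    order = 0  # 0 means "not inside an N-grams section"
    nan_counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("\\"):
            # "\1-grams:" sets the order; "\data\" and "\end\" reset it.
            head = line[1:].split("-grams:")[0]
            is_section = line.endswith("-grams:") and head.isdigit()
            order = int(head) if is_section else 0
            continue
        if order == 0:
            continue
        fields = line.split()
        # Only entries with logprob + N words + backoff carry a backoff weight.
        if len(fields) != order + 2:
            continue
        try:
            backoff = float(fields[-1])  # float() parses "nan" and "-nan"
        except ValueError:
            continue
        if math.isnan(backoff):
            nan_counts[order] += 1
    return dict(nan_counts)
```

For example, `with open("trigram.lm") as f: print(count_nan_backoffs(f))` would print a mapping from n-gram order to the number of nan backoffs, where "trigram.lm" stands in for whatever path was passed to -wl.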

What version of the product are you using? On what operating system?
mitlm 0.4, on CentOS 4.7

Original issue reported on code.google.com by hu.xinhu...@gmail.com on 10 Nov 2010 at 8:15

Same problem with mitlm64/interpolate-ngram -o 2 -v corpus_50.vocab -u true "corpus1.txt, corpus2.txt, corpus3.txt, corpus4.txt, corpus5.txt, corpus6.txt" -op dev_set.txt -wl mix_2.lm -opt-alg default, so it's LBFGS. It seems that this problem arises only with really big text corpora (>2 GB), and backoff = nan appears only for words that should otherwise have a "big" negative value. I'm using mitlm 0.4.1 in Cygwin (cygwin1.dll version 1.7.32).

nan output when using -op devset #23