eddieantonio / mitlm

Automatically exported from code.google.com/p/mitlm
http://code.google.com/p/mitlm
BSD 3-Clause "New" or "Revised" License

nan output when using -op devset #23

Open GoogleCodeExporter opened 8 years ago

GoogleCodeExporter commented 8 years ago
1)
set option = "-i LI -opt-alg LBFGSB "
interpolate-ngram "$trainingdata,$adaptdata" $option -wl $trigram -op $devset -eval-perp $testset
2)
set option = "-i CM -opt-alg LBFGSB "
interpolate-ngram -c "$count21,$count22" $option -if "log:sumhist:$effcount21;log:sumhist:$effcount22" -wl $trigram -op $devset -eval-perp $testset

Both of the above commands produce many nan backoff weights in the output LM, although their perplexities seem OK.
If -op $devset is not used, no nan values are produced, but the perplexities of "CM" and "GLI" are more than double that of "LI".

What version of the product are you using? On what operating system?
mitlm 0.4, on CentOS 4.7
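
To quantify the nan backoffs described above, the LM written by -wl can be scanned directly. The following is only a minimal sketch, not part of mitlm: it assumes the output is a standard ARPA text file (log10 probability, the n-gram's words, then an optional backoff weight, separated by whitespace) and that the bad weights are printed as nan or -nan. The function name find_nan_backoffs is made up for illustration.

```python
import math


def find_nan_backoffs(arpa_lines):
    """Yield (order, ngram) for each ARPA entry whose backoff weight is NaN.

    Hypothetical helper, assuming standard ARPA layout:
    log10 probability, `order` words, optional backoff weight.
    """
    order = 0  # 0 until the first "\N-grams:" section header is seen
    for line in arpa_lines:
        line = line.strip()
        # Section headers look like "\1-grams:", "\2-grams:", ...
        if line.startswith("\\") and line.endswith("-grams:"):
            order = int(line[1:line.index("-")])
            continue
        # Skip blank lines, "\data\", "\end\" and the "ngram N=..." counts.
        if not line or line.startswith("\\") or order == 0:
            continue
        fields = line.split()
        # Only entries with order + 2 fields carry a backoff weight.
        if len(fields) != order + 2:
            continue
        try:
            backoff = float(fields[-1])  # float() accepts "nan" and "-nan"
        except ValueError:
            continue
        if math.isnan(backoff):
            yield order, " ".join(fields[1:-1])
```

For example, with the LM saved as trigram.lm (a placeholder file name), `with open("trigram.lm") as f: bad = list(find_nan_backoffs(f))` collects the affected n-grams, and `len(bad)` gives their count, which makes it easier to compare runs with and without -op.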

Original issue reported on code.google.com by hu.xinhu...@gmail.com on 10 Nov 2010 at 8:15

GoogleCodeExporter commented 8 years ago
Same problem with
mitlm64/interpolate-ngram -o 2 -v corpus_50.vocab -u true "corpus1.txt, corpus2.txt, corpus3.txt, corpus4.txt, corpus5.txt, corpus6.txt" -op dev_set.txt -wl mix_2.lm
-opt-alg is left at its default, so it's LBFGS.

This problem seems to arise only with really big text corpora (>2 GB), and backoff = nan appears only for words that would otherwise have a "big" negative value.
I'm using mitlm 0.4.1 in Cygwin (cygwin1.dll version 1.7.32).

Original comment by verypret...@gmail.com on 14 Nov 2014 at 2:56