andrewphamvn / mitlm

Automatically exported from code.google.com/p/mitlm
BSD 3-Clause "New" or "Revised" License

segfault for interpolate-ngram #18

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago

Hi, I'm trying to interpolate two fairly straightforward 3-gram LMs with the
interpolate-ngram tool.

The command I'm running is:
-------------------
$ interpolate-ngram -o 3 -l lm1.arpa,lm2.arpa -wl lm1lm2.arpa
Loading component LM lm1.arpa...
Loading component LM lm2.arpa...
Segmentation fault
-------------------

The first LM was created with the estimate-ngram tool from a fairly small
training text (approx. 70 MB):

$ estimate-ngram -t lm1.txt -wl lm1.arpa -o 3

The second LM is the Gigaword 64k NVP 3-gram model from Keith Vertanen's
open source LM page:

http://www.keithv.com/software/giga/

My guess is that there is something about the KV model that
interpolate-ngram doesn't like, but it isn't terribly clear what that might be.

Also, neither of the vocabularies is a subset of the other (although I
don't know whether or not that is relevant).

Original issue reported on code.google.com by Josef.Ro...@gmail.com on 28 Feb 2010 at 1:47

GoogleCodeExporter commented 9 years ago
Never mind. I was just being an idiot.

The segfault was caused by inconsistent casing of some vocabulary terms in the
Vertanen model, a problem which I introduced myself. Specifically, some
instances of a word were uppercase while other instances of the same word in
other n-grams were lowercase. Once I fixed this, the problem disappeared.
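
For anyone who wants to look for this kind of casing mismatch before running interpolate-ngram, here is a minimal sketch in Python. It is not part of mitlm; the function names `iter_arpa_ngrams` and `find_case_variants` are made up for illustration, and it assumes a plain-text ARPA file in which each entry is a log-probability, the words, and an optional backoff weight separated by whitespace. A hit is only a candidate problem, since a case-sensitive model can legitimately contain both forms of a word.

```python
import re
from collections import defaultdict


def iter_arpa_ngrams(lines):
    """Yield (order, words) for each n-gram entry in ARPA-format lines."""
    order = 0  # 0 means "not inside an \N-grams: section"
    for raw in lines:
        line = raw.strip()
        if line.startswith("\\"):
            # Section markers: \data\, \1-grams:, \2-grams:, ..., \end\
            m = re.fullmatch(r"\\(\d+)-grams:", line)
            order = int(m.group(1)) if m else 0
            continue
        if order == 0 or not line:
            continue  # header counts ("ngram 1=...") and blank lines
        fields = line.split()
        # fields[0] is the log-probability; the next `order` fields are words.
        yield order, tuple(fields[1:1 + order])


def find_case_variants(lines):
    """Map each lowercased word to its surface forms, if it has several."""
    forms = defaultdict(set)
    for _, words in iter_arpa_ngrams(lines):
        for word in words:
            forms[word.lower()].add(word)
    return {key: sorted(v) for key, v in forms.items() if len(v) > 1}
```

It could be run with something like `with open("lm2.arpa") as f: print(find_case_variants(f))`, where the file name is just the one from the command above.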

Original comment by Josef.Ro...@gmail.com on 28 Feb 2010 at 2:16

GoogleCodeExporter commented 9 years ago
If you ever encounter a segfault, I recommend recompiling the sources with
"make DEBUG=1", which turns on all assertions. The application will then
probably stop on a failed assertion, which gives you a much better idea of
what happened.

For example, I found out that if there is a 3-gram "A B C", the 2-gram "B C"
must also be present in the ARPA model; otherwise an assertion fails, which was
probably the reason for my segfaults. The missing 2-grams were probably caused
by LM pruning with SRILM.
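
The suffix condition described above can be checked on the ARPA file itself before loading it. The following is a rough Python sketch, not mitlm code: the names `iter_arpa_ngrams` and `find_missing_suffixes` are hypothetical, the parser assumes whitespace-separated ARPA entries (log-probability, words, optional backoff weight), and applying the check at every order rather than only to 3-grams is an assumption on my part.

```python
import re


def iter_arpa_ngrams(lines):
    """Yield (order, words) for each n-gram entry in ARPA-format lines."""
    order = 0  # 0 means "not inside an \N-grams: section"
    for raw in lines:
        line = raw.strip()
        if line.startswith("\\"):
            # Section markers: \data\, \1-grams:, \2-grams:, ..., \end\
            m = re.fullmatch(r"\\(\d+)-grams:", line)
            order = int(m.group(1)) if m else 0
            continue
        if order == 0 or not line:
            continue  # header counts ("ngram 1=...") and blank lines
        fields = line.split()
        # fields[0] is the log-probability; the next `order` fields are words.
        yield order, tuple(fields[1:1 + order])


def find_missing_suffixes(lines):
    """Return n-grams (n >= 2) whose suffix (all but the first word) is absent.

    Example: the 3-gram ("A", "B", "C") is reported if ("B", "C") is not
    listed anywhere in the model.
    """
    entries = [words for _, words in iter_arpa_ngrams(lines)]
    seen = set(entries)
    return [w for w in entries if len(w) >= 2 and w[1:] not in seen]
```

An empty result means every listed n-gram has its suffix in the file; anything returned is a candidate for the assertion failure mentioned above.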

Miso Fapso

Original comment by michal.f...@gmail.com on 21 May 2010 at 6:29