amit-bhavsar / mitlm

Automatically exported from code.google.com/p/mitlm
BSD 3-Clause "New" or "Revised" License

Linear interpolation of LMs with --optimize-perplexity crashes #2

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
When I run something like:

interpolate-ngram -l lm1.mitlm lm2.mitlm --write-lm tmp3.arpa.gz
--optimize-perplexity dev.txt

Loading component LM lm1.mitlm...
Loading component LM lm2.mitlm...
Interpolating component LMs...
Interpolation Method = LI
Loading development set dev.txt...
Segmentation fault (core dumped)

gdb shows:

(gdb) bt
#0  0x0000000000447d9f in PerplexityOptimizer::LoadCorpus
(this=0x7fffd28165d0, corpusFile=Variable "corpusFile" is not available.
) at src/util/FastIO.h:54
#1  0x0000000000479ee5 in main (argc=8, argv=0x7fffd2816de8) at
src/interpolate-ngram.cpp:270

I'm using mitlm from SVN under Linux, amd64.

Original issue reported on code.google.com by alu...@gmail.com on 3 Dec 2008 at 2:20

GoogleCodeExporter commented 9 years ago

Original comment by bojune...@gmail.com on 4 Dec 2008 at 8:00

GoogleCodeExporter commented 9 years ago
Hi alumae,

Does the development set corpus dev.txt exist in the current directory?  The stack
trace and code suggest that dev.txt does not exist.  The development set corpus is
used to tune the interpolation parameters.

Paul

Original comment by bojune...@gmail.com on 8 Dec 2008 at 4:00

GoogleCodeExporter commented 9 years ago
Yes, dev.txt exists:

$ ~/lbin/mitlm-svn/interpolate-ngram -l  tmp.mitlm tmp2.mitlm --write-lm
tmp3.arpa.gz --optimize-perplexity  dev.txt
Loading component LM tmp.mitlm...
Loading component LM tmp2.mitlm...
Interpolating component LMs...
Interpolation Method = LI
Loading development set dev.txt...
Segmentation fault

$ wc dev.txt
  4228  72175 434012 dev.txt

$ head -2 dev.txt
bonjour {breath}
investiture aujourd'hui à Bamako Mali ...

$ gdb -c core.32543 ~/lbin/mitlm-svn/interpolate-ngram

[...]

(gdb) bt
#0  0x00000000004481f1 in PerplexityOptimizer::LoadCorpus (this=0x7fffefd8a8d0,
corpusFile=Variable "corpusFile" is not available.
) at src/util/FastIO.h:54
#1  0x000000000047a4c6 in main (argc=8, argv=0x7fffefd8b1a8) at
src/interpolate-ngram.cpp:270

Does it work for you?

Original comment by alu...@gmail.com on 8 Dec 2008 at 11:04

GoogleCodeExporter commented 9 years ago
BTW, if dev.txt didn't exist, I would get a different error:

~/lbin/mitlm-svn/interpolate-ngram -l  tmp.mitlm tmp2.mitlm --write-lm
tmp3.arpa.gz --optimize-perplexity foooo.txt
Loading component LM tmp.mitlm...
Loading component LM tmp2.mitlm...
Interpolating component LMs...
Interpolation Method = LI
Loading development set foooo.txt...
terminate called after throwing an instance of 'std::runtime_error'
  what():  Cannot open file
Aborted (core dumped)

Original comment by alu...@gmail.com on 8 Dec 2008 at 11:08

GoogleCodeExporter commented 9 years ago
I am having a bit of difficulty reproducing this.  It works with my data files.  If
possible, can you please send me your data files so I can try to reproduce this?
Also, can you try getting the stack trace with a debug build?  Thanks.

make clean
make DEBUG=1

Original comment by bojune...@gmail.com on 8 Dec 2008 at 4:20

GoogleCodeExporter commented 9 years ago
With DEBUG=1, I get the following error:
$ ~/lbin/mitlm-svn/interpolate-ngram -l  tmp1.mitlm tmp2.mitlm --write-lm
tmp3.arpa.gz --optimize-perplexity  dev.txt
Loading component LM tmp1.mitlm...
Loading component LM tmp2.mitlm...
Interpolating component LMs...
interpolate-ngram: src/vector/VectorOps.h:348: void MaskAssign(const Vector<I>&,
const Vector<R>&, Vector<F>&) [with M = VectorClosure<OpEqual, DenseVector<double>,
Scalar<int> >, I = VectorClosure<OpMult, IndirectVectorClosure<DenseVector<double>,
DenseVector<int> >, IndirectVectorClosure<DenseVector<double>, DenseVector<int> > >,
O = DenseVector<double>]: Assertion `mask.impl().length() == input.impl().length()'
failed.
Aborted (core dumped)

Backtrace from gdb:
(gdb) bt
#0  0x00000035c102ee25 in raise () from /lib64/libc.so.6
#1  0x00000035c1030770 in abort () from /lib64/libc.so.6
#2  0x00000035c1028616 in __assert_fail () from /lib64/libc.so.6
#3  0x000000000042d769 in MaskAssign<VectorClosure<OpEqual, DenseVector<double>,
Scalar<int> >, VectorClosure<OpMult, IndirectVectorClosure<DenseVector<double>,
DenseVector<int> >, IndirectVectorClosure<DenseVector<double>, DenseVector<int> > >,
DenseVector<double> > (mask=@0x7fff95524c80, input=@0x7fff95524c20, output=@0x5b29b0)
    at src/vector/VectorOps.h:348
#4  0x00000000004260af in NgramLMBase::SetModel (this=0x5b3ac0, m=@0x7fff95525038,
vocabMap=@0x7fff95524d40, ngramMap=@0x7fff95524d80) at src/NgramLM.cpp:129
#5  0x000000000043256d in InterpolatedNgramLM::LoadLMs (this=0x7fff95525030,
lms=@0x7fff95525390) at src/InterpolatedNgramLM.cpp:63
#6  0x000000000046cace in main (argc=8, argv=0x7fff95525928) at
src/interpolate-ngram.cpp:194

I attached my 2 text files and the dev.txt file. LMs were produced by:
estimate-ngram -read-text tmp1.txt --write-binary-lm tmp1.mitlm
estimate-ngram -read-text tmp2.txt --write-binary-lm tmp2.mitlm

Original comment by alu...@gmail.com on 8 Dec 2008 at 4:30
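
Editor's note: the failed assertion in the debug build above checks that the mask and the input vector have the same length. The Python sketch below illustrates that kind of precondition only. The function name `mask_assign` and its copy-where-true semantics are assumptions inferred from the name `MaskAssign`, not mitlm's actual code. In a non-debug build such a check is presumably compiled out, which would be consistent with the crash only surfacing later, in `LoadCorpus`.

```python
def mask_assign(mask, values, output):
    """Copy values[i] into output[i] wherever mask[i] is true.

    Assumed semantics, inferred from the function name only; this is not
    mitlm's implementation.
    """
    # Same precondition as the failed assertion: mask and input lengths match.
    if len(mask) != len(values):
        raise ValueError(
            f"mask length {len(mask)} != input length {len(values)}"
        )
    for i, keep in enumerate(mask):
        if keep:
            output[i] = values[i]
    return output


probs = [0.0, 0.0, 0.0]
mask_assign([True, False, True], [0.5, 0.25, 0.125], probs)
print(probs)

try:
    # A length mismatch trips the check instead of reading out of bounds.
    mask_assign([True, False], [0.5, 0.25, 0.125], probs)
except ValueError as error:
    print(error)
```

Failing early on the length mismatch points at the code that built the vectors (here, the LM loading step) rather than at whichever later step happens to touch the bad data.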


GoogleCodeExporter commented 9 years ago

Original comment by bojune...@gmail.com on 8 Dec 2008 at 5:47

GoogleCodeExporter commented 9 years ago
This issue only affects binary LM files.  As the binary version number has been
changed, all binary files need to be rebuilt.

- Modified binary representation of Vocab to explicitly store length.
- Reading NgramVector from binary file did not update words() and hists() views.
- Incremented binary file version number.

Original comment by bojune...@gmail.com on 8 Dec 2008 at 9:45
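
Editor's note: the fix described above stores the vocabulary length explicitly and bumps the binary file version so that old files are rejected rather than misread. A minimal Python sketch of that general pattern follows. The magic bytes, version number, field layout, and function names are all hypothetical and do not reflect mitlm's real binary format, which is written in C++.

```python
import io
import struct

# Hypothetical constants for illustration; mitlm's real header differs.
MAGIC = b"DEMO"
FORMAT_VERSION = 2


def write_vocab(stream, words):
    """Write a header, an explicit word count, then length-prefixed words."""
    stream.write(MAGIC)
    stream.write(struct.pack("<II", FORMAT_VERSION, len(words)))
    for word in words:
        data = word.encode("utf-8")
        stream.write(struct.pack("<I", len(data)))
        stream.write(data)


def read_exact(stream, size):
    """Read exactly `size` bytes or fail loudly on a truncated file."""
    data = stream.read(size)
    if len(data) != size:
        raise ValueError("truncated file")
    return data


def read_vocab(stream):
    """Read a vocabulary, rejecting files written in another format version."""
    if read_exact(stream, 4) != MAGIC:
        raise ValueError("not a demo vocabulary file")
    version, count = struct.unpack("<II", read_exact(stream, 8))
    if version != FORMAT_VERSION:
        raise ValueError(
            f"unsupported format version {version}; rebuild the file"
        )
    words = []
    for _ in range(count):
        (size,) = struct.unpack("<I", read_exact(stream, 4))
        words.append(read_exact(stream, size).decode("utf-8"))
    return words


buffer = io.BytesIO()
write_vocab(buffer, ["bonjour", "aujourd'hui", "à"])
buffer.seek(0)
print(read_vocab(buffer))
```

Because the reader checks the version before trusting any lengths, a file produced by an older writer fails with a clear message instead of producing vectors whose sizes disagree.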