AnantLabs / berkeleylm

Automatically exported from code.google.com/p/berkeleylm

ArrayIndexOutOfBoundsException when reading in a large ARPA file. #14

GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
I've written my own ARPA file generator. When I create a small test file 
with it and read it in with:

    NGramLanguageModel arpaLm = new NGramLanguageModel(arpaLmFilePath);

everything works fine. For ARPA files generated from a larger data set (see 
attached), I get an ArrayIndexOutOfBoundsException:

    Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 1
        at edu.berkeley.nlp.lm.io.ArpaLmReader.parseNGram(ArpaLmReader.java:201)
        at edu.berkeley.nlp.lm.io.ArpaLmReader.parseLine(ArpaLmReader.java:172)
        at edu.berkeley.nlp.lm.io.ArpaLmReader.parseNGrams(ArpaLmReader.java:148)
        at edu.berkeley.nlp.lm.io.ArpaLmReader.parse(ArpaLmReader.java:78)
        at edu.berkeley.nlp.lm.io.ArpaLmReader.parse(ArpaLmReader.java:18)
        at edu.berkeley.nlp.lm.io.LmReaders.firstPassCommon(LmReaders.java:549)
        at edu.berkeley.nlp.lm.io.LmReaders.firstPassArpa(LmReaders.java:526)
        at edu.berkeley.nlp.lm.io.LmReaders.readArrayEncodedLmFromArpa(LmReaders.java:171)
        at edu.berkeley.nlp.lm.io.LmReaders.readArrayEncodedLmFromArpa(LmReaders.java:151)
        at dragon.lm.NGramLanguageModel.<init>(NGramLanguageModel.java:68)
        at dragon.lm.NGramLanguageModel.main(NGramLanguageModel.java:191)

Any guidance you could give me would be appreciated! The file is encoded as 
UTF-8.

Thanks.

Here's the version of Java I'm using:

    $ java -version
    java version "1.7.0_09"
    Java(TM) SE Runtime Environment (build 1.7.0_09-b05)
    Java HotSpot(TM) 64-Bit Server VM (build 23.5-b02, mixed mode)

Original issue reported on code.google.com by samuel.m...@gmail.com on 17 Jul 2013 at 10:48

GoogleCodeExporter commented 9 years ago
After I remove all non-alphanumeric characters (save spaces), the data set is 
read in fine. 

It seems like there are some characters (` maybe?) that it can't handle. Is 
there a list of unsupported characters somewhere? Does this sound right to 
you?

Original comment by samuel.m...@gmail.com on 17 Jul 2013 at 11:12

GoogleCodeExporter commented 9 years ago
Can you print out the exact line that's failing? I don't know why it wouldn't 
handle special characters. 

Original comment by adpa...@google.com on 17 Jul 2013 at 11:35

GoogleCodeExporter commented 9 years ago
Do you mean the line in my ARPA file? I don't know, that's part of my problem.

Original comment by samuel.m...@gmail.com on 17 Jul 2013 at 11:36

GoogleCodeExporter commented 9 years ago
Here's the ARPA file that's not working.

Original comment by samuel.m...@gmail.com on 17 Jul 2013 at 11:37

Attachments:

GoogleCodeExporter commented 9 years ago
At edu.berkeley.nlp.lm.io.ArpaLmReader.parseLine(ArpaLmReader.java:172), 
add a print statement that prints |line| if the number of spaces in |line| 
is less than |ngram.length|.
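
A similar check can also be run outside the library, without editing 
ArpaLmReader. The following is only a rough sketch, not BerkeleyLM code: the 
class ArpaLineCheck and the method findMalformed are hypothetical names, and 
it assumes the common ARPA layout in which each n-gram line is 
"logprob<TAB>w1 w2 ... wN[<TAB>backoff]" under a "\N-grams:" header. It 
returns every line whose word count does not match the order of its section.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of BerkeleyLM.
class ArpaLineCheck {
    /** Returns the n-gram lines whose word count differs from their section's order. */
    static List<String> findMalformed(List<String> lines) {
        List<String> bad = new ArrayList<String>();
        int order = 0; // 0 means "not inside an n-gram section"
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            if (trimmed.startsWith("\\")) {
                // "\2-grams:" opens a section of order 2; "\data\" and "\end\" reset it.
                order = trimmed.endsWith("-grams:")
                        ? Integer.parseInt(trimmed.substring(1, trimmed.indexOf('-')))
                        : 0;
                continue;
            }
            if (order == 0) {
                continue; // header counts such as "ngram 1=3"
            }
            // Assumption: fields are tab-separated and words are space-separated.
            String[] fields = line.split("\t");
            String ngram = fields.length < 2 ? "" : fields[1].trim();
            int words = ngram.isEmpty() ? 0 : ngram.split(" +").length;
            if (words != order) {
                bad.add(line);
            }
        }
        return bad;
    }
}
```

Feeding it the lines of the failing file (for example, read with a 
BufferedReader opened as UTF-8) should narrow the problem down to the 
specific entries that have too many or too few words for their section.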

Original comment by adpa...@google.com on 17 Jul 2013 at 11:49

GoogleCodeExporter commented 9 years ago
Ah, I figured it out. My parser was giving me "words" that still had spaces in 
them, so I would write a unigram that the reader was interpreting as a bigram. 
(I am using Stanford's NLP parser to split files into sentences.)

Your hint about spaces helped. Thanks!
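
For anyone hitting the same thing, one possible guard in a custom ARPA 
writer is sketched below. This is not necessarily the fix used here: the 
class ArpaTokens and the method normalize are hypothetical names, and 
replacing inner whitespace with an underscore is just one choice (splitting 
the token into several words is another). It makes sure a token never 
contains whitespace before it is written out as a single word.

```java
// Hypothetical helper for a custom ARPA writer, not part of BerkeleyLM.
final class ArpaTokens {
    private ArpaTokens() {
    }

    /** Collapses whitespace inside a token so it is written as exactly one word. */
    static String normalize(String token) {
        // By default \s only matches ASCII whitespace, so the no-break space
        // (U+00A0) is listed explicitly in the character class.
        return token.trim().replaceAll("[\\s\\u00A0]+", "_");
    }
}
```

With this, a token such as "new york" would be written as "new_york" and 
stays a unigram when the file is read back.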

Original comment by samuel.m...@gmail.com on 18 Jul 2013 at 3:20

GoogleCodeExporter commented 9 years ago

Original comment by adpa...@gmail.com on 18 Jul 2013 at 3:30