Tokenizer java.lang.StringIndexOutOfBoundsException

nartz commented 8 years ago

Hi - I'm running this on some text that is erroring out. It's sensitive text, so unfortunately I can't provide it, but I may be able to dig into it more in a debugger at some point. For now, it looks like there may be a bounds error somewhere. This is with nlp4j-1.1.1.jar (English model). I see some recent commits that rewrote some of this code, so maybe it's already fixed.

java.lang.StringIndexOutOfBoundsException: String index out of range: 44826
    at java.lang.String.substring(String.java:1963)
    at edu.emory.mathcs.nlp.tokenization.Tokenizer.mergeParenthesis(Tokenizer.java:650)
    at edu.emory.mathcs.nlp.tokenization.Tokenizer.finalize(Tokenizer.java:608)
    at edu.emory.mathcs.nlp.tokenization.Tokenizer.tokenizeWhiteSpaces(Tokenizer.java:165)
    at edu.emory.mathcs.nlp.tokenization.Tokenizer.tokenize(Tokenizer.java:113)
    at edu.emory.mathcs.nlp.tokenization.Tokenizer.segmentize(Tokenizer.java:133)
    at edu.emory.mathcs.nlp.decode.AbstractNLPDecoder.decodeRaw(AbstractNLPDecoder.java:221)
    at edu.emory.mathcs.nlp.decode.AbstractNLPDecoder.decode(AbstractNLPDecoder.java:182)
    at edu.emory.mathcs.nlp.decode.AbstractNLPDecoder$NLPTask.run(AbstractNLPDecoder.java:345)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
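For readers unfamiliar with this exception: the top frame of the trace is String.substring, which throws when an index falls outside the string. The sketch below reproduces that general failure mode and shows one defensive pattern. It is only an illustration; safeSubstring is a hypothetical helper, not part of NLP4J, and nothing here is meant to describe what mergeParenthesis actually does or how the bug was fixed.

```java
public class SubstringBounds {
    // Hypothetical helper (not an NLP4J method): clamps both indices
    // into the valid range [0, s.length()] before calling substring.
    static String safeSubstring(String s, int begin, int end) {
        int e = Math.max(0, Math.min(end, s.length()));
        int b = Math.max(0, Math.min(begin, e));
        return s.substring(b, e);
    }

    public static void main(String[] args) {
        String text = "(abc)";
        try {
            // The end index is one past the end of the string, so this throws.
            text.substring(0, text.length() + 1);
        } catch (StringIndexOutOfBoundsException ex) {
            System.out.println("caught: " + ex.getClass().getSimpleName());
        }
        // The clamped version returns the tail of the string instead.
        System.out.println(safeSubstring(text, 1, 99));
    }
}
```

Clamping hides the symptom rather than the cause, so in a tokenizer it would usually be better to find out why the offset was computed past the end of the text.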

jdchoi77 commented 8 years ago

This is quite strange, because line 650 of Tokenizer is a commented-out line. I just released a snapshot, nlp4j-tokenization-1.1.2-SNAPSHOT, with the latest code, so could you please try it out and let me know? If this fixes the issue, I'll make another minor release. Thanks.

elithrion commented 8 years ago

I got the same error when trying to parse some (freely available) novels with all default settings with the 1.1.1 jar. One file that errored is attached.

java.lang.StringIndexOutOfBoundsException: String index out of range: 3423
    at java.lang.String.substring(String.java:1963)
    at edu.emory.mathcs.nlp.tokenization.Tokenizer.mergeParenthesis(Tokenizer.java:650)
    at edu.emory.mathcs.nlp.tokenization.Tokenizer.finalize(Tokenizer.java:608)
    at edu.emory.mathcs.nlp.tokenization.Tokenizer.tokenizeWhiteSpaces(Tokenizer.java:165)
    ...

(I'll just leave the testing to you.)

Unearthly-2_5.txt
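A whole novel is a large reproduction case, and the original report could not share its text at all. One way to narrow such an input down is to keep halving it while the failure persists. The following is a minimal sketch of that idea, assuming the failure can be wrapped in a Predicate<String>. The names InputShrinker, shrink and fakeTokenize are hypothetical, and fakeTokenize is only a stand-in for the real tokenizer call; its trigger (an unclosed parenthesis) is an arbitrary choice for the demo, not a diagnosis of this bug.

```java
import java.util.function.Predicate;

public class InputShrinker {
    // Repeatedly halves the text, keeping whichever half still fails.
    // Assumes the full text fails; stops when neither half fails alone.
    static String shrink(String text, Predicate<String> fails) {
        String current = text;
        while (current.length() > 1) {
            int mid = current.length() / 2;
            String left = current.substring(0, mid);
            String right = current.substring(mid);
            if (fails.test(left)) {
                current = left;
            } else if (fails.test(right)) {
                current = right;
            } else {
                break; // the failure needs text from both halves
            }
        }
        return current;
    }

    // Stand-in for the real tokenizer call (assumption for this demo only):
    // throws when the text has a '(' with no ')' after it.
    static void fakeTokenize(String s) {
        int open = s.indexOf('(');
        if (open >= 0 && s.indexOf(')', open) < 0) {
            s.substring(open, s.length() + 1); // deliberately out of range
        }
    }

    // Treats a StringIndexOutOfBoundsException as "this input fails".
    static boolean fails(String s) {
        try {
            fakeTokenize(s);
            return false;
        } catch (StringIndexOutOfBoundsException ex) {
            return true;
        }
    }

    public static void main(String[] args) {
        String smallest = shrink("aaaa(bbbb", InputShrinker::fails);
        System.out.println("smallest failing chunk: \"" + smallest + "\"");
    }
}
```

To use this against the real library, fakeTokenize would be replaced with the actual tokenizer invocation. A shrunken chunk of sensitive text may still be sensitive, so it would need a review before being posted.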

jdchoi77 commented 8 years ago

Thanks for providing the data; I fixed this bug and will include the fix in the next minor release (either tonight or tomorrow). Please sign up for our discussion group if you haven't already, so you'll get the notification for the new release.

https://groups.google.com/forum/#!forum/emorynlp

jdchoi77 commented 8 years ago

Sorry for taking so long; I just released version 1.1.2, which should have this fixed. Thanks.

emorynlp / nlp4j-tokenization