emorynlp / nlp4j-old

NLP tools developed by Emory University.

StringIndexOutOfBoundsException #20

Open nickvido opened 8 years ago

nickvido commented 8 years ago

Attachments: special7.txt, config-decode-en.xml.txt

Not sure how to characterize this, other than that the tokenizer does not seem to do enough error handling or bounds checking. I tried to reduce the input as much as possible while still reproducing the issue. I would appreciate any feedback on how to pre-process the input data.

Running from the command line:

java -Xmx4g -XX:+UseConcMarkSweepGC -cp nlp4j-1.1.1.jar edu.emory.mathcs.nlp.bin/NLPDecode -c config-decode-en.xml -i special7.txt

special7.txt contains:

keywords: {words}, URL: http://anyurl.com A. Abbot, "Help BL(1) nephew,"

This appears to break the online demo as well.

java.lang.StringIndexOutOfBoundsException: String index out of range: 63
    at java.lang.String.substring(String.java:1963)
    at edu.emory.mathcs.nlp.tokenization.Tokenizer.mergeParenthesis(Tokenizer.java:650)
    at edu.emory.mathcs.nlp.tokenization.Tokenizer.finalize(Tokenizer.java:608)
    at edu.emory.mathcs.nlp.tokenization.Tokenizer.tokenizeWhiteSpaces(Tokenizer.java:165)
    at edu.emory.mathcs.nlp.tokenization.Tokenizer.tokenize(Tokenizer.java:113)
    at edu.emory.mathcs.nlp.tokenization.Tokenizer.segmentize(Tokenizer.java:133)
    at edu.emory.mathcs.nlp.decode.AbstractNLPDecoder.decodeRaw(AbstractNLPDecoder.java:221)
    at edu.emory.mathcs.nlp.decode.AbstractNLPDecoder.decode(AbstractNLPDecoder.java:182)
    at edu.emory.mathcs.nlp.decode.AbstractNLPDecoder$NLPTask.run(AbstractNLPDecoder.java:345)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
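The top frame of the trace is String.substring, which throws this exception when an index falls outside the string. The snippet below is only a minimal sketch of that failure mode next to a clamped variant. It is not the code in Tokenizer.mergeParenthesis, and `SubstringBounds` and `safeSubstring` are hypothetical names used for illustration.

```java
final class SubstringBounds {
    private SubstringBounds() {}

    /** Hypothetical helper: clamps both indices into [0, s.length()] before slicing. */
    static String safeSubstring(String s, int begin, int end) {
        int b = Math.max(0, Math.min(begin, s.length()));
        int e = Math.max(b, Math.min(end, s.length()));
        return s.substring(b, e);
    }

    public static void main(String[] args) {
        String token = "BL(1)";
        try {
            // End index 63 is far past the end of a 5-character string.
            token.substring(0, 63);
        } catch (StringIndexOutOfBoundsException ex) {
            System.out.println("unchecked: " + ex.getMessage());
        }
        // The clamped variant returns whatever part of the range exists.
        System.out.println("clamped: " + safeSubstring(token, 0, 63));
    }
}
```

Clamping hides the symptom rather than the cause, so it is better suited to diagnosing where an index goes wrong than to a permanent fix.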

jdchoi77 commented 8 years ago

I believe this is the same issue: https://github.com/emorynlp/nlp4j-tokenization/issues/7

Sorry for the bug; this should be fixed in version 1.1.2, which was just released. Please try the new version. Thanks.
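For anyone who cannot upgrade right away, one possible stopgap is to tokenize line by line and fall back to plain whitespace splitting whenever a line triggers the exception. This is a sketch under assumptions: `GuardedTokenizer` and `tokenizeLines` are hypothetical names, and the tokenizer is passed in as a generic `Function<String, List<String>>` stand-in because the actual NLP4J method signatures are not shown in this thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

final class GuardedTokenizer {
    private GuardedTokenizer() {}

    /**
     * Applies the given tokenizer to each line separately. If a line causes a
     * StringIndexOutOfBoundsException, that line alone is split on whitespace.
     */
    static List<List<String>> tokenizeLines(List<String> lines,
                                            Function<String, List<String>> tokenizer) {
        List<List<String>> result = new ArrayList<>();
        for (String line : lines) {
            try {
                result.add(tokenizer.apply(line));
            } catch (StringIndexOutOfBoundsException e) {
                // Fallback: crude whitespace tokens for the offending line only.
                List<String> fallback = new ArrayList<>();
                String trimmed = line.trim();
                if (!trimmed.isEmpty()) {
                    fallback.addAll(Arrays.asList(trimmed.split("\\s+")));
                }
                result.add(fallback);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Stand-in tokenizer (an assumption): fails on any line containing '('
        // to imitate the reported crash.
        Function<String, List<String>> flaky = line -> {
            if (line.contains("(")) {
                throw new StringIndexOutOfBoundsException("simulated failure");
            }
            return Arrays.asList(line.split(" "));
        };
        List<String> input = Arrays.asList("A. Abbot", "Help BL(1) nephew");
        System.out.println(tokenizeLines(input, flaky));
    }
}
```

Handling each line separately means one bad line does not abort the whole file, at the cost of lower-quality tokens for the lines that hit the fallback.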