Not sure how to characterize this, other than that the tokenizer does not seem to do enough error handling or bounds checking. I reduced the input as far as I could while still reproducing the issue. I would appreciate any feedback on how to pre-process input data.
special7.txt contains:
keywords: {words}, URL: http://anyurl.com A. Abbot, "Help BL(1) nephew,"
This appears to break the online demo as well.
java.lang.StringIndexOutOfBoundsException: String index out of range: 63
at java.lang.String.substring(String.java:1963)
at edu.emory.mathcs.nlp.tokenization.Tokenizer.mergeParenthesis(Tokenizer.java:650)
at edu.emory.mathcs.nlp.tokenization.Tokenizer.finalize(Tokenizer.java:608)
at edu.emory.mathcs.nlp.tokenization.Tokenizer.tokenizeWhiteSpaces(Tokenizer.java:165)
at edu.emory.mathcs.nlp.tokenization.Tokenizer.tokenize(Tokenizer.java:113)
at edu.emory.mathcs.nlp.tokenization.Tokenizer.segmentize(Tokenizer.java:133)
at edu.emory.mathcs.nlp.decode.AbstractNLPDecoder.decodeRaw(AbstractNLPDecoder.java:221)
at edu.emory.mathcs.nlp.decode.AbstractNLPDecoder.decode(AbstractNLPDecoder.java:182)
at edu.emory.mathcs.nlp.decode.AbstractNLPDecoder$NLPTask.run(AbstractNLPDecoder.java:345)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
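For anyone reading the trace: the top frame is `String.substring`, which suggests that an index of 63 fell outside the string being sliced. The following is a minimal sketch in plain Java of that failure mode and of a bounds-clamped alternative. `SubstringBounds` and `safeSubstring` are hypothetical names used only for illustration; this is not NLP4J code and makes no claim about what `mergeParenthesis` actually does.

```java
public class SubstringBounds {
    // Hypothetical helper (not part of NLP4J): clamps both indices into
    // [0, s.length()] before slicing, so an oversized index cannot throw.
    static String safeSubstring(String s, int begin, int end) {
        int b = Math.max(0, Math.min(begin, s.length()));
        int e = Math.max(b, Math.min(end, s.length()));
        return s.substring(b, e);
    }

    public static void main(String[] args) {
        String token = "BL(1)"; // 5 characters, taken from the failing input

        try {
            // End index 63 is past the end of the string, so this throws.
            token.substring(0, 63);
        } catch (StringIndexOutOfBoundsException ex) {
            // The exact message text varies between JDK versions.
            System.out.println("caught: " + ex.getMessage());
        }

        // The clamped version returns the whole token instead of throwing.
        System.out.println(safeSubstring(token, 0, 63));
    }
}
```

Clamping hides the symptom rather than fixing whatever computed the bad index, so it is better read as an illustration of the missing bounds check than as a proposed patch.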
Attachments: special7.txt, config-decode-en.xml.txt
Running from command line: java -Xmx4g -XX:+UseConcMarkSweepGC -cp nlp4j-1.1.1.jar edu.emory.mathcs.nlp.bin/NLPDecode -c config-decode-en.xml -i special7.txt