CogComp / cogcomp-nlp

CogComp's Natural Language Processing Libraries and Demos: Modules include lemmatizer, ner, pos, prep-srl, quantifier, question type, relation-extraction, similarity, temporal normalizer, tokenizer, transliteration, verb-sense, and more.
http://nlp.cogcomp.org/

Tokenization issue #452

Closed danyaljj closed 7 years ago

danyaljj commented 7 years ago

Tom, could you look at this tokenization issue?

    @Test
    public void weirdSentences() {
        // Input that triggers the exception shown in the output below.
        String text = "You see always, oh we're going to do this, we're going to--. ";
        TextAnnotation basicTextAnnotation = null;
        try {
            basicTextAnnotation = processor.createBasicTextAnnotation("test", "test", text);
        } catch (AnnotatorException e) {
            e.printStackTrace();
            fail(e.getMessage());
        }
    }

output:

java.lang.StringIndexOutOfBoundsException: String index out of range: 0

    at java.lang.String.charAt(String.java:658)
    at edu.illinois.cs.cogcomp.nlp.tokenizer.TokenizerStateMachine$State.isAbbr(TokenizerStateMachine.java:676)
    at edu.illinois.cs.cogcomp.nlp.tokenizer.TokenizerStateMachine$5.process(TokenizerStateMachine.java:314)
    at edu.illinois.cs.cogcomp.nlp.tokenizer.TokenizerStateMachine.parseText(TokenizerStateMachine.java:610)
    at edu.illinois.cs.cogcomp.nlp.tokenizer.StatefulTokenizer.tokenizeTextSpan(StatefulTokenizer.java:79)
    at edu.illinois.cs.cogcomp.nlp.utility.TokenizerTextAnnotationBuilder.createTextAnnotation(TokenizerTextAnnotationBuilder.java:83)
    at edu.illinois.cs.cogcomp.annotation.BasicAnnotatorService.createBasicTextAnnotation(BasicAnnotatorService.java:165)
    at edu.illinois.cs.cogcomp.pipeline.main.CachingPipelineTest.weirdSentences(CachingPipelineTest.java:218)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.junit.internal.runners.TestMethod.invoke(TestMethod.java:59)
    at org.junit.internal.runners.MethodRoadie.runTestMethod(MethodRoadie.java:98)
    at org.junit.internal.runners.MethodRoadie$2.run(MethodRoadie.java:79)
    at org.junit.internal.runners.MethodRoadie.runBeforesThenTestThenAfters(MethodRoadie.java:87)
    at org.junit.internal.runners.MethodRoadie.runTest(MethodRoadie.java:77)
    at org.junit.internal.runners.MethodRoadie.run(MethodRoadie.java:42)
    at org.junit.internal.runners.JUnit4ClassRunner.invokeTestMethod(JUnit4ClassRunner.java:88)
    at org.junit.internal.runners.JUnit4ClassRunner.runMethods(JUnit4ClassRunner.java:51)
    at org.junit.internal.runners.JUnit4ClassRunner$1.run(JUnit4ClassRunner.java:44)
    at org.junit.internal.runners.ClassRoadie.runUnprotected(ClassRoadie.java:27)
    at org.junit.internal.runners.ClassRoadie.runProtected(ClassRoadie.java:37)
    at org.junit.internal.runners.JUnit4ClassRunner.run(JUnit4ClassRunner.java:42)
    at org.junit.runner.JUnitCore.run(JUnitCore.java:130)
    at com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:68)
    at com.intellij.rt.execution.junit.IdeaTestRunner$Repeater.startRunnerWithArgs(IdeaTestRunner.java:51)
    at com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:237)
    at com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:70)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at com.intellij.rt.execution.application.AppMain.main(AppMain.java:147)

FYI @mssammon
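The top frames of the trace show `String.charAt` being called from `isAbbr` with index 0 on an empty string. The body of `isAbbr` is not shown in this thread, so the following is only a minimal sketch of that failure mode and of one possible guard. The class `AbbrGuardSketch`, both method names, and the abbreviation rule itself are hypothetical stand-ins, not the tokenizer's actual code.

```java
public class AbbrGuardSketch {

    // Hypothetical unguarded check: reads the first character without
    // confirming that the token is non-empty, so "" raises
    // StringIndexOutOfBoundsException, as in the reported trace.
    public static boolean isAbbrUnguarded(String token) {
        return Character.isUpperCase(token.charAt(0)) && token.endsWith(".");
    }

    // Guarded variant: a null or empty token is never treated as an abbreviation.
    public static boolean isAbbrGuarded(String token) {
        if (token == null || token.isEmpty()) {
            return false;
        }
        return Character.isUpperCase(token.charAt(0)) && token.endsWith(".");
    }

    public static void main(String[] args) {
        try {
            isAbbrUnguarded("");
        } catch (StringIndexOutOfBoundsException e) {
            System.out.println("unguarded check threw " + e.getClass().getSimpleName());
        }
        System.out.println("guarded check on empty token: " + isAbbrGuarded(""));
        System.out.println("guarded check on Dr.: " + isAbbrGuarded("Dr."));
    }
}
```

Whether an empty token should be rejected inside the check or prevented earlier in the state machine is a design choice this thread does not settle; the sketch only shows the simpler of the two.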

cowchipkid commented 7 years ago

Ya, I’ll fix this first thing tomorrow AM.

On May 4, 2017, at 5:16 PM, Daniel Khashabi wrote:

Tom could you look at this tokenization issue?


cowchipkid commented 7 years ago

@mssammon @danyaljj I have fixed this. However, I have ongoing development in my fork on the OntoNotes 5.0 parser. What do we do in situations like this? This is a very minor fix; should we be creating branches for these one-offs? In this case, can I wait until we are ready to merge my fork?

mssammon commented 7 years ago

@cowchipkid If the OntoNotes parser is going to take longer than a few hours to complete, please create a separate branch and PR for just the tokenizer fix.