Tokenizer breaks on empty strings, carriage returns ...

CogComp / cogcomp-nlp

CogComp's Natural Language Processing Libraries and Demos: Modules include lemmatizer, ner, pos, prep-srl, quantifier, question type, relation-extraction, similarity, temporal normalizer, tokenizer, transliteration, verb-sense, and more.

http://nlp.cogcomp.org/

Other

473 stars 142 forks source link

Tokenizer breaks on empty strings, carriage returns ... #278

Open nitishgupta opened 8 years ago

nitishgupta commented 8 years ago

I experienced that our latest tokenizer breaks on empty string "" and "\n" at the least. It produces an 'ArrayOutOfBound Exception'

danyaljj commented 8 years ago

So as you can see in #282, I added tests which contain empty string and new line, and the tests pass. So unless you give an explicit way to reproduce the exception, we won't be able to solver this.

nitishgupta commented 8 years ago

From #282 I learned that creating TextAnnotation using TextAnnotationBuilder fails on string that are empty, just carriage return, space etc. Code snippet that fails and gives 'java.lang.ArrayIndexOutOfBoundsException: -1' :

String text = " "; // or "\n" or "" 
TextAnnotationBuilder tab = new TokenizerTextAnnotationBuilder(new IllinoisTokenizer()); 
TextAnnotation ta = tab.createTextAnnotation(text);

nitishgupta commented 8 years ago

Thanks Daniel and Mark!

danyaljj commented 6 years ago

Re-opening the issue since apparently it still fails when receiving carriage returns (according to @nitishgupta )

nitishgupta commented 6 years ago

This fails when I do this in Python. I haven't checked in Java recently. Could ccg_nlpy be using an old version?

danyaljj commented 6 years ago

It is using a slightly older version; but very recent: https://github.com/CogComp/cogcomp-nlpy/blob/9e733c0d936cbd923a9e6702e36bc6d64ab887ad/ccg_nlpy/download.py#L14

Is there a way to verify that it is a purely a Java issue (and Python side is not to blame)?

nitishgupta commented 6 years ago

I can do that and update here.

nitishgupta commented 6 years ago

Space and new-line character both break the TextAnnotationBuilder.

TextAnnotationBuilder tab = new TokenizerTextAnnotationBuilder(new StatefulTokenizer());
tab.createTextAnnotation("", "", "\n");
tab.createTextAnnotation("", "", " ");

Error code:

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: -1
    at edu.illinois.cs.cogcomp.core.datastructures.textannotation.Constituent.<init>(Constituent.java:154)
    at edu.illinois.cs.cogcomp.core.datastructures.textannotation.Constituent.<init>(Constituent.java:108)
    at edu.illinois.cs.cogcomp.core.datastructures.textannotation.SpanLabelView.addSpanLabel(SpanLabelView.java:95)
    at edu.illinois.cs.cogcomp.core.datastructures.textannotation.TextAnnotation.<init>(TextAnnotation.java:88)
    at edu.illinois.cs.cogcomp.nlp.utility.TokenizerTextAnnotationBuilder.createTextAnnotation(TokenizerTextAnnotationBuilder.java:139)

mssammon commented 6 years ago

I don't think a TextAnnotation object should be created for such cases, because no annotations can be generated for them (at least, not with our current tooling). However, the failure should result in an explicit exception to this effect rather than resulting in a bounds exception.

mssammon commented 6 years ago

@nitishgupta what is the use case? would a TextAnnotation with no non-whitespace text be useful in some way?

nitishgupta commented 6 years ago

Having a TextAnnotation for these edge cases reduces the burden on the user to write checking-code. IMO, a TextAnnotation should be able to be created for any String object. Is that not the general consensus?

On Mon, Aug 6, 2018 at 4:39 PM Mark Sammons notifications@github.com wrote:

@nitishgupta https://github.com/nitishgupta what is the use case? would a TextAnnotation with no non-whitespace text be useful in some way?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/CogComp/cogcomp-nlp/issues/278#issuecomment-410845845, or mute the thread https://github.com/notifications/unsubscribe-auth/AF71bUKYHfQFz1q8gr075pUqXcmIXetwks5uOKmQgaJpZM4K3z_v .

mssammon commented 6 years ago

I disagree. User of the resulting TextAnnotation will likely have to check for empty views -- or, the unlucky client who is using their code as an intermediary will. How is this better?

nitishgupta commented 6 years ago

Okay. I get your point. Thanks for the time.

On Mon, Aug 6, 2018 at 4:49 PM Mark Sammons notifications@github.com wrote:

I disagree. User of the resulting TextAnnotation will likely have to check for empty views -- or, the unlucky client who is using their code as an intermediary will. How is this better?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/CogComp/cogcomp-nlp/issues/278#issuecomment-410848394, or mute the thread https://github.com/notifications/unsubscribe-auth/AF71bdE6YperycHTeLa6ac2dO6xvWAVgks5uOKvFgaJpZM4K3z_v .