Open nitishgupta opened 8 years ago
So as you can see in #282, I added tests which contain an empty string and a newline, and the tests pass. So unless you give an explicit way to reproduce the exception, we won't be able to solve this.
From #282 I learned that creating a TextAnnotation using TextAnnotationBuilder fails on strings that are empty or contain only whitespace (a carriage return, a space, etc.). Code snippet that fails with 'java.lang.ArrayIndexOutOfBoundsException: -1':
String text = " "; // or "\n" or ""
TextAnnotationBuilder tab = new TokenizerTextAnnotationBuilder(new IllinoisTokenizer());
TextAnnotation ta = tab.createTextAnnotation(text);
Thanks Daniel and Mark!
Re-opening the issue since apparently it still fails when receiving carriage returns (according to @nitishgupta).
This fails when I do this in Python. I haven't checked in Java recently. Could ccg_nlpy be using an old version?
It is using a slightly older version, but still a very recent one: https://github.com/CogComp/cogcomp-nlpy/blob/9e733c0d936cbd923a9e6702e36bc6d64ab887ad/ccg_nlpy/download.py#L14
Is there a way to verify that it is purely a Java issue (and that the Python side is not to blame)?
I can do that and update here.
A space and a newline character both break the TextAnnotationBuilder.
TextAnnotationBuilder tab = new TokenizerTextAnnotationBuilder(new StatefulTokenizer());
tab.createTextAnnotation("", "", "\n");
tab.createTextAnnotation("", "", " ");
Error code:
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: -1
at edu.illinois.cs.cogcomp.core.datastructures.textannotation.Constituent.<init>(Constituent.java:154)
at edu.illinois.cs.cogcomp.core.datastructures.textannotation.Constituent.<init>(Constituent.java:108)
at edu.illinois.cs.cogcomp.core.datastructures.textannotation.SpanLabelView.addSpanLabel(SpanLabelView.java:95)
at edu.illinois.cs.cogcomp.core.datastructures.textannotation.TextAnnotation.<init>(TextAnnotation.java:88)
at edu.illinois.cs.cogcomp.nlp.utility.TokenizerTextAnnotationBuilder.createTextAnnotation(TokenizerTextAnnotationBuilder.java:139)
I don't think a TextAnnotation object should be created for such cases, because no annotations can be generated for them (at least, not with our current tooling). However, the failure should result in an explicit exception to this effect rather than resulting in a bounds exception.
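To make the "explicit exception" idea concrete, here is a minimal sketch of the kind of pre-check that could run before tokenization. The class `BlankTextGuard` and both of its methods are hypothetical names made up for illustration; nothing here is existing cogcomp-nlp API, and the choice of `IllegalArgumentException` is an assumption, not something decided in this thread.

```java
/** Hypothetical pre-check for illustration only; not part of cogcomp-nlp. */
final class BlankTextGuard {
    private BlankTextGuard() {}

    /** Returns true if the text contains at least one non-whitespace character. */
    static boolean hasAnnotatableText(String text) {
        if (text == null) {
            return false;
        }
        for (int i = 0; i < text.length(); i++) {
            if (!Character.isWhitespace(text.charAt(i))) {
                return true;
            }
        }
        return false;
    }

    /** Returns the text unchanged, or fails with a descriptive message instead of a bounds error. */
    static String requireAnnotatableText(String text) {
        if (!hasAnnotatableText(text)) {
            throw new IllegalArgumentException(
                    "Cannot create a TextAnnotation: text is null, empty, or whitespace-only");
        }
        return text;
    }

    public static void main(String[] args) {
        // The first three inputs are the ones reported as failing in this issue.
        for (String s : new String[] {"", " ", "\n", "Hello world."}) {
            try {
                requireAnnotatableText(s);
                System.out.println("accepted");
            } catch (IllegalArgumentException e) {
                System.out.println("rejected: " + e.getMessage());
            }
        }
    }
}
```

A builder could call something like `requireAnnotatableText(text)` as its first step, so the caller sees the reason for the failure rather than an `ArrayIndexOutOfBoundsException: -1` raised deep inside `Constituent`.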
@nitishgupta what is the use case? would a TextAnnotation with no non-whitespace text be useful in some way?
Having a TextAnnotation for these edge cases reduces the burden on the user to write checking-code. IMO, a TextAnnotation should be able to be created for any String object. Is that not the general consensus?
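For reference, the checking code in question can be small. Below is a rough sketch of a caller-side wrapper, assuming the builder call is passed in as a plain `Function`. `SafeAnnotate` and `annotateIfNonBlank` are hypothetical names for illustration and are not part of cogcomp-nlp.

```java
import java.util.Optional;
import java.util.function.Function;

/** Hypothetical caller-side wrapper for illustration only; not part of cogcomp-nlp. */
final class SafeAnnotate {
    private SafeAnnotate() {}

    /**
     * Applies the builder only when the text has non-whitespace content.
     * Returns Optional.empty() for null, "", " ", "\n" and similar inputs.
     */
    static <T> Optional<T> annotateIfNonBlank(String text, Function<String, T> builder) {
        if (text == null || text.trim().isEmpty()) {
            return Optional.empty();
        }
        return Optional.ofNullable(builder.apply(text));
    }
}
```

With the real library, the second argument would presumably be something like `tab::createTextAnnotation`, so blank inputs never reach the tokenizer and the caller handles the empty `Optional` in one place.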
I disagree. The user of the resulting TextAnnotation will likely have to check for empty views -- or the unlucky client who is using their code as an intermediary will. How is this better?
Okay. I get your point. Thanks for the time.
I found that our latest tokenizer breaks on at least the empty string "" and "\n". It throws an ArrayIndexOutOfBoundsException.