CogComp / cogcomp-nlp

CogComp's Natural Language Processing Libraries and Demos: Modules include lemmatizer, ner, pos, prep-srl, quantifier, question type, relation-extraction, similarity, temporal normalizer, tokenizer, transliteration, verb-sense, and more.
http://nlp.cogcomp.org/

Curator tokenization has changed? #553

Closed danyaljj closed 7 years ago

danyaljj commented 7 years ago

I keep getting this exception when trying to copy a view from a TextAnnotation created by curator to a TextAnnotation created by pipeline.

[error] java.lang.IllegalArgumentException: Span [0, 1] already labeled.
[error]     at edu.illinois.cs.cogcomp.core.datastructures.textannotation.SpanLabelView.addSpanLabel(SpanLabelView.java:89)
[error]     at edu.illinois.cs.cogcomp.core.datastructures.textannotation.TokenLabelView.addTokenLabel(TokenLabelView.java:45)
[error]     at edu.illinois.cs.cogcomp.curator.CuratorDataStructureInterface.alignLabelingToTokenLabelView(CuratorDataStructureInterface.java:501)
[error]     at edu.illinois.cs.cogcomp.curator.CuratorClient.getTextAnnotationView(CuratorClient.java:232)
[error]     at edu.illinois.cs.cogcomp.curator.CuratorAnnotator.addView(CuratorAnnotator.java:52)
[error]     at edu.illinois.cs.cogcomp.annotation.Annotator.lazyAddView(Annotator.java:181)
[error]     at edu.illinois.cs.cogcomp.annotation.Annotator.getView(Annotator.java:166)
[error]     at edu.illinois.cs.cogcomp.core.datastructures.textannotation.TextAnnotation.addView(TextAnnotation.java:109)
[error]     at edu.illinois.cs.cogcomp.curator.CuratorAnnotatorService.addView(CuratorAnnotatorService.java:257)
[error]     at edu.illinois.cs.cogcomp.curator.CuratorAnnotatorService.addView(CuratorAnnotatorService.java:255)
[error]     at edu.illinois.cs.cogcomp.curator.CuratorAnnotatorService.createAnnotatedTextAnnotation(CuratorAnnotatorService.java:205)
[error]     at org.allenai.ari.solvers.textilp.utils.AnnotationUtils.annotateWithCuratorAndSaveUnderName(AnnotationUtils.scala:162)
[error]     at org.allenai.ari.solvers.textilp.utils.AnnotationUtils.annotateWithEverything(AnnotationUtils.scala:529)
[error]     at org.allenai.ari.solvers.textilp.ExperimentsApp$$anonfun$cacheOnDisk$1$1.apply(ExperimentsApp.scala:3140)
[error]     at org.allenai.ari.solvers.textilp.ExperimentsApp$$anonfun$cacheOnDisk$1$1.apply(ExperimentsApp.scala:3132)
[error]     at scala.collection.immutable.List.foreach(List.scala:381)
[error]     at org.allenai.ari.solvers.textilp.ExperimentsApp$.cacheOnDisk$1(ExperimentsApp.scala:3132)
[error]     at org.allenai.ari.solvers.textilp.ExperimentsApp$.main(ExperimentsApp.scala:3144)
[error]     at org.allenai.ari.solvers.textilp.ExperimentsApp.main(ExperimentsApp.scala)
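
One way an error like this can arise is a tokenization mismatch, though the trace alone does not confirm it. Suppose the remote service splits text more finely than the local tokenizer, and its labels are aligned to local tokens by character offset. Two remote labels can then land on the same local token, and a view that allows one label per span rejects the second. The sketch below illustrates that scenario in plain Java. All class and method names are hypothetical stand-ins, not the cogcomp-nlp API.

```java
import java.util.HashMap;
import java.util.Map;

public class AlignmentSketch {
    /** Hypothetical stand-in for a view that allows only one label per token span. */
    static class OneLabelPerSpanView {
        private final Map<String, String> labels = new HashMap<>();

        void addLabel(int start, int end, String label) {
            String key = "[" + start + ", " + end + "]";
            if (labels.containsKey(key)) {
                throw new IllegalArgumentException("Span " + key + " already labeled.");
            }
            labels.put(key, label);
        }
    }

    /** Returns the index of the token whose [start, end) char range contains charOffset, or -1. */
    static int tokenIndexAt(int[][] tokenCharSpans, int charOffset) {
        for (int i = 0; i < tokenCharSpans.length; i++) {
            if (charOffset >= tokenCharSpans[i][0] && charOffset < tokenCharSpans[i][1]) {
                return i;
            }
        }
        return -1;
    }

    /** Aligns char-offset labels to local tokens; returns the error message, or null on success. */
    static String align(int[][] localTokens, int[][] remoteLabelSpans, String[] remoteLabels) {
        OneLabelPerSpanView view = new OneLabelPerSpanView();
        try {
            for (int i = 0; i < remoteLabelSpans.length; i++) {
                // Map each remote label to the local token containing its first character.
                int tok = tokenIndexAt(localTokens, remoteLabelSpans[i][0]);
                view.addLabel(tok, tok + 1, remoteLabels[i]);
            }
            return null;
        } catch (IllegalArgumentException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        // Text: "don't go". Assumed local tokens: ["don't", "go"]; assumed remote tokens: ["do", "n't", "go"].
        int[][] local = {{0, 5}, {6, 8}};
        int[][] remote = {{0, 2}, {2, 5}, {6, 8}};
        String[] tags = {"VB", "RB", "VB"};
        System.out.println(align(local, remote, tags)); // prints "Span [0, 1] already labeled."
    }
}
```

When both sides agree on token boundaries, each label maps to a distinct token and `align` returns `null`.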
mssammon commented 7 years ago

Does this happen for all inputs, or just for some? Is this a change in behavior for data that you processed successfully before?

danyaljj commented 7 years ago

It happens for many inputs. And I think it started 2-3 weeks ago.

danyaljj commented 7 years ago

I'll dig deeper into the issue and post more details.

danyaljj commented 7 years ago

Updated the error msg.

danyaljj commented 7 years ago

It went away after I started using tokenized text as input to curator.

mssammon commented 7 years ago

You mean whitespace-tokenized? Which curator server are you using? -- I set up a second curator instance on a different host/port that accepts tokenized text; are you referring to that?

mssammon commented 7 years ago

-- this new curator went online just two or three days ago though

danyaljj commented 7 years ago

I think the old curator also supports it? I'm using the old one. I turned this on and the issue went away.

mssammon commented 7 years ago

This worries me. This flag is meant to be used if you send curator whitespace-tokenized text, to force it not to re-tokenize already tokenized text. Since char offsets can't be easily preserved, this was not used much (as I recall, anyway). Technically we could use this with the StringTransformation object to track/restore original token char offsets. But I don't see why you would have problems with this flag set to 'false', unless the tokenizer is disabled. If you call the austen/9011 curator (the modified one), you will get an error if you don't call it with a TextAnnotation with token and sentence views, b/c I had to disable the local tokenization. But trollope/9010 should work just fine -- nothing should be changed there.
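
To make the char-offset concern concrete, here is a minimal sketch of recording each token's original character span while whitespace-tokenizing. The recorded spans are what would be needed to map annotations on the tokenized string back to the original text. This is plain Java with hypothetical names. It does not use the StringTransformation class mentioned above and makes no claim about how curator handles the flag.

```java
import java.util.ArrayList;
import java.util.List;

public class OffsetSketch {
    /** Splits text on whitespace and records the [start, end) char offsets of each token. */
    static List<int[]> whitespaceTokenOffsets(String text) {
        List<int[]> spans = new ArrayList<>();
        int i = 0;
        int n = text.length();
        while (i < n) {
            // Skip any run of whitespace before the next token.
            while (i < n && Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            int start = i;
            // Consume the token itself.
            while (i < n && !Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            if (i > start) {
                spans.add(new int[]{start, i});
            }
        }
        return spans;
    }

    /** Rebuilds the single-space-joined string that would be sent as "tokenized text". */
    static String joinTokens(String text, List<int[]> spans) {
        StringBuilder sb = new StringBuilder();
        for (int[] span : spans) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(text, span[0], span[1]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String original = "Hello   world !";
        List<int[]> spans = whitespaceTokenOffsets(original);
        System.out.println(joinTokens(original, spans)); // prints "Hello world !"
        for (int[] s : spans) {
            System.out.println(s[0] + "-" + s[1]); // prints 0-5, 8-13, 14-15 on separate lines
        }
    }
}
```

The joined string has different offsets from the original whenever the source contains runs of whitespace, which is why the spans have to be kept alongside it.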