Closed by reckart 5 months ago
Correction: This is actually not a backtick, but a special quote character.
Original issue reported on code.google.com by torsten.zesch
on 2014-08-07 14:05:20
Addendum: the stack trace indicates that this might actually be a problem in SegmenterBase,
not in ClearNLP.
Caused by: java.lang.StringIndexOutOfBoundsException: String index out of range: -1
at java.lang.String.charAt(String.java:658)
at de.tudarmstadt.ukp.dkpro.core.api.segmentation.SegmenterBase.trim(SegmenterBase.java:262)
at de.tudarmstadt.ukp.dkpro.core.api.segmentation.SegmenterBase.createToken(SegmenterBase.java:239)
at de.tudarmstadt.ukp.dkpro.core.api.segmentation.SegmenterBase.createToken(SegmenterBase.java:228)
at de.tudarmstadt.ukp.dkpro.core.clearnlp.ClearNlpSegmenter.process(ClearNlpSegmenter.java:102)
at de.tudarmstadt.ukp.dkpro.core.api.segmentation.SegmenterBase.process(SegmenterBase.java:124)
at de.tudarmstadt.ukp.dkpro.core.clearnlp.ClearNlpSegmenter.process(ClearNlpSegmenter.java:74)
at org.apache.uima.analysis_component.JCasAnnotator_ImplBase.process(JCasAnnotator_ImplBase.java:48)
at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.callAnalysisComponentProcess(PrimitiveAnalysisEngine_impl.java:385)
Original issue reported on code.google.com by torsten.zesch
on 2014-08-07 14:11:33
The problem is in line 95 of ClearNlpSegmenter:
tBegin = aText.indexOf(token, tBegin);
This yields -1 because the underlying segmenter returns a string in which ’ has
been changed to ', so the token can no longer be found in the original text.
Any ideas how we could catch that in the wrapper?
Original issue reported on code.google.com by torsten.zesch
on 2014-08-07 14:20:21
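To make the failure concrete, here is a minimal sketch of what happens when a token returned by the tokenizer differs from the original text. The class `TokenOffsetDemo` and the helper `locate` are hypothetical names used only for illustration, and the sample text and token are assumptions; this is not the actual ClearNlpSegmenter code. The guard simply reports the mismatch instead of passing -1 on to `createToken`.

```java
class TokenOffsetDemo {
    /**
     * Hypothetical helper: returns {begin, end} of the token in the text,
     * searching from the given offset, or null if the token is not found.
     */
    static int[] locate(String text, String token, int from) {
        int begin = text.indexOf(token, from);
        if (begin < 0) {
            // The tokenizer changed the token, so no valid span exists.
            return null;
        }
        return new int[] { begin, begin + token.length() };
    }

    public static void main(String[] args) {
        // Original text contains U+2019 (right single quotation mark).
        String text = "Andre\u2019s book";
        // Assumed tokenizer output: the quote was replaced by an ASCII apostrophe.
        String normalizedToken = "'s";

        // Unguarded lookup, as in the line quoted above: yields -1.
        System.out.println(text.indexOf(normalizedToken, 0));

        // Guarded lookup: the mismatch is detected before a span is built.
        int[] span = locate(text, normalizedToken, 0);
        System.out.println(span == null ? "token not found in original text"
                                        : span[0] + "-" + span[1]);
    }
}
```

A guard like this only turns the exception into a detectable condition; it does not recover the offsets of the altered token.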
The segmenter must not change the text ;)
Original issue reported on code.google.com by richard.eckart
on 2014-08-07 14:21:09
So I will file an issue upstream.
Original issue reported on code.google.com by torsten.zesch
on 2014-08-07 14:23:56
Actually I don't understand how a segmenter can change the text. In those instances
that I see, trim is called like this:
trim(aJCas.getDocumentText(), span);
Original issue reported on code.google.com by richard.eckart
on 2014-08-07 14:30:24
The span in this call is already [-1,1], as the output from the underlying segmenter
could not be found in the original string (line 95).
Original issue reported on code.google.com by torsten.zesch
on 2014-08-07 14:33:03
For the upstream tokenization code, it may be acceptable that the tokenizer also normalizes.
But if that is the case, you could consider this:
Create a normalizer component that uses the tokenizer only to update the text and which
adds no segmentation.
Create a segmenter component that runs after the normalizer and which adds the token
boundaries *assuming* that the tokenizer does not try to normalize already normalized
text.
Theoretically, you could wrap both steps into one normalizer/segmenter component, but...
maybe not so nice.
... if you care to go that way, drop me a note, because I have a nice CAS-multiplier-based
normalization base class lying around that still hasn't made it into DKPro Core.
Otherwise, it would be good if the upstream tokenizer allowed some way of getting the
original offsets corresponding to the normalized tokens.
Original issue reported on code.google.com by richard.eckart
on 2014-08-07 14:38:17
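As a rough illustration of recovering original offsets for normalized tokens inside the wrapper, the sketch below matches tokens against the text while treating certain quote characters as equivalent. Everything here is an assumption: the class `NormalizedTokenAligner`, the `fold` mapping, and the premise that normalization replaces characters one-to-one (so token length is preserved). It is not part of DKPro Core or ClearNLP, and it would not work if the tokenizer inserts or deletes characters.

```java
import java.util.ArrayList;
import java.util.List;

class NormalizedTokenAligner {
    /** Assumed normalization: typographic quotes are folded to ASCII quotes. */
    static char fold(char c) {
        switch (c) {
            case '\u2018': case '\u2019': case '\u00B4': case '`':
                return '\'';
            case '\u201C': case '\u201D':
                return '"';
            default:
                return c;
        }
    }

    /** Like String.indexOf, but compares characters after folding. */
    static int indexOfFolded(String text, String token, int from) {
        for (int i = Math.max(from, 0); i + token.length() <= text.length(); i++) {
            int j = 0;
            while (j < token.length()
                    && fold(text.charAt(i + j)) == fold(token.charAt(j))) {
                j++;
            }
            if (j == token.length()) {
                return i;
            }
        }
        return -1;
    }

    /** Maps each token to a {begin, end} span in the original text. */
    static List<int[]> align(String text, List<String> tokens) {
        List<int[]> spans = new ArrayList<>();
        int cursor = 0;
        for (String token : tokens) {
            int begin = indexOfFolded(text, token, cursor);
            if (begin < 0) {
                // Fail with a clear message instead of creating a span at -1.
                throw new IllegalStateException("Cannot align token: " + token);
            }
            int end = begin + token.length();
            spans.add(new int[] { begin, end });
            cursor = end;
        }
        return spans;
    }
}
```

With the text `Andre’s book` and the tokens `Andre`, `'s`, `book`, `align` returns spans that point into the unmodified document text, so the annotations would cover the original ’ rather than the normalized '.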
I will probably not have time to go the long way with this issue in the near future.
I filed an issue upstream.
Original issue reported on code.google.com by torsten.zesch
on 2014-08-07 14:41:08
(No text was entered with this change)
Original issue reported on code.google.com by richard.eckart
on 2014-08-14 10:08:23
https://github.com/clearnlp/clearnlp/issues/5
Original issue reported on code.google.com by richard.eckart
on 2014-08-14 10:09:14
Upstream issue says this is fixed for version 3 of ClearNLP: https://github.com/clearnlp/clearnlp/issues/5#issuecomment-75635685
I also just hit this problem when tokenizing the string
Andre´
ClearNlpSegmenter is no longer supported upstream. Will be dropped here.
Original issue reported on code.google.com by torsten.zesch
on 2014-08-07 14:03:45