dkpro / dkpro-core

Collection of software components for natural language processing (NLP) based on the Apache UIMA framework.
https://dkpro.github.io/dkpro-core

ClearNlpSegmenter chokes on character: ’ #445

Closed reckart closed 5 months ago

reckart commented 9 years ago
Having a backtick in a string yields an exception in ClearNlpSegmenter.

I guess this should also be reported to ClearNLP, but I wanted to document it here
too so we can adapt the wrapper in case this isn't resolved at the source.

Original issue reported on code.google.com by torsten.zesch on 2014-08-07 14:03:45

reckart commented 9 years ago
Correction: This is actually not a backtick, but a special quote character.

Original issue reported on code.google.com by torsten.zesch on 2014-08-07 14:05:20

reckart commented 9 years ago
Addendum: the stack trace indicates that this might actually be a problem in SegmenterBase,
not in ClearNLP:

Caused by: java.lang.StringIndexOutOfBoundsException: String index out of range: -1
    at java.lang.String.charAt(String.java:658)
    at de.tudarmstadt.ukp.dkpro.core.api.segmentation.SegmenterBase.trim(SegmenterBase.java:262)
    at de.tudarmstadt.ukp.dkpro.core.api.segmentation.SegmenterBase.createToken(SegmenterBase.java:239)
    at de.tudarmstadt.ukp.dkpro.core.api.segmentation.SegmenterBase.createToken(SegmenterBase.java:228)
    at de.tudarmstadt.ukp.dkpro.core.clearnlp.ClearNlpSegmenter.process(ClearNlpSegmenter.java:102)
    at de.tudarmstadt.ukp.dkpro.core.api.segmentation.SegmenterBase.process(SegmenterBase.java:124)
    at de.tudarmstadt.ukp.dkpro.core.clearnlp.ClearNlpSegmenter.process(ClearNlpSegmenter.java:74)
    at org.apache.uima.analysis_component.JCasAnnotator_ImplBase.process(JCasAnnotator_ImplBase.java:48)
    at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.callAnalysisComponentProcess(PrimitiveAnalysisEngine_impl.java:385)

Original issue reported on code.google.com by torsten.zesch on 2014-08-07 14:11:33

reckart commented 9 years ago
The problem is in line 95 of ClearNlpSegmenter:
tBegin = aText.indexOf(token, tBegin);
This yields -1 because the underlying segmenter returns a string in which ’ has
been changed to ', so the token cannot be found in the original text.
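
To make the failure mode concrete, here is a minimal sketch in plain Java with no UIMA or ClearNLP dependency. The class name, the helper `locate`, and the token list are hypothetical; the token list only imitates a tokenizer that has replaced ’ with '. It is not the wrapper's actual code, just the same `indexOf` lookup pattern.

```java
import java.util.Arrays;
import java.util.List;

class IndexOfMismatchDemo {
    // Looks up each token in the original text, left to right.
    // A begin offset of -1 means the token does not occur in the text.
    static int[] locate(String text, List<String> tokens) {
        int[] begins = new int[tokens.size()];
        int cursor = 0;
        for (int i = 0; i < tokens.size(); i++) {
            String token = tokens.get(i);
            int begin = text.indexOf(token, cursor);
            begins[i] = begin;
            if (begin >= 0) {
                cursor = begin + token.length();
            }
        }
        return begins;
    }

    public static void main(String[] args) {
        // Original text contains U+2019 (right single quotation mark).
        String text = "It\u2019s fine";
        // Hypothetical tokenizer output in which U+2019 became an ASCII apostrophe.
        List<String> tokens = Arrays.asList("It", "'s", "fine");
        // prints [0, -1, 5]
        System.out.println(Arrays.toString(locate(text, tokens)));
    }
}
```

The -1 for the second token is the value that later ends up as the begin of the span handed to trim.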

Any ideas how we could catch that in the wrapper?

Original issue reported on code.google.com by torsten.zesch on 2014-08-07 14:20:21

reckart commented 9 years ago
The segmenter must not change the text ;)

Original issue reported on code.google.com by richard.eckart on 2014-08-07 14:21:09

reckart commented 9 years ago
So I will file an issue upstream.

Original issue reported on code.google.com by torsten.zesch on 2014-08-07 14:23:56

reckart commented 9 years ago
Actually I don't understand how a segmenter can change the text. In those instances
that I see, trim is called like this:

trim(aJCas.getDocumentText(), span);

Original issue reported on code.google.com by richard.eckart on 2014-08-07 14:30:24

reckart commented 9 years ago
The span in this call is already [-1,1], as the output from the underlying segmenter
could not be found in the original string (line 95).

Original issue reported on code.google.com by torsten.zesch on 2014-08-07 14:33:03

reckart commented 9 years ago
For the upstream tokenization code, it may be acceptable that the tokenizer also normalizes.
But if that is the case, you could consider this:

- Create a normalizer component that uses the tokenizer only to update the text and which
  adds no segmentation.
- Create a segmenter component that runs after the normalizer and which adds the token
  boundaries *assuming* that the tokenizer does not try to normalize already normalized
  text.

Theoretically, you could wrap both steps into one normalizer/segmenter component, but...
maybe not so nice.
... If you care to go that way, drop me a note, because I have a nice CAS-multiplier-based
normalization base class lying around that has not yet made it into DKPro Core.
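
The two-step idea above can be sketched in plain Java, without UIMA and without the CAS-multiplier base class mentioned, which is not shown in this thread. Everything here is hypothetical: `normalize` stands in for whatever the upstream tokenizer does to the text, and `segment` is a simple whitespace splitter standing in for the real segmenter. The point is only that the offsets are computed against the already normalized text, so every lookup succeeds.

```java
import java.util.ArrayList;
import java.util.List;

class NormalizeThenSegment {
    // Step 1 (assumed normalization): map U+2019 to an ASCII apostrophe.
    static String normalize(String text) {
        return text.replace('\u2019', '\'');
    }

    // Step 2: whitespace segmentation, returning [begin, end) offsets
    // into the text that was passed in (the normalized text).
    static List<int[]> segment(String text) {
        List<int[]> spans = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            while (i < text.length() && Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            int begin = i;
            while (i < text.length() && !Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            if (i > begin) {
                spans.add(new int[] {begin, i});
            }
        }
        return spans;
    }

    public static void main(String[] args) {
        String normalized = normalize("Andre\u2019s book");
        for (int[] span : segment(normalized)) {
            System.out.println(span[0] + "-" + span[1] + " "
                    + normalized.substring(span[0], span[1]));
        }
    }
}
```

In this arrangement the normalized string replaces the document text for everything downstream, which is the trade-off of the approach: offsets no longer refer to the original input.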

Otherwise, it would be good if the upstream tokenizer allowed some way of getting the
original offsets corresponding to the normalized tokens.
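
Absent such support upstream, a wrapper could try to recover the original offsets itself. The following is a rough sketch under a strong assumption: the normalization replaces characters one for one, so offsets in a folded copy of the text equal offsets in the original. The class `OffsetAligner`, the method `fold`, and the set of folded characters are hypothetical; which characters ClearNLP actually rewrites is not stated in this thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class OffsetAligner {
    // Assumed equivalence class: several quote-like characters fold to '.
    // Each replacement is char-for-char, so string length is preserved.
    static String fold(String s) {
        return s.replace('\u2019', '\'')   // right single quotation mark
                .replace('\u2018', '\'')   // left single quotation mark
                .replace('\u00B4', '\'')   // acute accent
                .replace('`', '\'');       // backtick
    }

    // Returns [begin, end) offsets into the ORIGINAL text for each token.
    static List<int[]> align(String original, List<String> tokens) {
        String folded = fold(original);
        List<int[]> spans = new ArrayList<>();
        int cursor = 0;
        for (String token : tokens) {
            String foldedToken = fold(token);
            int begin = folded.indexOf(foldedToken, cursor);
            if (begin < 0) {
                // Fail with a clear message instead of passing -1 on.
                throw new IllegalStateException(
                        "Token [" + token + "] not found at or after offset " + cursor);
            }
            int end = begin + foldedToken.length();
            spans.add(new int[] {begin, end});
            cursor = end;
        }
        return spans;
    }

    public static void main(String[] args) {
        String original = "Andre\u2019s book";
        // Hypothetical normalized tokenizer output.
        List<String> tokens = Arrays.asList("Andre", "'s", "book");
        for (int[] span : align(original, tokens)) {
            System.out.println(span[0] + "-" + span[1] + " "
                    + original.substring(span[0], span[1]));
        }
    }
}
```

This breaks down as soon as the tokenizer inserts, deletes, or expands characters, in which case a real alignment or upstream offset support would be needed.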

Original issue reported on code.google.com by richard.eckart on 2014-08-07 14:38:17

reckart commented 9 years ago
I will probably not have time to go the long way with this issue in the near future.

I filed an issue upstream.

Original issue reported on code.google.com by torsten.zesch on 2014-08-07 14:41:08

reckart commented 9 years ago
https://github.com/clearnlp/clearnlp/issues/5

Original issue reported on code.google.com by richard.eckart on 2014-08-14 10:09:14

reckart commented 8 years ago

Upstream issue says this is fixed for version 3 of ClearNLP: https://github.com/clearnlp/clearnlp/issues/5#issuecomment-75635685

I just hit this problem as well when tokenizing the string:

Andre´

reckart commented 5 months ago

ClearNlpSegmenter is no longer supported upstream. Will be dropped here.