HeidelTime / heideltime

A multilingual, cross-domain temporal tagger developed at the Database Systems Research Group at Heidelberg University.
GNU General Public License v3.0

AllLanguagesTokenizer might run into StringIndexOutOfBoundsException #66

Open farazbhinder opened 6 years ago

farazbhinder commented 6 years ago

Hi,

While running HeidelTime on the Estonian Wikipedia dump dated 2017-10-20 (available at https://dumps.wikimedia.org/etwiki/20171020/), AllLanguagesTokenizer fails with the following exception:

org.apache.uima.analysis_engine.AnalysisEngineProcessException: Annotator processing failed.    
    at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.callAnalysisComponentProcess(PrimitiveAnalysisEngine_impl.java:401)
    at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.processAndOutputNewCASes(PrimitiveAnalysisEngine_impl.java:309)
    at org.apache.uima.analysis_engine.impl.AnalysisEngineImplBase.process(AnalysisEngineImplBase.java:267)
    at org.apache.uima.collection.impl.cpm.engine.ProcessingUnit.processNext(ProcessingUnit.java:893)
    at org.apache.uima.collection.impl.cpm.engine.ProcessingUnit.run(ProcessingUnit.java:575)
Caused by: java.lang.StringIndexOutOfBoundsException: String index out of range: -1
    at java.lang.String.substring(String.java:1960)
    at org.apache.uima.jcas.tcas.Annotation.getCoveredText(Annotation.java:122)
    at de.unihd.dbs.uima.annotator.alllanguagestokenizer.AllLanguagesTokenizer.sentenceTokenize(AllLanguagesTokenizer.java:245)
    at de.unihd.dbs.uima.annotator.alllanguagestokenizer.AllLanguagesTokenizer.process(AllLanguagesTokenizer.java:34)
    at org.apache.uima.analysis_component.JCasAnnotator_ImplBase.process(JCasAnnotator_ImplBase.java:48)
    at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.callAnalysisComponentProcess(PrimitiveAnalysisEngine_impl.java:385)
    ... 4 more

Please have a look at it. Thank you.

Best regards, Faraz Ahmad

farazbhinder commented 6 years ago

The Estonian wiki document that causes this issue is https://et.wikipedia.org/wiki/?curid=16992. This document, as stored in the MongoDB collection, contains special Unicode LSEP (line separator, U+2028) characters, and I think they are causing this problem. While debugging, I found that in AllLanguagesTokenizer.java, the iterator created in the function sentenceTokenize by the line
FSIterator tokIt = jcas.getAnnotationIndex(Token.type).iterator();
does not return correct indices for the tokens.
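To check the LSEP hypothesis on a given document text, a small scan like the one below may help. It is only a sketch: the class LineSeparatorScan and the method findSeparators are hypothetical names introduced here and are not part of HeidelTime or UIMA. It reports the character offsets of U+2028 (line separator) and U+2029 (paragraph separator) so they can be compared with the token offsets that look wrong.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for debugging; not part of HeidelTime or UIMA.
final class LineSeparatorScan {
    static final char LSEP = 0x2028; // Unicode LINE SEPARATOR
    static final char PSEP = 0x2029; // Unicode PARAGRAPH SEPARATOR

    // Returns the character offsets of every LSEP/PSEP in the text.
    static List<Integer> findSeparators(String text) {
        List<Integer> offsets = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == LSEP || c == PSEP) {
                offsets.add(i);
            }
        }
        return offsets;
    }

    public static void main(String[] args) {
        // Made-up sample text: one LSEP directly after the first word.
        String sample = "Tallinna" + LSEP + "linn";
        System.out.println(findSeparators(sample));
    }
}
```

If the reported offsets line up with the places where token begin/end values go wrong, that would support the hypothesis. It would not by itself show where in the tokenizer the offsets get shifted.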

For instance, the first word of Estonian wiki document 16992 is Tallinna. When Token t is assigned in the first iteration of the loop while(tokIt.hasNext()) by the statement t = (Token) tokIt.next();, it should have begin 0 and end 8, but it actually has begin -1 and end 10.
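The stack trace shows the failure surfacing in String.substring, called from Annotation.getCoveredText, which is consistent with a negative begin offset. The sketch below reproduces that failure on a plain string and shows one possible defensive guard. The class OffsetGuard and the method safeCoveredText are hypothetical and are not existing HeidelTime or UIMA code, and the sample text is made up. Clamping like this only avoids the crash; it does not fix whatever produces the wrong offsets.

```java
// Hypothetical helper for illustration; not part of HeidelTime or UIMA.
final class OffsetGuard {
    // Clamps begin/end into the valid range of docText before slicing.
    // Returns an empty string if the clamped span is empty or inverted.
    static String safeCoveredText(String docText, int begin, int end) {
        int b = Math.max(0, begin);
        int e = Math.min(docText.length(), end);
        if (b >= e) {
            return "";
        }
        return docText.substring(b, e);
    }

    public static void main(String[] args) {
        String doc = "Tallinna linn"; // made-up sample text
        try {
            // Same kind of call that fails in the stack trace: begin is -1.
            doc.substring(-1, 10);
        } catch (StringIndexOutOfBoundsException ex) {
            System.out.println("substring failed: " + ex.getMessage());
        }
        // With clamping, the bad offsets no longer throw.
        System.out.println(safeCoveredText(doc, -1, 10));
        // Correct offsets for the first word are unaffected.
        System.out.println(safeCoveredText(doc, 0, 8));
    }
}
```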

The text of document 16992 is attached as e1.txt.