Status: Open. farazbhinder opened this issue 7 years ago.
The Estonian Wikipedia document that causes this issue is https://et.wikipedia.org/wiki/?curid=16992.
This document, as stored in the MongoDB collection, contains special Unicode LSEP (line separator, U+2028) characters, and I think they are causing this problem. While debugging I found that in AllLanguagesTokenizer.java, the iterator created in the line
FSIterator tokIt = jcas.getAnnotationIndex(Token.type).iterator();
in the sentenceTokenize function does not return correct indices for the tokens.
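For reference, here is a minimal debugging sketch of that iteration, assuming HeidelTime's de.unihd.dbs.uima.types.heideltime.Token type and the standard UIMA JCas/FSIterator API; the surrounding class and method names are illustrative only:

```java
import org.apache.uima.cas.FSIterator;
import org.apache.uima.jcas.JCas;
import de.unihd.dbs.uima.types.heideltime.Token;

public class TokenOffsetDebug {
    // Print every Token's begin/end offsets so incorrect values stand out.
    public static void dumpTokenOffsets(JCas jcas) {
        FSIterator tokIt = jcas.getAnnotationIndex(Token.type).iterator();
        while (tokIt.hasNext()) {
            Token t = (Token) tokIt.next();
            // For the first token "Tallinna" the expected offsets are
            // begin=0, end=8; the observed values are begin=-1, end=10.
            System.out.println("begin=" + t.getBegin() + " end=" + t.getEnd());
        }
    }
}
```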
For instance, the first word of the Estonian wiki document 16992 is "Tallinna". When the Token t is assigned in the first iteration of the while loop, i.e. while (tokIt.hasNext()), by the statement
t = (Token) tokIt.next();
t should have begin 0 and end 8, but it instead has begin -1 and end 10.
The text of document 16992 is attached as e1.txt.
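As a possible workaround (my assumption, not something HeidelTime itself does), the LSEP characters could be replaced with a plain newline before the text is handed to the tokenizer; because each replacement is a single character, all character offsets stay unchanged:

```java
public class LsepNormalizer {
    // Replace U+2028 (LINE SEPARATOR) and U+2029 (PARAGRAPH SEPARATOR) with '\n'.
    // Each replacement is char-for-char, so annotation offsets remain valid.
    public static String normalizeLineSeparators(String docText) {
        return docText.replace('\u2028', '\n').replace('\u2029', '\n');
    }
}
```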
Hi,
While running HeidelTime on the Estonian Wikipedia dump dated 2017-10-20 (available at https://dumps.wikimedia.org/etwiki/20171020/), AllLanguagesTokenizer runs into the following exception:
Please have a look into it. Thank you.
Best regards, Faraz Ahmad