This issue involves HTML-encoded characters such as `&amp;` and `&lt;`, which are split into separate tokens. Other tokenizers do not have this problem. For example, the following text
**I'd like this &amp; that &lt; the other**
is parsed as
4 this this DT _ 3 dobj _ O
5 & & CC _ 4 cc _ O
6 amp amp NN _ 4 conj _ O
7 ; ; , pos2=: 3 punct _ O
8 that that DT pos2=WDT 3 dep _ O
9 & & CC pos2=NFP 8 cc _ O
10 lt lt JJ pos2=NN 8 conj _ O
11 ; ; : pos2=, 13 punct _ O
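
A possible workaround until this is fixed is to decode the entities before the text reaches the tokenizer. The sketch below is only an illustration in Python using the standard library's `html.unescape`; the helper names `decode_entities` and `expected_tokens` are hypothetical and are not part of NLP4J, and the whitespace split stands in for a real tokenizer just to show the tokens one would expect for the example above.

```python
import html


def decode_entities(text):
    """Hypothetical pre-processing step: replace HTML character
    references (e.g. &amp;, &lt;) with the characters they encode."""
    return html.unescape(text)


def expected_tokens(text):
    """Hypothetical stand-in for a tokenizer: decode entities, then
    split on whitespace. A real tokenizer would also handle
    contractions and punctuation."""
    return decode_entities(text).split()


if __name__ == "__main__":
    sample = "I'd like this &amp; that &lt; the other"
    print(decode_entities(sample))
    print(expected_tokens(sample))
```

With this pre-processing, `&amp;` and `&lt;` each reach the tokenizer as a single character, so there is no `amp` or `lt` fragment left to be tagged as a separate word.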
[This issue was imported from https://github.com/emorynlp/nlp4j-tokenization/issues/9] I am working on a comparison of tokenizers for microblog texts and am finding issues with nlp4j 1.1.3 (from http://nlp.mathcs.emory.edu/nlp4j/nlp4j-appassembler-1.1.3.tgz).