emorynlp / nlp4j

NLP framework for JVM languages.
http://emorynlp.github.io/nlp4j/

Tokenization of html UTF-8 chars #15

Open cakelly opened 8 years ago

cakelly commented 8 years ago

[This issue was imported from https://github.com/emorynlp/nlp4j-tokenization/issues/9] I am comparing tokenizers for microblog texts and am finding issues with nlp4j 1.1.3 (from http://nlp.mathcs.emory.edu/nlp4j/nlp4j-appassembler-1.1.3.tgz).

This issue involves HTML-encoded characters such as `&amp;` and `&lt;`, each of which is split into three separate tokens (`&`, the entity name, and `;`). Other tokenizers do not have this problem. For example, the following text

**I'd like this &amp;amp; that &amp;lt; the other**

is parsed as

4   this    this    DT  _   3   dobj    _   O
5   &   &   CC  _   4   cc  _   O
6   amp amp NN  _   4   conj    _   O
7   ;   ;   ,   pos2=:  3   punct   _   O
8   that    that    DT  pos2=WDT    3   dep _   O
9   &   &   CC  pos2=NFP    8   cc  _   O
10  lt  lt  JJ  pos2=NN 8   conj    _   O
11  ;   ;   :   pos2=,  13  punct   _   O
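
To make the expected behaviour concrete, here is a minimal Python sketch of the two outcomes I would consider acceptable: either each entity stays a single token, or the text is decoded before tokenization. This is not nlp4j code; the regular expression and the helper names `tokenize_keep_entities` and `tokenize_decoded` are hypothetical and only illustrate the idea.

```python
import html
import re

# Hypothetical pattern, not taken from nlp4j. Alternatives are tried in order:
# 1. a whole HTML entity (named, decimal, or hexadecimal), kept as one token
# 2. a word, optionally with an apostrophe suffix such as "I'd"
# 3. any other single non-space character
TOKEN = re.compile(
    r"&(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);"
    r"|\w+(?:'\w+)?"
    r"|[^\w\s]"
)


def tokenize_keep_entities(text):
    """Tokenize while keeping each HTML entity as a single token."""
    return TOKEN.findall(text)


def tokenize_decoded(text):
    """Decode HTML entities first, then tokenize the plain text."""
    return TOKEN.findall(html.unescape(text))


sample = "I'd like this &amp; that &lt; the other"
print(tokenize_keep_entities(sample))
print(tokenize_decoded(sample))
```

Either result avoids the stray `amp`, `lt`, and `;` tokens shown in the parse above. Decoding first is probably the simpler workaround on the caller's side, assuming character offsets into the original text do not need to be preserved.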