emorynlp / nlp4j

NLP framework for JVM languages.
http://emorynlp.github.io/nlp4j/

Twitter users and hashtags with leading numbers #18

Open cakelly opened 8 years ago

cakelly commented 8 years ago

I am working on a comparison of tokenizers for microblog texts and am finding issues with nlp4j 1.1.3 (from http://nlp.mathcs.emory.edu/nlp4j/nlp4j-appassembler-1.1.3.tgz).

Twitter usernames and hashtags that begin with a number are not tokenized correctly, e.g.

RT @1310kfkanews: #1310kfkanews

is tokenized with the "@" and "#" as separate tokens.

1   RT            rt         NN   pos2=NNP   3   nmod      _   O
2   @             @          SYM  pos2=IN    3   punct     _   O
3   1310kfkanews  0kfkanews  NN   pos2=NNS   12  dep       _   O
4   :             :          :    pos2=,     12  punct     _   O
5   #             #          NN   pos2=SYM   6   compound  _   U-CARDINAL
6   1310kfkanews  0kfkanews  NN   pos2=NNS   10  dep       _   U-MONEY
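
For comparison, here is a minimal sketch of the tokenization I would expect, with the "@" or "#" kept attached to the name even when the name starts with a digit. This is a plain regex written for illustration only; the class name MentionTokenSketch and the pattern are my own assumptions and are not taken from NLP4J's tokenizer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MentionTokenSketch {
    // Hypothetical pattern (not NLP4J's): try a mention or hashtag first
    // ("@" or "#" followed by word characters, so a leading digit is fine),
    // then a plain word, then any single punctuation character.
    private static final Pattern TOKEN =
            Pattern.compile("[@#]\\w+|\\w+|[^\\w\\s]");

    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [RT, @1310kfkanews, :, #1310kfkanews]
        System.out.println(tokenize("RT @1310kfkanews: #1310kfkanews"));
    }
}
```

With this sketch the example tweet yields four tokens instead of the six shown above. A real tokenizer would need many more rules (URLs, emoticons, contractions); the point is only that a leading digit should not split the "@" or "#" from the name.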

[This issue was previously reported as https://github.com/emorynlp/nlp4j-tokenization/issues/11]