Twitter usernames and hashtags that begin with a number are not parsed correctly, e.g.
RT @1310kfkanews: #1310kfkanews
is tokenized with the "@" and "#" as separate tokens.
1 RT rt NN pos2=NNP 3 nmod _ O
2 @ @ SYM pos2=IN 3 punct _ O
3 1310kfkanews 0kfkanews NN pos2=NNS 12 dep _ O
4 : : : pos2=, 12 punct _ O
5 # # NN pos2=SYM 6 compound _ U-CARDINAL
6 1310kfkanews 0kfkanews NN pos2=NNS 10 dep _ U-MONEY
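For comparison, the tokenization I would expect keeps the "@" and "#" attached to the name that follows, even when that name starts with a digit. A minimal sketch of that behaviour in Python is below. It is only an illustration, not NLP4J's API or implementation: the pattern `TOKEN_RE` and the `tokenize` helper are hypothetical names, and the regular expression is a simplification of real Twitter username and hashtag rules.

```python
import re

# Hypothetical pattern (an assumption for illustration, not NLP4J code):
#   [@#]\w+   a mention or hashtag, kept whole even with a leading digit
#   \w+       an ordinary word or number
#   [^\w\s]   any single punctuation character
TOKEN_RE = re.compile(r"[@#]\w+|\w+|[^\w\s]")


def tokenize(text):
    """Split text into tokens, keeping @mentions and #hashtags intact."""
    return TOKEN_RE.findall(text)


print(tokenize("RT @1310kfkanews: #1310kfkanews"))
# prints ['RT', '@1310kfkanews', ':', '#1310kfkanews']
```

With this sketch the example tweet yields four tokens instead of the six shown in the output above.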
I am working on a comparison of tokenizers for microblog texts, and am finding issues with nlp4j 1.1.3 (from http://nlp.mathcs.emory.edu/nlp4j/nlp4j-appassembler-1.1.3.tgz).
[This issue was previously reported as https://github.com/emorynlp/nlp4j-tokenization/issues/11]