emorynlp / nlp4j

NLP framework for JVM languages.
http://emorynlp.github.io/nlp4j/

Tokens with fancy quotes are being merged #16

Open cakelly opened 8 years ago

cakelly commented 8 years ago

I am working on a comparison of tokenizers for microblog texts and am finding issues with NLP4J 1.1.3 (from http://nlp.mathcs.emory.edu/nlp4j/nlp4j-appassembler-1.1.3.tgz).

The first involves texts with fancy (curly) quotes, e.g. [ “@DevTheBarbie: ] | [ #Colorado’s ], which are being lumped into the same token as the Twitter tokens they precede or follow. The ’s in "#Colorado’s" is a possessive and should be a separate token. The same goes for the opening “ in "“@DevTheBarbie:".

The online demo (http://nlp.mathcs.emory.edu:8080/nlp4j/NLP4JServlet) is handling these correctly, however.
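To make the expected splits concrete, here is a minimal sketch in Java. It is not NLP4J's tokenizer code; the class name `FancyQuoteSplitter` and the method `split` are hypothetical, and the rules cover only the two cases reported above (a leading opening curly quote and a trailing possessive). Other punctuation, such as the trailing colon, is deliberately left untouched because the report does not say how it should be handled.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper illustrating the expected token boundaries.
// It does not use or reproduce any NLP4J API.
public class FancyQuoteSplitter {
    // Opening curly quotes: U+201C (double) and U+2018 (single).
    private static final String OPENING_QUOTES = "\u201C\u2018";

    public static List<String> split(String token) {
        List<String> out = new ArrayList<>();
        int start = 0;

        // Peel off leading opening quotes, one token per quote character,
        // as long as something other than the quote remains.
        while (start < token.length() - 1
                && OPENING_QUOTES.indexOf(token.charAt(start)) >= 0) {
            out.add(String.valueOf(token.charAt(start)));
            start++;
        }

        String rest = token.substring(start);

        // Split a trailing possessive written with a curly (U+2019)
        // or straight apostrophe.
        boolean possessive = rest.endsWith("\u2019s") || rest.endsWith("'s");
        if (possessive && rest.length() > 2) {
            out.add(rest.substring(0, rest.length() - 2));
            out.add(rest.substring(rest.length() - 2));
        } else {
            out.add(rest);
        }
        return out;
    }

    public static void main(String[] args) {
        // The two examples from this report, with curly quotes as escapes.
        System.out.println(split("\u201C@DevTheBarbie:"));
        System.out.println(split("#Colorado\u2019s"));
    }
}
```

With these rules, the first example yields the opening quote and `@DevTheBarbie:` as two tokens, and the second yields `#Colorado` and `’s`.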

I'm attaching the original input files (098.txt, 103.txt) and the parses from NLP4J (098.conll.txt, 103.conll.txt).

[This issue imported from emorynlp/nlp4j-tokenization#8]