I am working on a comparison of tokenizers for microblog texts and am finding issues with NLP4J 1.1.3 (from http://nlp.mathcs.emory.edu/nlp4j/nlp4j-appassembler-1.1.3.tgz).
The first involves texts with fancy (curly) quotes, e.g. [ “@DevTheBarbie: ] | [ #Colorado’s ], which are being lumped into the same token as the Twitter tokens they precede or follow. The ’s in "#Colorado’s" is a possessive and should be a separate token. The same goes for the opening “ in "“@DevTheBarbie".
The online demo (http://nlp.mathcs.emory.edu:8080/nlp4j/NLP4JServlet) handles these cases correctly, however.
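To make the expected splits concrete, here is a rough Python sketch of the segmentation I would expect for the two examples. It is only an illustration and not NLP4J code: the function name `expected_tokens` and the regular expression are my own assumptions, and it covers only the curly-quote cases described above.

```python
import re

# Hypothetical illustration of the expected output; this is NOT how NLP4J
# tokenizes and it does not aim to be a complete tweet tokenizer.
TOKEN_RE = re.compile(
    r"""
    [@#]\w+      # Twitter mention or hashtag, kept as one token
  | [’']s\b      # possessive clitic (curly or straight apostrophe)
  | \w+          # ordinary word
  | \S           # any other non-space character, e.g. a curly quote
    """,
    re.VERBOSE,
)


def expected_tokens(text):
    """Return the tokens expected for a short microblog snippet."""
    return TOKEN_RE.findall(text)


print(expected_tokens("“@DevTheBarbie"))  # opening quote split from the mention
print(expected_tokens("#Colorado’s"))     # possessive split from the hashtag
```

The point is simply that the curly quote and the possessive come out as their own tokens, while the mention and the hashtag stay intact.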
I'm attaching the original input files (098.txt, 103.txt) and the parses from NLP4J (098.conll.txt, 103.conll.txt).
[This issue imported from emorynlp/nlp4j-tokenization#8]