This version of the nlp4j tokenizer works nicely on things like dont, gonna and gotta, e.g.:
10 gon gon VBG _ 7 ccomp _ O
11 na to TO _ 12 aux _ O
Others are handled less well, e.g. ill, theres and its (see below), as well as im, shes, and my favorite, ima ("I'm going to").
1 So so RB _ 2 advmod _ O
2 **ill** ill JJ pos2=UH 3 advmod _ O
3 decide decide VB pos2=VBP 0 root _ O

1 **Theres** theres RB pos2=NNS 8 advmod _ O
2 like like IN pos2=UH 7 prep _ O
3 7 0 CD _ 4 nmod _ U-CARDINAL
4 fires fire NNS _ 2 pobj _ O

1 **its** its PRP$ _ 3 poss _ O
2 been be VBN _ 0 root _ O
3 years years NNS pos2=RB 2 attr _ O
Another example is outta, which really means "out of". It might nicely be kept as a single token with part of speech IN, but nlp4j currently tags it as JJ (other taggers keep it as one token and tag it as IN). There is also "y'all", which I suggest keeping as a single token tagged PRP, as nlp4j currently does.
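To make the suggestion concrete, here is a minimal sketch of an exception table for apostrophe-less contractions. It is not how nlp4j works internally; the names `SPLITS`, `KEEP_WHOLE`, `split_token` and `tokenize` are hypothetical, and every split except gon/na (taken from the output above) is an illustrative assumption. ill and its are deliberately left out, since they are also ordinary words and cannot be resolved by lookup alone.

```python
# Hypothetical exception table: lowercase surface form -> pieces.
# Only "gon" + "na" mirrors the tagger output above; the other splits
# are illustrative assumptions, not nlp4j behavior.
SPLITS = {
    "dont": ["do", "nt"],
    "gonna": ["gon", "na"],
    "gotta": ["got", "ta"],
    "im": ["i", "m"],
    "shes": ["she", "s"],
    "theres": ["there", "s"],
}

# Forms suggested above as single tokens.
KEEP_WHOLE = {"outta", "y'all"}


def split_token(token):
    """Split one whitespace-delimited token, preserving its original casing."""
    key = token.lower()
    if key in KEEP_WHOLE or key not in SPLITS:
        return [token]
    pieces = []
    start = 0
    for part in SPLITS[key]:
        # Slice the original token so that "Theres" yields "There" + "s".
        pieces.append(token[start:start + len(part)])
        start += len(part)
    return pieces


def tokenize(text):
    """Whitespace-tokenize, then apply the exception table to each token."""
    return [piece for tok in text.split() for piece in split_token(tok)]
```

For instance, `tokenize("Theres like 7 fires")` splits only the first token, while `split_token("ill")` returns the token unchanged because the ambiguous forms need context that a lookup table does not have.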
I am working on a comparison of tokenizers for microblog texts, and am finding issues with nlp4j 1.1.3 (from http://nlp.mathcs.emory.edu/nlp4j/nlp4j-appassembler-1.1.3.tgz).
Finally,