emorynlp / nlp4j-tokenization

Tokenize raw texts into tokens and sentences.

Malformed contractions not being split #10

Open cakelly opened 8 years ago

cakelly commented 8 years ago

I am working on a comparison of tokenizers for microblog texts, and am finding issues with nlp4j 1.1.3 (from http://nlp.mathcs.emory.edu/nlp4j/nlp4j-appassembler-1.1.3.tgz).

This version of the tokenizer works nicely on things like dont, gonna and gotta, e.g.:

10  gon gon VBG _   7   ccomp   _   O
11  na  to  TO  _   12  aux _   O

Others, not so nicely: e.g. ill, theres and its (see below). The same goes for im, shes, and my favorite, ima ("I'm going to").

1   So  so  RB  _   2   advmod  _   O
2   **ill** ill JJ  pos2=UH 3   advmod  _   O
3   decide  decide  VB  pos2=VBP    0   root    _   O

1   **Theres**  theres  RB  pos2=NNS    8   advmod  _   O
2   like    like    IN  pos2=UH 7   prep    _   O
3   7   0   CD  _   4   nmod    _   U-CARDINAL
4   fires   fire    NNS _   2   pobj    _   O

1   **its** its PRP$    _   3   poss    _   O
2   been    be  VBN _   0   root    _   O
3   years   years   NNS pos2=RB 2   attr    _   O

Another example is outta, which really means "out of". It might nicely be kept as a single token with part of speech IN, but nlp4j currently tags it as JJ (other taggers keep it as one token and tag it IN). And "y'all", which I suggest keeping as a single token PRP, as nlp4j currently does.
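
To make the requested behavior concrete, here is a rough Python sketch of a lookup-based splitter for apostrophe-less contractions. It is not nlp4j code: the names `SPLITS`, `KEEP_WHOLE`, `split_token` and `tokenize` are hypothetical, the table only covers the forms mentioned above, and the split points (e.g. do + nt) are an assumption modeled on the gon + na output. Forms like ill and its are genuinely ambiguous ("ill" as in sick, possessive "its"), so a real fix would likely need context rather than a plain table.

```python
# Hypothetical table: apostrophe-less contraction -> pieces to split into.
# Split points are assumptions, chosen to mirror the "gon" + "na" output above.
SPLITS = {
    "dont": ["do", "nt"],
    "gonna": ["gon", "na"],
    "gotta": ["got", "ta"],
    "im": ["i", "m"],
    "ill": ["i", "ll"],        # ambiguous: could also be the adjective "ill"
    "shes": ["she", "s"],
    "theres": ["there", "s"],
    "its": ["it", "s"],        # ambiguous: could also be possessive "its"
    "ima": ["i", "m", "a"],
}

# Forms that should stay as a single token (listed only to make that explicit).
KEEP_WHOLE = {"outta", "y'all"}


def split_token(token):
    """Split one whitespace-delimited token, preserving its original casing."""
    key = token.lower()
    if key in KEEP_WHOLE or key not in SPLITS:
        return [token]
    pieces, start = [], 0
    for part in SPLITS[key]:
        # Slice the original token so "Theres" yields "There" + "s".
        pieces.append(token[start:start + len(part)])
        start += len(part)
    return pieces


def tokenize(text):
    """Whitespace tokenization followed by contraction splitting."""
    tokens = []
    for tok in text.split():
        tokens.extend(split_token(tok))
    return tokens
```

For example, `tokenize("Theres like 7 fires")` would give `There`, `s`, `like`, `7`, `fires`, while `outta` and `y'all` pass through unchanged.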

Finally,