emorynlp / nlp4j

NLP framework for JVM languages.
http://emorynlp.github.io/nlp4j/

Tokenizer: split colons which follow URLs #19

Open cakelly opened 8 years ago

cakelly commented 8 years ago

A complete URL followed by a colon should be split into two tokens: the URL and the colon. For example, the input

from http://t.co/GHDZ1Bsc: CO 71 is closed

is parsed with the trailing colon attached to the URL token:

5   from    from    IN  _   3   prep    _   O
6   http://t.co/GHDZ1Bsc:   #hlink# ADD pos2=NNP    11  nmod    _   O
7   CO  co  NNP pos2=IN 11  nmod    _   B-FAC
8   71  0   CD  _   7   nmod    _   L-FAC
9   is  be  VBZ _   10  auxpass _   O
10  closed  close   VBN pos2=JJ 11  nmod    _   O

Other parsers correctly split off the second ":" (the one trailing the URL) as a separate token.
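
To make the expected behavior concrete, here is a minimal sketch in Java of the split being requested. It is not nlp4j's tokenizer and does not use its API: the class `UrlColonSplitter`, its methods, and the URL pattern are all hypothetical names made up for illustration, and the whitespace pre-split is a simplifying assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of nlp4j.
final class UrlColonSplitter {
    // Assumed pattern: a scheme-prefixed URL whose final character is a colon.
    // A colon followed by a port (e.g. ":8080") does not match, because the
    // token then ends in a digit rather than a colon.
    private static final Pattern URL_WITH_TRAILING_COLON =
            Pattern.compile("(?:https?|ftp)://\\S+:");

    // Splits one whitespace-delimited token into the URL and the trailing
    // colon when the pattern matches; otherwise returns the token unchanged.
    static List<String> splitToken(String token) {
        List<String> out = new ArrayList<>();
        if (URL_WITH_TRAILING_COLON.matcher(token).matches()) {
            out.add(token.substring(0, token.length() - 1));
            out.add(":");
        } else {
            out.add(token);
        }
        return out;
    }

    // Naive whitespace tokenization, then the URL/colon split on each piece.
    static List<String> tokenize(String text) {
        List<String> out = new ArrayList<>();
        for (String raw : text.trim().split("\\s+")) {
            if (!raw.isEmpty()) {
                out.addAll(splitToken(raw));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [from, http://t.co/GHDZ1Bsc, :, CO, 71, is, closed]
        System.out.println(tokenize("from http://t.co/GHDZ1Bsc: CO 71 is closed"));
    }
}
```

With this split, the URL would become its own token and the colon a separate punctuation token, which is the output the example sentence above is expected to produce.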