emorynlp / nlp4j

NLP framework for JVM languages.
http://emorynlp.github.io/nlp4j/

tokenizer issue: mis-split of improperly spelled words #23

Closed ezhou7 closed 7 years ago

ezhou7 commented 7 years ago

The most recent version of the tokenizer still mis-splits conversational words that have been truncated with an apostrophe (e.g. "'Cause", "sayin'"). As a result, the dependency tree of the sentence containing such a word becomes malformed:

Example 1 - The first and second nodes should be just one node, "'Cause".
64 84 1 ' ' `` speaker=Joey 6 punct O O
64 84 2 Cause cause VB pos2=IN 4 mark O O
64 84 3 it it PRP 4 nsubj O O
64 84 4 's 's VBZ 0 root O O
64 84 5 always always RB 4 advmod O O
64 84 6 somethin somethin NN pos2=JJ 4 attr O O
64 84 7 ' ' '' pos2=XX 4 punct O O
64 84 8 , , , 10 punct O O
64 84 9 you you PRP 10 nsubj O U-Unknown
64 84 10 know know VBP 4 parataxis O O
64 84 11 , , , 10 punct O O
64 84 12 like like IN pos2=UH 4 prep O O
64 84 13 Monica monica NNP 16 poss U-PERSON U-Monica
64 84 14 's 's POS 13 case O O
64 84 15 new new JJ 16 nmod O O
64 84 16 job job NN 12 pobj O O
64 84 17 , , , 4 punct O O
64 84 18 or or CC 4 cc O O
64 84 19 the the DT 24 det O O
64 84 20 whole whole JJ pos2=NN 24 nmod O O
64 84 21 Ross ross NNP 24 poss U-ORG O
64 84 22 's 's POS 21 case O O
64 84 23 birthday birthday NN 24 compound O O
64 84 24 hoopla hoopla NN 25 xcomp O O
64 84 25 . . . 0 root O O

Example 2 - The sixth and seventh nodes should be just one node, "sayin'".
37 46 1 Naa naa NNP speaker=Joey 4 vocative U-PERSON O
37 46 2 , , , 4 punct O O
37 46 3 you you PRP 4 nsubj O U-Chandler
37 46 4 're be VBP 0 root O O
37 46 5 just just RB 6 advmod O O
37 46 6 sayin sayin IN pos2=FW 4 prep O O
37 46 7 ' ' '' pos2=`` 6 punct O O
37 46 8 that that IN pos2=DT 11 mark O O
37 46 9 'cause 'cause IN pos2=UH 11 mark O O
37 46 10 you you PRP 11 nsubj O U-Chandler
37 46 11 're be VBP 4 advcl O O
37 46 12 in in IN 11 prep O O
37 46 13 love love NN 12 pobj O O
37 46 14 with with IN 13 prep O O
37 46 15 Yasmine yasmine NNP 16 compound B-PERSON B-Unknown
37 46 16 Blepe blepe NNP 14 pobj L-PERSON L-Unknown
37 46 17 . . . 4 punct _ O O
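
The merges requested in Examples 1 and 2 could be approximated by a post-processing pass over the token list. The following is only a rough sketch in Python, not NLP4J's tokenizer logic: the function name `merge_clipped_apostrophes`, the word list `LEADING_CLIPPED`, and the "ends in -in" heuristic are all assumptions made for illustration.

```python
# Hypothetical post-processing sketch; not NLP4J's actual tokenizer logic.
APOSTROPHES = {"'", "\u2019"}

# Assumed, illustrative list of words that drop their leading letters.
LEADING_CLIPPED = {"cause", "til", "em", "bout"}


def merge_clipped_apostrophes(tokens):
    """Rejoin apostrophes that were split off clipped conversational words."""
    merged = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        # Leading apostrophe: "'" + "Cause" -> "'Cause"
        if tok in APOSTROPHES and nxt is not None and nxt.lower() in LEADING_CLIPPED:
            merged.append(tok + nxt)
            i += 2
            continue
        # Trailing apostrophe: "sayin" + "'" -> "sayin'" (assumed "-in" heuristic)
        if (nxt in APOSTROPHES and tok.isalpha() and len(tok) > 3
                and tok.lower().endswith("in")):
            merged.append(tok + nxt)
            i += 2
            continue
        merged.append(tok)
        i += 1
    return merged
```

The "-in" heuristic would also merge a genuine closing quote after words such as "begin" or "within", so a real fix would presumably need a lexicon check (for instance, that the word plus "g" is a known form) rather than a suffix test alone.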

ezhou7 commented 7 years ago

Here's another error:
48 57 1 I I PRP speaker=Ross 2 nsubj O U-Ross
48 57 2 'm be VBP 0 root O O
48 57 3 sorry sorry JJ 2 acomp O O
48 57 4 , , , 7 punct O O
48 57 5 my my PRP$ 6 poss O U-Ross
48 57 6 pie pie NN 7 nsubj O O
48 57 7 was be VBD 3 ccomp O O
48 57 8 , , , 7 punct O O
48 57 9 was be VBD 7 dep O O
48 57 10 in in IN 9 prep O O
48 57 11 your your PRP$ 12 poss O U-Guy
48 57 12 hood hood NN 10 pobj O O
48 57 13 . . . 7 punct O O
48 58 1 Now now RB pos2=UH 4 advmod O O
48 58 2 I I PRP 4 nsubj O U-Ross
48 58 3 just just RB 4 advmod O O
48 58 4 have have VBP 0 root O O
48 58 5 to to TO 6 aux O O
48 58 6 get get VB 4 xcomp O O
48 58 7 the the DT 8 det O O
48 58 8 coffee coffee NN 6 dobj O O
48 58 9 out out IN pos2=RB 6 prep O O
48 58 10 of of IN 9 prep O O
49 59 1 that that DT speaker=Ross 2 det O O
49 59 2 guy guy NN 4 poss O U-Unknown
49 59 3 's 's POS 2 case O O
49 59 4 pants pant NNS 8 nsubj O O
49 59 5 and and CC 4 cc O O
49 59 6 I I PRP 4 conj O U-Ross
49 59 7 'll will MD 8 aux O O
49 59 8 be be VB 0 root O O
49 59 9 back back RB 8 advmod O O
49 59 10 in in IN 9 prep O O
49 59 11 the the DT 12 det O O
49 59 12 hospital hospital NN pos2=NNP 10 pobj O O
49 59 13 by by IN 8 prep O O
49 59 14 7. 0. . pos2=CD 8 punct U-CARDINAL O

Last node should be two nodes, one for "7" and one for ".".
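
One way to express the expected behavior is a rule that splits a sentence-final period off the last token. This Python sketch is an assumption about what such a rule might look like, not NLP4J's implementation; `split_final_period` and `TRAILING_PERIOD` are hypothetical names.

```python
import re

# Hypothetical rule, not NLP4J's implementation: a final "." fused onto a
# word or number is split off as its own token. Tokens made only of
# periods (".", "...") do not match and are left untouched.
TRAILING_PERIOD = re.compile(r"^(.*[^.])\.$")


def split_final_period(tokens):
    """Split a sentence-final period off the last token, if one is fused to it."""
    if not tokens:
        return []
    match = TRAILING_PERIOD.match(tokens[-1])
    if match is None:
        return list(tokens)
    return list(tokens[:-1]) + [match.group(1), "."]
```

Applied to a sentence ending in "7." this yields "7" and "." as separate tokens. The rule is deliberately naive: it would also split abbreviations such as "U.S." at the end of a sentence, which a real tokenizer would have to handle separately.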

ezhou7 commented 7 years ago

Another error:
18 34 11 I I PRP 12 nsubj O U-Ross
18 34 12 'm be VBP 32 ccomp O O
18 34 13 thinkin thinkin JJ pos2=XX 12 acomp O O
18 34 14 ' ' '' pos2=`` 12 punct O O
18 34 15 when when WRB 17 advmod O O
18 34 16 she she PRP 17 nsubj O U-Jane
18 34 17 sees see VBZ 24 advcl O O
18 34 18 you you PRP 17 dobj O U-Chandler

Nodes 13 and 14 should be one word.

ezhou7 commented 7 years ago

Another one:
1 1 1 Hey hey UH speaker=JOEY 4 discourse O O
1 1 2 , , , 4 punct O O
1 1 3 whaddya whaddya PRP pos2=MD 4 nsubj O O
1 1 4 wan wan VBP pos2=MD 0 root O O
1 1 5 na to TO pos2=NN 6 aux O O
1 1 6 do do VB pos2=VBP 4 xcomp O O
1 1 7 for for IN 6 prep O O
1 1 8 dinner dinner NN 7 pobj O O
1 1 9 ? ? . 4 punct _ O O

"Whaddya" should be split into two nodes.

ezhou7 commented 7 years ago

Another one:
12 17 1 Wonderfullness wonderfullness NNP pos2=NN 3 nsubjpass U-ORG O
12 17 2 is be VBZ 3 auxpass O O
12 17 3 baked bake VBN pos2=JJ 0 root O O
12 17 4 right right JJ pos2=RB 3 oprd O O
12 17 5 in. in. . pos2=NN 3 punct O O

Last node should be split into two.