Tokenization issues with hyphens

emorynlp / nlp4j

NLP framework for JVM languages.

http://emorynlp.github.io/nlp4j/


Tokenization issues with hyphens #3

Closed benson-basis closed 8 years ago

benson-basis commented 8 years ago

The tokenizer used in the PTB and UD corpora takes this sentence:

Stratford-upon-Avon is a junction on GWR.

and keeps the initial phrase as one token.

The Emory tokenizer splits it up, and then the dep parser does not do very well.

I'm not sure which direction to tweak this -- tokenizer or training data. Any advice?

benson-basis commented 8 years ago

I did not have my facts straight here.

jdchoi77 commented 8 years ago

Just to explain the history: the original Penn Treebank didn't split hyphenated words, which caused several issues, so the newer Treebank guidelines split at the hyphen. We adopted that convention, but Stanford didn't (at least the last time I checked).
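
For readers comparing the two conventions described above, here is a minimal sketch in Python. It uses plain regular expressions and is neither the NLP4J tokenizer nor the Stanford one; the function names are hypothetical, and real treebank guidelines cover many cases (abbreviations, clitics, exceptions to hyphen splitting) that it ignores.

```python
import re

# Hypothetical helpers for illustration only; these are NOT the NLP4J
# or Stanford tokenizers, just two regexes that mimic the conventions.
_KEEP_HYPHENS = re.compile(r"\w+(?:-\w+)*|[^\w\s]")
_SPLIT_HYPHENS = re.compile(r"\w+|[^\w\s]")


def tokenize_keep_hyphens(text):
    """Original-PTB-style: a hyphenated word stays a single token."""
    return _KEEP_HYPHENS.findall(text)


def tokenize_split_hyphens(text):
    """Newer-guideline style: each hyphen becomes its own token."""
    return _SPLIT_HYPHENS.findall(text)


sentence = "Stratford-upon-Avon is a junction on GWR."
print(tokenize_keep_hyphens(sentence))
print(tokenize_split_hyphens(sentence))
```

The first call keeps the place name as one token, while the second yields five tokens for it (three words and two hyphens). A parser trained on data tokenized one way will tend to do poorly on input tokenized the other way, so the tokenizer and the training data need to follow the same convention.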

benson-basis commented 8 years ago

Thanks. I eventually sorted myself out, understood that, and switched to the OntoNotes 5 PTB data, and all is well.