benson-basis closed this 8 years ago
I did not have my facts straight here.
Just to explain the history: the original Penn Treebank didn't split on hyphens, which caused several issues, so the newer Treebank guidelines split at the hyphen. We adopted that convention, but Stanford didn't (at least the last time I checked).
Thanks. I eventually sorted myself out, understood that, and switched to the OntoNotes 5 PTB data, and all is well.
The tokenizer used in the PTB and UD corpora takes this sentence:
Stratford-upon-Avon is a junction on GWR.
and keeps the initial phrase as one token.
The Emory tokenizer splits it up at the hyphens, and the dependency parser then does not do very well on the result.
I'm not sure which direction to tweak this -- tokenizer or training data. Any advice?
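To make the mismatch concrete, here is a minimal sketch of the two hyphen conventions applied to the sentence above. The `tokenize` function and its `split_hyphens` flag are hypothetical and written only for illustration; this is not the Emory tokenizer, the Stanford tokenizer, or the tool used to build the PTB or UD corpora, and it ignores everything real tokenizers handle beyond whitespace, trailing punctuation, and hyphens.

```python
import re

def tokenize(text, split_hyphens):
    """Toy tokenizer (hypothetical): whitespace split, trailing punctuation
    peeled off, and hyphens optionally emitted as separate tokens."""
    tokens = []
    for chunk in text.split():
        # Separate the word from any trailing sentence punctuation.
        match = re.match(r"^(.*?)([.,;:!?]*)$", chunk)
        word, punct = match.group(1), match.group(2)
        if split_hyphens:
            # Capturing group keeps each "-" as its own token.
            tokens.extend(part for part in re.split(r"(-)", word) if part)
        elif word:
            tokens.append(word)
        # Each punctuation character becomes its own token.
        tokens.extend(punct)
    return tokens

sentence = "Stratford-upon-Avon is a junction on GWR."

unsplit = tokenize(sentence, split_hyphens=False)
split = tokenize(sentence, split_hyphens=True)

print(unsplit)  # hyphenated name stays one token
print(split)    # name becomes five tokens: three words and two hyphens
```

With these assumptions the unsplit convention yields 7 tokens and the split convention yields 11, so a parser trained on one convention sees a different token sequence at test time if the tokenizer follows the other. That is why the tokenizer and the training data need to agree on the hyphen convention.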