Open cathxiao opened 7 years ago
Compound words (e.g., pick-me-up, hand-me-down, know-it-all) should be tokenized as single tokens.
Hyphens are treated as separators, and the components are tokenized separately.
These should be tokenized consistently, because the same expressions can also occur without hyphens (e.g., pick me up).
Expected Behavior
Compound words (e.g., pick-me-up, hand-me-down, know-it-all) should be tokenized as single tokens.
Actual Behavior
Hyphens are treated as separators, and the components are tokenized separately.
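To make the difference concrete, here is a minimal sketch of the two behaviors using plain regular expressions. This is not the project's tokenizer: the function names `tokenize_actual` and `tokenize_expected` and both patterns are hypothetical, chosen only to illustrate the report.

```python
import re

# Hypothetical patterns for illustration; not the project's actual rules.
# Splits on hyphens: each alphanumeric run becomes its own token.
SPLIT_ON_HYPHEN = re.compile(r"[A-Za-z0-9]+")
# Keeps hyphen-joined runs together as one token.
KEEP_HYPHENATED = re.compile(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*")


def tokenize_actual(text):
    """Mimic the reported behavior: hyphens act as separators."""
    return SPLIT_ON_HYPHEN.findall(text)


def tokenize_expected(text):
    """Mimic the requested behavior: hyphenated compounds stay whole."""
    return KEEP_HYPHENATED.findall(text)


sentence = "I need a pick-me-up"
print(tokenize_actual(sentence))    # ['I', 'need', 'a', 'pick', 'me', 'up']
print(tokenize_expected(sentence))  # ['I', 'need', 'a', 'pick-me-up']
```

Note that a pattern like this keeps every hyphenated sequence together, so it does not address the unhyphenated variant (pick me up); handling that consistently would need a lexicon or a separate multi-word rule.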