elitcloud / elit

🔠 Evolution of Language and Information Technology
https://elit.cloud

Compound words tokenization failure #8

Open cathxiao opened 7 years ago

cathxiao commented 7 years ago

Expected Behavior

Hyphenated compound words (e.g., pick-me-up, hand-me-down, know-it-all) should be tokenized as single tokens.

Actual Behavior

Hyphens are treated as separators, and the components are tokenized separately.
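
To make the difference concrete, here is a minimal sketch in plain Python contrasting the two behaviors. It is not ELIT's tokenizer: the function names and regular expressions are hypothetical, and the report does not say whether the hyphens themselves come out as tokens, so emitting them here is an assumption.

```python
import re

def tokenize_split_hyphens(text):
    """Hypothetical stand-in for the reported behavior: hyphens act as separators.

    Assumption: each hyphen is emitted as its own token.
    """
    return re.findall(r"[A-Za-z0-9]+|[^\sA-Za-z0-9]", text)

def tokenize_keep_compounds(text):
    """Hypothetical stand-in for the expected behavior: hyphen-joined words stay whole."""
    return re.findall(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*|[^\sA-Za-z0-9]", text)

sentence = "She needs a pick-me-up."
print(tokenize_split_hyphens(sentence))
print(tokenize_keep_compounds(sentence))
```

The only difference between the two patterns is the optional `(?:-[A-Za-z0-9]+)*` group, which lets a word absorb any hyphen-joined continuation.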

jdchoi77 commented 7 years ago

These should be split into separate tokens because they can also occur without the hyphens (e.g., pick me up), and both forms should be tokenized consistently.
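
The consistency argument can be sketched in a few lines of plain Python. This is only an illustration, not ELIT code: both helper functions are hypothetical, and dropping the hyphens from the output is an assumption made to keep the comparison simple.

```python
import re

def split_tokens(text):
    """Hypothetical: split on whitespace and hyphens, dropping the hyphens."""
    return [t for t in re.split(r"[\s-]+", text) if t]

def whole_tokens(text):
    """Hypothetical: split on whitespace only, so hyphenated compounds stay intact."""
    return text.split()

hyphenated, spaced = "pick-me-up", "pick me up"

# Splitting on hyphens yields the same token sequence for both spellings.
print(split_tokens(hyphenated) == split_tokens(spaced))

# Keeping compounds whole yields different sequences for the two spellings.
print(whole_tokens(hyphenated) == whole_tokens(spaced))
```

Under the splitting policy the two spellings share one analysis, whereas the single-token policy gives downstream components two different inputs for the same expression.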