fnl / segtok

Segtok v2 is here: https://github.com/fnl/syntok -- A rule-based sentence segmenter (splitter) and a word tokenizer using orthographic features.
http://fnl.es/segtok-a-segmentation-and-tokenization-library.html
MIT License

Word tokenizer does not split apostrophe and apostrophe s #20

Open pwichmann opened 5 years ago

pwichmann commented 5 years ago

Is it possible that the word tokenizer does not split off a bare apostrophe or an apostrophe-s? E.g. "Toyota's" is treated as a single token rather than being split into "Toyota" and "'s".

This has caused me quite a bit of headache. Wouldn't it be more common to split these?

fnl commented 5 years ago

Hi @pwichmann - have you seen the --split-contractions option here? https://github.com/fnl/segtok/blob/master/segtok/tokenizer.py#L344

Or the public split_contractions function to post-process tokens if you are using this programmatically, here? https://github.com/fnl/segtok/blob/master/segtok/tokenizer.py#L122

If you have, can you be specific about what isn't working for you when using that functionality?

pwichmann commented 5 years ago

I had not seen this. High likelihood of the user (me) being the problem, not the software. Will investigate.