pwichmann opened 5 years ago
Hi @pwichmann - have you seen the `--split-contractions` option here?
https://github.com/fnl/segtok/blob/master/segtok/tokenizer.py#L344
Or, if you are using segtok programmatically, the public `split_contractions` function for post-processing tokens, here?
https://github.com/fnl/segtok/blob/master/segtok/tokenizer.py#L122
If you have, can you be specific about what isn't working for you when using that functionality?
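A minimal sketch of that programmatic route (assuming segtok is installed; the output shown in the comment is my expectation based on this thread, not a verified result):

```python
# Sketch: tokenize first, then post-process with split_contractions.
# Assumption: segtok is installed (pip install segtok); the expected
# output below is illustrative, based on this thread.
from segtok.tokenizer import word_tokenizer, split_contractions

tokens = word_tokenizer("Toyota's engineers don't rest.")
print(split_contractions(tokens))
# Expected: ['Toyota', "'s", 'engineers', 'do', "n't", 'rest', '.']
```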
I had not seen this. High likelihood of the user (me) being the problem, not the software. Will investigate.
Is it possible that the word tokenizer does not split off the apostrophe or apostrophe-s? E.g. `Toyota's` is treated as a single token instead of being split into `Toyota` and `'s`.
This has caused me quite a bit of headache. Wouldn't it be more common to split these?
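A quick reproduction sketch of what I mean (assumes segtok is installed; the commented output reflects the behaviour described above, not a re-verified run):

```python
from segtok.tokenizer import word_tokenizer

# Default tokenization, without any contraction-splitting post-processing.
print(word_tokenizer("Toyota's"))
# Behaviour described above: ["Toyota's"]  -- one token; "'s" is not split off
```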