When this TokTok port is done, I think it should be the default tokenizer,
because it is multilingual and doesn't screw up URLs.
This code is untested and not yet linked in.
I just ported the Perl to sed,
or rather to the PCRE-extended sed that we actually use,
which doesn't involve much.
I will want to port over NLTK's tests, which are hopefully comprehensive enough to check that I didn't mess anything up.
I have started to port TokTok. Source: https://github.com/jonsafari/tok-tok/blob/master/tok-tok.pl (Apache 2.0 licensed).
See also NLTK's implementation: https://www.nltk.org/_modules/nltk/tokenize/toktok.html
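
For the test port, something like the following parity harness might be a starting point. It is only a sketch: `find_mismatches`, `whitespace_tokenize`, and `CASES` are hypothetical names, the stand-in tokenizer is a plain whitespace split (not the sed port), and the two cases are made up for illustration rather than copied from NLTK's test suite.

```python
from typing import Callable, List, Sequence, Tuple

Case = Tuple[str, List[str]]                      # (input text, expected tokens)
Mismatch = Tuple[str, List[str], List[str]]       # (input text, expected, actual)


def find_mismatches(tokenize: Callable[[str], List[str]],
                    cases: Sequence[Case]) -> List[Mismatch]:
    """Run every case through the tokenizer and collect the ones that disagree."""
    failures: List[Mismatch] = []
    for text, expected in cases:
        actual = tokenize(text)
        if actual != expected:
            failures.append((text, expected, actual))
    return failures


def whitespace_tokenize(text: str) -> List[str]:
    # Stand-in only (assumption): the real harness would call the sed port here.
    return text.split()


# Hypothetical cases, NOT taken from NLTK's tests.
CASES: List[Case] = [
    ("see https://example.com/a?b=c now",
     ["see", "https://example.com/a?b=c", "now"]),
    ("Hello, world", ["Hello", ",", "world"]),
]

if __name__ == "__main__":
    for text, expected, actual in find_mismatches(whitespace_tokenize, CASES):
        print(f"MISMATCH for {text!r}: expected {expected}, got {actual}")
```

Swapping `whitespace_tokenize` for a wrapper around the sed script, and `CASES` for the ported NLTK cases, would turn this into the actual regression check.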