JuliaText / WordTokenizers.jl

High performance tokenizers for natural language processing and other related tasks

[WIP] Port TokTok #5

Closed (oxinabox closed 5 years ago)

oxinabox commented 6 years ago

I have started to port TokTok. Source: https://github.com/jonsafari/tok-tok/blob/master/tok-tok.pl It is licensed under Apache 2.0.

See also NLTK's implementation: https://www.nltk.org/_modules/nltk/tokenize/toktok.html

When this is done, I think it should be the default tokenizer, because it is multilingual and doesn't screw up URLs.

This code is untested and not yet linked in. I just ported the Perl to sed, or rather to the PCRE-extended sed that we actually use, so the port doesn't involve much.
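For readers unfamiliar with the approach, a sed-style tokenizer is just an ordered list of regex substitutions applied to the whole string, followed by a split on whitespace. Here is a minimal sketch of that idea in Python (the language of the NLTK version linked above). The three rules and the names `RULES` and `tokenize` are illustrative assumptions only; they are not the actual TokTok rule set and not this package's API.

```python
import re

# Illustrative rules only; this is NOT the real TokTok rule list.
# Each entry is (compiled pattern, replacement), applied in order.
RULES = [
    # Pad a few punctuation marks with spaces so they become tokens.
    (re.compile(r'([,;!?()"])'), r" \1 "),
    # Split off a period only at the very end of the text, so periods
    # inside URLs or abbreviations are left untouched.
    (re.compile(r"\.\s*$"), " ."),
    # Collapse runs of whitespace introduced by the padding above.
    (re.compile(r"\s+"), " "),
]


def tokenize(text):
    """Apply every substitution in order, then split on whitespace."""
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text.split()


print(tokenize("See https://example.com/a.b, ok."))
```

Because no rule in this sketch touches `:`, `/`, or non-final periods, the URL comes through as a single token, which is the property that motivates using TokTok as a default.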

I will want to port over NLTK's tests, which are hopefully comprehensive enough to check that I didn't mess anything up.