JuliaText / WordTokenizers.jl

High performance tokenizers for natural language processing and other related tasks

Add TokTok tokenizer #15

Closed oxinabox closed 5 years ago

oxinabox commented 5 years ago

An earlier incomplete hack at https://github.com/JuliaText/WordTokenizers.jl/pull/5 exists but was never tested.

We should port it to use the new TokenBuffer API: https://github.com/JuliaText/WordTokenizers.jl/blob/master/src/words/fast.jl

Summary (From #5)

Source: https://github.com/jonsafari/tok-tok/blob/master/tok-tok.pl (Apache 2.0 licensed).

See also NLTK's implementation: https://www.nltk.org/_modules/nltk/tokenize/toktok.html

When this is done, I think it should be the default tokenizer, because it is multilingual and doesn't screw up URLs.

This code is untested and not yet linked in. I just ported the Perl to sed, or rather to the PCRE-extended sed that we actually use, which doesn't involve much.

We will want to port over NLTK's tests, which are hopefully comprehensive enough to check that I didn't mess anything up.

aquatiko commented 5 years ago

@oxinabox could you link to the NLTK tests that you mentioned?

oxinabox commented 5 years ago

I thought they would be in https://github.com/nltk/nltk/blob/develop/nltk/test/unit/test_tokenize.py but they are not. We may have to write our own, probably by using NLTK to generate reference tokenizations.
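One way such reference tokenizations could be generated is sketched below. It assumes NLTK is installed and uses its `ToktokTokenizer` class. The helper names (`nltk_toktok`, `reference_cases`, `dump_cases`) and the JSON fixture layout are made up for illustration and are not part of NLTK or WordTokenizers.jl.

```python
import json
from typing import Callable, Iterable, List


def nltk_toktok() -> Callable[[str], List[str]]:
    """Return NLTK's Tok-Tok tokenize method (assumes `pip install nltk`)."""
    # Imported lazily so the other helpers work without NLTK installed.
    from nltk.tokenize.toktok import ToktokTokenizer
    return ToktokTokenizer().tokenize


def reference_cases(sentences: Iterable[str],
                    tokenize: Callable[[str], List[str]]) -> List[dict]:
    """Pair each input sentence with the tokens the reference tokenizer yields."""
    return [{"input": s, "expected": list(tokenize(s))} for s in sentences]


def dump_cases(cases: List[dict]) -> str:
    """Serialize cases as JSON; ensure_ascii=False keeps multilingual text readable."""
    return json.dumps(cases, ensure_ascii=False, indent=2)


# Usage (requires NLTK):
#   samples = ["Is 9.5 or 525,600 my favorite number?",
#              "See https://github.com/JuliaText/WordTokenizers.jl for details."]
#   print(dump_cases(reference_cases(samples, nltk_toktok())))
```

The resulting JSON could be checked into the test directory and read from the Julia test suite, so the tests themselves would not need Python at run time.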

aquatiko commented 5 years ago

Also, I looked in fast.jl. It would be great if there could be some regex-based lookahead function. If that seems right, maybe I can work on that first.
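To make the idea concrete, here is a rough sketch of what a regex-based lookahead could look like. It is written in Python purely for illustration; a real version would be Julia code in fast.jl. `SketchBuffer`, `regex_lookahead`, and the two patterns are hypothetical and do not mirror the actual TokenBuffer API.

```python
import re
from typing import List


class SketchBuffer:
    """Hypothetical stand-in for a token buffer: input text, a cursor, emitted tokens."""

    def __init__(self, text: str) -> None:
        self.text = text
        self.idx = 0
        self.tokens: List[str] = []

    def done(self) -> bool:
        return self.idx >= len(self.text)


def regex_lookahead(buf: SketchBuffer, pattern: "re.Pattern") -> bool:
    """If `pattern` matches at the cursor, emit the match as one token and advance."""
    m = pattern.match(buf.text, buf.idx)  # match() only succeeds starting at idx
    if m is None or m.end() == buf.idx:   # no match, or an empty match
        return False
    buf.tokens.append(m.group())
    buf.idx = m.end()
    return True


# Illustrative patterns only; these are not Tok-Tok's actual rules.
URL = re.compile(r"https?://\S+")
WORD = re.compile(r"\w+")


def tokenize(text: str) -> List[str]:
    buf = SketchBuffer(text)
    while not buf.done():
        if buf.text[buf.idx].isspace():
            buf.idx += 1                      # skip whitespace
        elif regex_lookahead(buf, URL) or regex_lookahead(buf, WORD):
            continue                          # a pattern consumed one token
        else:
            buf.tokens.append(buf.text[buf.idx])  # fall back to a single character
            buf.idx += 1
    return buf.tokens
```

With these patterns, `tokenize("see https://example.com/a?b=1 now!")` keeps the URL as a single token and splits the trailing `!` off `now`.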

oxinabox commented 5 years ago

Work on it as part of the same PR?

aquatiko commented 5 years ago

No, in a separate PR

oxinabox commented 5 years ago

In general most PRs either:

Adding a feature to the purely internal TokenBuffer would be unusual, particularly when there are not multiple PRs waiting on such a feature. But I am not against it. A small PR is an easy-to-review PR. It will want good unit tests.

aquatiko commented 5 years ago

Oh, I see!! No worries, I will add all of it in a single PR :)