oxinabox closed this 5 years ago
@oxinabox could you link to the NLTK tests that you mentioned?
I thought they would be in https://github.com/nltk/nltk/blob/develop/nltk/test/unit/test_tokenize.py but they are not. We may have to write our own, probably by using NLTK to generate reference tokenizations.
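One way to produce such reference tokenizations is sketched below. This is only an illustration, not anything from the repository: the helper names `build_reference_cases` and `dump_reference_cases` and the JSON layout are made up here. Any callable can act as the reference tokenizer; with NLTK installed, something like `TweetTokenizer().tokenize` could be passed in.

```python
import json


def build_reference_cases(sentences, tokenize):
    """Pair each input sentence with the tokens the reference tokenizer yields."""
    return [{"input": s, "expected": list(tokenize(s))} for s in sentences]


def dump_reference_cases(cases):
    """Serialize the cases as JSON so another test suite can load them."""
    return json.dumps(cases, ensure_ascii=False, indent=2)


# With NLTK installed, the reference tokenizer could be, for example:
#   from nltk.tokenize import TweetTokenizer
#   tokenize = TweetTokenizer().tokenize
# A plain whitespace splitter stands in here so the sketch runs without NLTK.
cases = build_reference_cases(["Hello world", "a  b"], str.split)
print(dump_reference_cases(cases))
```

The resulting JSON could then be checked in as a fixture and loaded from the Julia test suite, so the tests themselves would not need NLTK at run time.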
Also, I looked at fast.jl. It would be great to have a regex-based lookahead function. If that seems right, maybe I can work on that first.
Work on it as part of the same PR?
No, in a separate PR
In general most PRs either:
Adding a feature to the purely internal TokenBuffer would be unusual, particularly when there are not multiple PRs waiting on such a feature.
But I am not against it. A small PR is an easy-to-review PR.
It will want good unit tests.
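For illustration, the regex-based lookahead raised earlier in the thread might behave roughly like the sketch below. It is written in Python only to show the idea. The `Buffer` class and the `lookahead` function are hypothetical stand-ins and do not reflect the actual TokenBuffer API in fast.jl.

```python
import re


class Buffer:
    """Hypothetical stand-in for a token buffer: input text, cursor, tokens."""

    def __init__(self, text):
        self.text = text
        self.idx = 0
        self.tokens = []


def lookahead(buf, pattern):
    """If `pattern` matches at the cursor, emit the match as one token.

    Returns True and advances the cursor on a non-empty match; otherwise
    leaves the buffer untouched and returns False.
    """
    m = re.compile(pattern).match(buf.text, buf.idx)
    if m is None or m.end() == buf.idx:
        return False
    buf.tokens.append(m.group(0))
    buf.idx = m.end()
    return True


buf = Buffer("#julia rocks")
matched = lookahead(buf, r"#\w+")  # consumes the hashtag as a single token
```

A unit test for such a function would probably cover both branches: a match that emits a token and moves the cursor, and a failed match that leaves the buffer unchanged.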
Oh, I see! No worries, I will add all of it in a single PR :)
An earlier incomplete hack at https://github.com/JuliaText/WordTokenizers.jl/pull/5 exists but was never tested.
We should port it and use the new TokenBuffer API: https://github.com/JuliaText/WordTokenizers.jl/blob/master/src/words/fast.jl
Summary (From #5)