JuliaText / WordTokenizers.jl

High performance tokenizers for natural language processing and other related tasks

Other

96 stars 25 forks source link

Make sed-based tokenisers 30x faster #10

Closed MikeInnes closed 5 years ago

MikeInnes commented 6 years ago

Bet that made you look.

Right now this package compiles regexes and substitution strings on every call, which is obviously not great for performance. Avoiding this brings times down from ~300μs to ~10μs on my machine.

Given how simple the substitutions are we could likely get (at least) another order of magnitude by compiling code to do everything in one pass, rather than regexing the whole string ~10 times over.

codecov-io commented 6 years ago

Codecov Report

Merging #10 into master will decrease coverage by 23.03%. The diff coverage is 81.33%.

@@             Coverage Diff             @@
##           master      #10       +/-   ##
===========================================
- Coverage    98.8%   75.77%   -23.04%     
===========================================
  Files           4        6        +2     
  Lines          84      161       +77     
===========================================
+ Hits           83      122       +39     
- Misses          1       39       +38

Impacted Files	Coverage Δ
src/words/fast.jl	`79.03% <79.03%> (ø)`
src/words/sedbased.jl	`43.33% <92.3%> (-56.67%)`	:arrow_down:
src/sentences/sentence_splitting.jl	`88.13% <0%> (-9.98%)`	:arrow_down:
src/split_api.jl	`0% <0%> (ø)`

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update 5fad6ff...8a690f9. Read the comment docs.

MikeInnes commented 6 years ago

I pushed a very simple PoC for a coded version of the NLTK tokeniser. Without much work it's about 150x faster than the current thing, and with a little design effort I think it could be much easier to work with and more easily parametrisable than the current regex list.

(This still means my ~300mb corpus will take ~2min to do though, so I think it'd be worth improving on still.)

oxinabox commented 6 years ago

Thanks Mike, nice work, I knew some of this was possible. I'ld been thinking about merging the regex for a while, so it did less passes, i think it is hard and potentially undecidable. (Since PCRE)

Recompiling the regex on every pass was silly, my bad there, We also could switch away from @generated functions, instead to using @eval to generate the functions, once.

Shall we give your tokenizer a new name, and make it our default tokenizer, and be willing to improve it as required, then leave the NLTK tokenizer alone as a sed one?

MikeInnes commented 6 years ago

Yeah I think that's a good route forward; it might eventually be complete enough that it can easily replace the existing tokenisers without changing how they behave, but we can handle that then.

I'm definitely on board with getting rid of the generated functions and eval in this package, I can do that in this PR if you don't get to it first.

MikeInnes commented 6 years ago

Actually, I just went ahead and replaced the NLTK implementation, since this is passing tests now anyway. I think it's a good idea to split out a custom one but that's easy to do if we want it. It might also make sense to port the other two tokenisers to Julia code.

oxinabox commented 6 years ago

Prods @MikeInnes, so docstrings? I like docstrings as comments for internal methods just for some clarity.

MikeInnes commented 6 years ago

Yeah no argument there, just got caught up in other things. Will give this another pass as soon as I can.

oxinabox commented 5 years ago

prod

oxinabox commented 5 years ago

Do you want to cherry pick this into two.

First the stuff to just make sed tokenizers faster as one PR, that i can just merge.

Then the Framework for writing tokenizers, that needs docs, before it can be merged.

MikeInnes commented 5 years ago

I added docstrings. One benefit of waiting so long is that I now understand well why they are needed :)

oxinabox commented 5 years ago

@MikeInnes want to do the last few things so I can merge this?

oxinabox commented 5 years ago

LGTM