Faster Preprocess [WIP]

Ayushk4 commented 5 years ago

An attempt at the approach mentioned in #143. As of now, it is about as fast as the existing implementation. Some functions are still a work in progress.

Fixes #74 as well (refer to #76).

[X] strip_articles
[X] strip_pronouns
[X] strip_prepositions
[X] strip_stopwords
[X] Whitespace
[X] Corrupt_utf8
[X] Punctuation
[X] Numbers
[ ] strip_case
[ ] strip_frequent and strip_sparse
[ ] Fixes #23
[ ] Tests
[X] Docstrings
[ ] Documentation

Ayushk4 commented 5 years ago

This currently supports four operations: strip_articles, strip_pronouns, strip_prepositions, and strip_stopwords. On those four operations it is at least 4 times faster for documents of 100000 characters and 2-3 times faster for documents of 10000 characters. The speedup grows with document size, and for smaller documents it converges to the same speed as the existing implementation.

julia> @time fastpreprocess(StringDocument(s))
  0.006278 seconds (3.78 k allocations: 693.500 KiB)

julia> @time prepare!(StringDocument(s), strip_articles | strip_pronouns | strip_stopwords | strip_prepositions)
  0.024585 seconds (1.65 k allocations: 207.063 KiB)

julia> length(s)
100000

julia> @time prepare!(StringDocument(s), strip_articles | strip_pronouns | strip_stopwords | strip_prepositions)
  0.027906 seconds (1.65 k allocations: 207.063 KiB, 15.04% gc time)

julia> @time fastpreprocess(StringDocument(s))
  0.007384 seconds (3.78 k allocations: 693.500 KiB)
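
A single `@time` call can include compilation and garbage-collection noise, so a comparison like the one above is easier to trust when each function is warmed up first and timed several times. The following is a minimal sketch using only Base Julia; `min_time` is a hypothetical helper, not part of TextAnalysis.jl, and the workload is a stand-in for the two preprocessing calls.

```julia
# Hypothetical helper: not part of TextAnalysis.jl.
# Calls f(x) once to trigger compilation, then returns the fastest
# of `runs` timed calls, in seconds.
function min_time(f, x; runs::Int = 5)
    f(x)                      # warm-up call so compilation is not measured
    best = Inf
    for _ in 1:runs
        t = @elapsed f(x)     # wall-clock time of one call
        best = min(best, t)
    end
    return best
end

# Stand-in workload; replace the closure with the call being measured.
s = repeat("the quick brown fox ", 5_000)   # 100000 characters
t = min_time(x -> length(split(x)), s)
println("fastest run: ", t, " s")
```

Since `prepare!` mutates its document, the closure passed to such a helper would need to build a fresh `StringDocument(s)` on every call, as the timings above do.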

aviks commented 3 years ago

Hey @Ayushk4, can we finish this up?

Ayushk4 commented 3 years ago

I was only able to get this to work faster for the first couple of operations added. When I later applied the same token buffer approach to more operations, overall performance became much slower than the existing implementation.

For the time being, I am closing it. If I find some other way to speed it up, then I will re-open this or send another PR.
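
For readers unfamiliar with the idea, a token buffer approach can be pictured roughly as follows. This is only an illustrative sketch under assumptions, not the code from this PR: the names `strip_words`, `flush_token!`, and `DROP_WORDS` are hypothetical, and the word list is a tiny stand-in for the real article, pronoun, preposition, and stop word lists. The text is scanned once, each word is collected in a small buffer, and the word is copied to the output only if it is not in the drop set.

```julia
# Hypothetical sketch: names and word list are illustrative, not from the PR.
const DROP_WORDS = Set(["a", "an", "the", "he", "she", "it", "of", "in", "on"])

# Scan `text` once. Letters accumulate in `token`; any other character
# ends the current token, which is then kept or dropped as a whole.
function strip_words(text::AbstractString, drop::Set{String} = DROP_WORDS)
    out = IOBuffer()
    token = IOBuffer()
    for c in text
        if isletter(c)
            write(token, c)
        else
            flush_token!(out, token, drop)
            write(out, c)             # non-letter characters are copied as-is
        end
    end
    flush_token!(out, token, drop)    # handle a token that ends the text
    return String(take!(out))
end

# Copy the buffered token to `out` unless its lowercase form is in `drop`.
function flush_token!(out::IOBuffer, token::IOBuffer, drop::Set{String})
    word = String(take!(token))       # take! also empties the token buffer
    if !isempty(word) && !(lowercase(word) in drop)
        write(out, word)
    end
    return nothing
end

println(strip_words("The cat sat on the mat"))
```

In this sketch, removed words leave their surrounding spaces behind, so whitespace cleanup would remain a separate step. Each additional operation adds another check per token, which is one way the per-token cost can grow as more operations are folded into the same pass.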

JuliaText / TextAnalysis.jl

Faster Preprocess [WIP] #163