JuliaText / TextAnalysis.jl

Julia package for text analysis

Faster Preprocess [WIP] #163

Closed. Ayushk4 closed this 3 years ago

Ayushk4 commented 5 years ago

An attempt at the approach mentioned in #143. As of now, it is about as fast as the existing implementation. Some functions are still work in progress.

Fixes #74 as well (refer to #76).

Ayushk4 commented 5 years ago

This currently supports four operations: strip_articles, strip_pronouns, strip_prepositions, and strip_stopwords. On those four operations it is at least 4 times faster for documents of 100000 characters and 2-3 times faster for documents of 10000 characters. The speedup grows with document size; for smaller documents it converges to the same speed as the existing implementation.

julia> @time fastpreprocess(StringDocument(s))
  0.006278 seconds (3.78 k allocations: 693.500 KiB)

julia> @time prepare!(StringDocument(s), strip_articles | strip_pronouns | strip_stopwords | strip_prepositions)
  0.024585 seconds (1.65 k allocations: 207.063 KiB)

julia> length(s)
100000

julia> @time prepare!(StringDocument(s), strip_articles | strip_pronouns | strip_stopwords | strip_prepositions)
  0.027906 seconds (1.65 k allocations: 207.063 KiB, 15.04% gc time)

julia> @time fastpreprocess(StringDocument(s))
  0.007384 seconds (3.78 k allocations: 693.500 KiB)
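
For readers following the thread, here is a minimal sketch of one way to read the single-pass idea: merge the word lists into one set, tokenize once, and write the surviving tokens to a buffer. This is not the code from this PR. The function name `strip_words_single_pass` and the short word lists are hypothetical, and it uses only Base Julia, not the TextAnalysis.jl API.

```julia
# Hypothetical, heavily shortened word lists (assumption for illustration only).
const ARTICLES     = Set(["a", "an", "the"])
const PRONOUNS     = Set(["i", "you", "he", "she", "it", "we", "they"])
const PREPOSITIONS = Set(["in", "on", "at", "of", "to", "with"])

# Merge every list into one set so each token needs a single hash lookup.
const DROP = union(ARTICLES, PRONOUNS, PREPOSITIONS)

# Tokenize on whitespace once, skip unwanted tokens, and rebuild the text
# in an IOBuffer. Matching is case-insensitive; kept tokens keep their case.
function strip_words_single_pass(text::AbstractString, drop::Set{String}=DROP)
    buf = IOBuffer()
    first_token = true
    for token in split(text)
        lowercase(token) in drop && continue
        first_token || print(buf, ' ')   # single space between kept tokens
        print(buf, token)
        first_token = false
    end
    return String(take!(buf))
end

strip_words_single_pass("The cat sat on the mat with a hat")  # "cat sat mat hat"
```

Note that this sketch collapses runs of whitespace and treats punctuation as part of a token, so it is only a rough stand-in for real preprocessing.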

aviks commented 3 years ago

Hey @Ayushk4, can we finish this up?

Ayushk4 commented 3 years ago

I was only able to get this to work faster on the first couple of operations added. When I later applied the same token buffer approach to more operations, overall performance became much slower than the existing implementation.

For the time being, I am closing it. If I find some other way to speed it up, then I will re-open this or send another PR.