Closed Ayushk4 closed 3 years ago
This currently supports the strip_articles, strip_pronouns, strip_prepositions, and strip_stopwords operations. On those 4 operations it is at least 4 times faster for documents of 100000 characters and 2-3 times faster for documents of 10000 characters. The speedup grows with document size; for smaller documents the speed converges to that of the existing implementation.
julia> @time fastpreprocess(StringDocument(s))
0.006278 seconds (3.78 k allocations: 693.500 KiB)
julia> @time prepare!(StringDocument(s), strip_articles | strip_pronouns | strip_stopwords | strip_prepositions)
0.024585 seconds (1.65 k allocations: 207.063 KiB)
julia> length(s)
100000
julia> @time prepare!(StringDocument(s), strip_articles | strip_pronouns | strip_stopwords | strip_prepositions)
0.027906 seconds (1.65 k allocations: 207.063 KiB, 15.04% gc time)
julia> @time fastpreprocess(StringDocument(s))
0.007384 seconds (3.78 k allocations: 693.500 KiB)
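For anyone re-running timings like the ones above: the first call to a Julia function includes JIT compilation, so it helps to make one warm-up call before measuring. Below is a minimal sketch of that pattern. `strip_words` and the `articles` set are hypothetical stand-ins written for this example; they are not part of TextAnalysis.jl and not the `fastpreprocess` from this PR.

```julia
# Hypothetical stand-in for a preprocessing function (not part of TextAnalysis.jl).
# It drops every whitespace-separated token whose lowercase form is in `words`.
strip_words(text, words) =
    join(filter(w -> !(lowercase(w) in words), split(text)), " ")

# Hypothetical, deliberately tiny word list.
articles = Set(["a", "an", "the"])

# Test string of roughly 10^5 characters.
s = repeat("the quick brown fox jumps over a lazy dog ", 2500)

strip_words(s, articles)               # warm-up call, triggers compilation
t = @elapsed strip_words(s, articles)  # second call measures run time only
println("elapsed: ", t, " seconds")
```

For less noisy numbers, `@btime` from the BenchmarkTools.jl package runs the expression many times and reports the minimum.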
Hey @Ayushk4, can we finish this up?
I was only able to get this to work faster for the first couple of operations I added. When I later applied the same token buffer approach to more operations, overall performance became much slower than the existing implementation.
For the time being, I am closing it. If I find some other way to speed it up, then I will re-open this or send another PR.
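For readers unfamiliar with the idea, here is a minimal sketch of what a single-pass, buffer-based filter can look like. It is an illustration under stated assumptions, not the code from this PR: the word lists, `DROP`, and `strip_in_one_pass` are hypothetical names, and the real lists in TextAnalysis.jl are much longer.

```julia
# Hypothetical, shortened word lists used only for this sketch.
const ARTICLES = ["a", "an", "the"]
const PRONOUNS = ["he", "she", "it", "they"]
const PREPOSITIONS = ["in", "on", "of", "over"]

# Merge all lists once so that each token needs a single set lookup,
# instead of one pass over the text per word class.
const DROP = Set(vcat(ARTICLES, PRONOUNS, PREPOSITIONS))

function strip_in_one_pass(text::AbstractString, drop::Set{String} = DROP)
    out = IOBuffer()            # output buffer that collects the kept tokens
    first_token = true
    for token in split(text)    # split on whitespace, skipping empty tokens
        (lowercase(token) in drop) && continue   # drop unwanted words
        first_token || print(out, ' ')           # single space between tokens
        print(out, token)
        first_token = false
    end
    return String(take!(out))
end

println(strip_in_one_pass("The fox jumps over the lazy dog"))
```

Note that this sketch normalizes all whitespace to single spaces, which may differ from how `prepare!` treats the text.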
An attempt at the approach mentioned in #143. As of now, it is about as fast as the existing one. Still a work in progress for some functions.
Fixes #74 as well (refer to #76).