JuliaText / TextAnalysis.jl

Julia package for text analysis

PR: To address performance issues with stopword removal #141

Closed asbisen closed 5 years ago

asbisen commented 5 years ago

PR to address the performance regression reported in #140. This brings the time down from 940s to 0.27s for my test dataset (~3.4MB).

The primary change is a replacement of the remove_patterns method, which in turn forced a change to how strip_whitespace is implemented in the prepare! method:

# Delete every match of `rex` in `s` by replacing it with the empty string.
function remove_patterns(s::AbstractString, rex::Regex)
    return replace(s, rex => "")
end

I have also modified the test cases to make them consistent: stripping punctuation or stripping a pattern now replaces the match with a zero-length string, i.e. deletes the matched text.
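To illustrate the "delete the match" behavior described above, here is a minimal sketch. The function body is copied from this PR; the sample strings and the stopword regex are made up for illustration and are not taken from the package's test suite.

```julia
# Same definition as in this PR: every match is replaced by "".
function remove_patterns(s::AbstractString, rex::Regex)
    return replace(s, rex => "")
end

# Illustrative inputs only (not from the TextAnalysis.jl tests).
no_punct = remove_patterns("Hello, world!", r"[[:punct:]]")

# A hypothetical two-word stopword pattern.
no_stop = remove_patterns("the cat sat on a mat", r"\b(the|a)\b")

println(repr(no_punct))  # "Hello world"
println(repr(no_stop))   # " cat sat on  mat"
```

Note that deleting whole words leaves the surrounding spaces behind (a leading space and a doubled space in the second result), so whitespace cannot be handled by plain deletion.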

This required special handling for whitespace removal: any run of one or more whitespace characters is replaced with a single space, and all leading and trailing whitespace is stripped.
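The whitespace rule described above could be sketched roughly as follows. This is an assumption about the behavior, not the code in this PR: squeeze_whitespace is a hypothetical name, and the actual implementation inside prepare! may differ.

```julia
# Hypothetical helper (not part of TextAnalysis.jl): collapse each run of
# whitespace to one space, then trim leading and trailing whitespace.
function squeeze_whitespace(s::AbstractString)
    collapsed = replace(s, r"\s+" => " ")
    return String(strip(collapsed))
end

cleaned = squeeze_whitespace("  cat   sat on \t mat \n")
println(repr(cleaned))  # "cat sat on mat"
```

Unlike pattern and punctuation removal, the replacement here is a single space and not an empty string; deleting the whitespace outright would glue adjacent words together.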

I don't think there is a single right way to do certain pre-processing tasks. For example, with strip_punctuation, what is the correct way to handle the following strings when removing punctuation?