This PR addresses the performance regression reported in #140. It brings the run time down from 940s to 0.27s on my test dataset (~3.4MB).
The primary change is the replacement of the remove_patterns method, which in turn forced a change to how the prepare! method implements strip_whitespace:
function remove_patterns(s::AbstractString, rex::Regex)
    return replace(s, rex => "")
end
I have also modified the test cases to make them consistent: stripping punctuation or stripping a pattern now replaces the match with a zero-length string, i.e. deletes the matched pattern.
This required special handling for whitespace removal, where a run of one or more spaces is replaced with a single blank space (length 1), and all leading and trailing spaces are stripped.
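A minimal sketch of that whitespace rule, for illustration only. The helper name squeeze_whitespace is hypothetical and is not the code in this PR; it also assumes "spaces" means any whitespace character (\s), which may be broader than what prepare! actually matches.

```julia
# Hypothetical helper (not part of this PR): collapse each run of
# whitespace into a single space, then trim both ends of the string.
# Assumption: tabs and newlines count as whitespace here (\s).
function squeeze_whitespace(s::AbstractString)
    collapsed = replace(s, r"\s+" => " ")
    return strip(collapsed)
end
```

Unlike remove_patterns, which deletes every match, this keeps one separator per run so adjacent words do not get fused together.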
I don't think there is a single right way to do certain pre-processing tasks. For example, with strip_punctuation, what is the correct way to handle the following strings when removing punctuation?
"don't mind!" => "don t mind" or "dont mind"
"Intel(tm) Core i5-3300k" => "Intel tm Core i5 3300k" or "Inteltm Core i53300k"
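The two candidate behaviours can be compared side by side with a small sketch. The function names delete_punct and space_punct are hypothetical, and the [[:punct:]] character class is an assumption about what counts as punctuation; the actual strip_punctuation pattern may differ.

```julia
# Hypothetical helpers for comparing the two options; neither is the
# library's strip_punctuation. Assumption: punctuation = [[:punct:]].

# Option 1: delete punctuation outright (what this PR does for patterns).
delete_punct(s::AbstractString) = replace(s, r"[[:punct:]]" => "")

# Option 2: turn punctuation into a space, then collapse runs of
# whitespace and trim, so token boundaries are preserved.
function space_punct(s::AbstractString)
    spaced = replace(s, r"[[:punct:]]" => " ")
    return strip(replace(spaced, r"\s+" => " "))
end

for s in ("don't mind!", "Intel(tm) Core i5-3300k")
    println(repr(s), " => ", repr(delete_punct(s)), " or ", repr(space_punct(s)))
end
```

Deleting keeps contractions in one piece ("dont") but fuses tokens such as "i53300k"; substituting a space keeps the tokens apart but splits "don't" into "don" and "t".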