Open asbisen opened 5 years ago
FYI: in a quick test, using replace while looping over stop_words appears to be much faster than the existing remove_pattern method: under 2 s versus roughly 940 s.
using Languages

function preprocess(str_rec)
    stop_words = Languages.stopwords(Languages.English())
    str_rec = lowercase(str_rec)
    # Remove each stop word as a whole word (\b marks a word boundary).
    for sw in stop_words
        rex = Regex("\\b" * sw * "\\b")
        str_rec = replace(str_rec, rex => "")
    end
    return str_rec
end
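A possible variation on the loop above, offered as a sketch rather than a measured improvement: instead of one replace call per stop word, join the words into a single alternation so the text is scanned once. The stop-word list and the function name strip_stopwords_once below are hypothetical, and the sketch assumes the words contain no regex metacharacters.

```julia
# Hypothetical stop-word list; in practice this could come from
# Languages.stopwords(Languages.English()).
const DEMO_STOP_WORDS = ["the", "on", "a"]

# Build one regex of the form \b(?:the|on|a)\b and apply it in a single pass.
# Assumption: no word contains regex metacharacters, so no escaping is done.
function strip_stopwords_once(text::AbstractString, stop_words)
    rex = Regex("\\b(?:" * join(stop_words, "|") * ")\\b")
    return replace(lowercase(text), rex => "")
end

cleaned = strip_stopwords_once("The cat sat on a mat", DEMO_STOP_WORDS)
println(split(cleaned))
```

Whether this beats the per-word loop on a large document would need to be timed; a very long alternation can be slow in its own way.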
Not sure if my input is warranted, but I wanted to post a solution that worked for me. Note, however, that this process removes stop words from the return value of tokenize (using WordTokenizers).
using Languages, WordTokenizers

STOPWORDS = stopwords(Languages.English());

"""
    my_tokenize(text, sw)

Return the tokenized words in `text` as a vector, with stop words removed by default.
To return only the stop words in `text`, set the argument `sw` to `"only"`.
"""
function my_tokenize(text, sw::String="remove")
    if sw == "remove"
        # Keep every token that is not a stop word.
        return collect(word for word in tokenize(text) if !(word in STOPWORDS))
    elseif sw == "only"
        # Keep only the stop words.
        return collect(word for word in tokenize(text) if word in STOPWORDS)
    else
        return collect(word for word in tokenize(text))
    end
end
I then apply it like:
purpose = select(t_new, :purpose);
lower = lowercase.(purpose);
num_words = length.(my_tokenize.(lower));
I welcome suggestions for improvement, but this was fast and worked for my use case.
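One improvement that may be worth trying, sketched here under assumptions: store the stop words in a Set so each membership test is a hash lookup instead of a linear scan over a vector. The names STOPWORD_SET and filter_tokens are hypothetical, the word list is a tiny stand-in, and split is used in place of tokenize only to keep the example free of package dependencies.

```julia
# Hypothetical stand-in for Set(stopwords(Languages.English())).
const STOPWORD_SET = Set(["the", "on", "a"])

# Return the tokens that are not stop words, or only the stop words
# when keep_stopwords is true.
function filter_tokens(tokens, stopset; keep_stopwords::Bool=false)
    return [t for t in tokens if (t in stopset) == keep_stopwords]
end

# split on whitespace stands in for WordTokenizers.tokenize here.
tokens = split(lowercase("The cat sat on a mat"))
println(filter_tokens(tokens, STOPWORD_SET))
println(filter_tokens(tokens, STOPWORD_SET; keep_stopwords=true))
```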
Calling
prepare!(StringDocument, strip_case | strip_stopwords)
even on a small ~3.4MB file takes forever to return (other than for small strings, I have not seen this function finish successfully). I have tracked the slowness to the following function.
https://github.com/JuliaText/TextAnalysis.jl/blob/c8ae7a217d19f19d8c8e3e22da9ea5970ece40d4/src/preprocessing.jl#L253
Manually performing a similar task takes ~1.4 seconds on a 3.4MB text file. I say "similar" because, to eliminate stop words manually, I first tokenize the document and then filter out the stop words. This is functionally very different from running a regex over a large string, and it may not be the ideal approach for preserving the structure of the document. (complete code at the end)
I was wondering if there could be a more efficient way to eliminate stop words from a StringDocument?
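One direction that might keep the document structure intact while avoiding a regex per stop word, shown only as a sketch: match every word once with a single generic pattern and decide per match, via a Set lookup, whether to drop it. Whitespace and punctuation are left where they were. The function name strip_words and the one-word stop list are hypothetical, and this is not how TextAnalysis.jl implements prepare!.

```julia
# Replace each word with "" if its lowercase form is in stopset;
# everything that is not a word (spaces, punctuation) is left untouched.
function strip_words(text::AbstractString, stopset::Set{String})
    return replace(text, r"\w+" => w -> lowercase(w) in stopset ? "" : w)
end

# Hypothetical one-word stop list, for illustration only.
demo_set = Set(["the"])
println(repr(strip_words("The cat, the dog.", demo_set)))
```

The removed words leave their surrounding spaces behind, so a follow-up whitespace cleanup may be wanted depending on the use case.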