JuliaText / TextAnalysis.jl

Julia package for text analysis

strip_punctuation removes uppercase letters and numbers #113

Closed cchderrick closed 5 years ago

cchderrick commented 5 years ago

Strange: strip_punctuation also removes uppercase letters and digits, not just punctuation.

julia> doc = StringDocument("Intel(tm) Core i5-3300k, is a geat CPU! ");
julia> prepare!(doc, strip_punctuation)
julia> doc.text
"ntel tm   ore i k  is a geat   "

zgornel commented 5 years ago

The preprocessing is somewhat broken. You could try StringAnalysis; its preprocessing will be backported here at some point.

using StringAnalysis
doc = "Intel(tm) Core i5-3300k, is a great CPU! ";
s1 = prepare(doc, strip_punctuation);
s2 = prepare(doc, strip_punctuation|strip_numbers);
s3 = prepare(doc, strip_punctuation|strip_whitespace);
@show s1
@show s2
@show s3
# s1 = "Intel tm  Core i5 3300k  is a great CPU  "
# s2 = "Intel tm  Core i k  is a great CPU  "
# s3 = "Intel tm Core i5 3300k is a great CPU "
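
If adding a dependency is not an option, a regex-based workaround may be enough for simple cases. The following is only a minimal sketch: `strip_punct` is a hypothetical helper name, not a function of TextAnalysis or StringAnalysis, and it assumes that replacing every Unicode punctuation character (`\p{P}`) with a space is the desired behavior.

```julia
# Hypothetical helper, not part of TextAnalysis.jl or StringAnalysis.jl.
# Replaces each Unicode punctuation character with a single space,
# leaving letters (any case) and digits untouched.
strip_punct(text::AbstractString) = replace(text, r"\p{P}" => " ")

doc = "Intel(tm) Core i5-3300k, is a great CPU! "
cleaned = strip_punct(doc)
@show cleaned
```

For this input the result should match `s1` above, with uppercase letters and digits preserved. Symbols that are not classed as punctuation (for example `+` or `$`) are left in place by this pattern.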

cchderrick commented 5 years ago

Thanks for the pointer, I will try it out in the meantime.