JuliaText / TextAnalysis.jl

Julia package for text analysis

Fixed UNICODE processing with the `strip_non_letters` flag in src/preprocessing.jl #265

Closed: sigmundv closed this 8 months ago

sigmundv commented 1 year ago

Changed the regex for `strip_non_letters` in src/preprocessing.jl to `[^…\p{L}\\s]`, because `[^a-zA-Z\\s]` also matches non-ASCII letters, so it strips letters with diacritics, for example.
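To illustrate the difference, here is a minimal sketch using only Base Julia. The pattern `[^\p{L}\s]` is a simplified stand-in for the one in this PR (whose full form is elided above), and the variable names are hypothetical; this is not the code from src/preprocessing.jl.

```julia
# Precomposed (NFC) letters with diacritics: "café naïve"
text = "caf\u00e9 na\u00efve"

# ASCII-only negated class: every character that is not a-z, A-Z or
# whitespace is removed, including é and ï.
ascii_only = replace(text, r"[^a-zA-Z\s]" => "")

# Unicode-aware negated class (assumed simplification): \p{L} matches
# any Unicode letter, so é and ï are kept.
unicode_aware = replace(text, r"[^\p{L}\s]" => "")

println(ascii_only)     # caf nave
println(unicode_aware)  # café naïve
```

Note that `\p{L}` covers letters only, so text in decomposed form (a base letter followed by a separate combining mark) would likely still lose its marks with this simplified pattern.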

aviks commented 1 year ago

Thanks. It would be nice to get a test case (something that fails in the current version but works with this fix) so that we're confident we won't break this in the future.

rssdev10 commented 8 months ago

This change breaks the previous logic of the `strip_non_letters` flag. The initial implementation of `strip_non_letters` retains only basic Latin characters; even other European letters are removed. Moreover, the current preprocessing test includes an explicit check that the Greek letters "υπ" are removed:

@testset "Preprocessing" begin

    sample_text1 = "This is 1 MESSED υπ string!"
    sample_text1_wo_punctuation = "This is 1 MESSED υπ string"
    sample_text1_wo_punctuation_numbers = "This is  MESSED υπ string"
    sample_text1_wo_punctuation_numbers_case = "this is  messed υπ string"
    sample_text1_wo_punctuation_numbers_case_az = "this is  messed  string"
#...
end

I'm not sure this test reflects a real use case, though. The ability to handle Unicode is more useful.
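For reference, a regression test along the lines requested earlier in the thread might look roughly like the sketch below. It relies only on the `Test` standard library; `strip_non_letters_unicode` is a hypothetical helper standing in for the flag's behaviour, and `[^\p{L}\s]` is an assumed simplification of the PR's pattern, not the package's actual implementation.

```julia
using Test

# Hypothetical stand-in for the Unicode-aware behaviour of the flag.
strip_non_letters_unicode(s::AbstractString) = replace(s, r"[^\p{L}\s]" => "")

# Hypothetical stand-in for the original ASCII-only behaviour.
strip_non_letters_ascii(s::AbstractString) = replace(s, r"[^a-zA-Z\s]" => "")

@testset "strip_non_letters and non-ASCII letters (sketch)" begin
    input = "this is  messed υπ string"

    # The ASCII-only pattern drops the Greek letters, as the existing test expects.
    @test strip_non_letters_ascii(input) == "this is  messed  string"

    # A Unicode-aware pattern keeps them, so the existing expectation would change.
    @test strip_non_letters_unicode(input) == "this is  messed υπ string"
end
```

If the Unicode-aware behaviour were adopted, the expected value of `sample_text1_wo_punctuation_numbers_case_az` in the existing test would need to be updated accordingly.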