Closed sigmundv closed 8 months ago
Thanks. It would be nice to get a test case (something that fails in the current version but works with this fix) so that we're confident of not breaking this in the future.
This change breaks the previous logic of the `strip_non_letters` flag. The initial implementation of `strip_non_letters` retains only basic Latin letters; even other European letters are removed. However, the current preprocessing test includes an explicit check that the Greek letters "υπ" are removed:
@testset "Preprocessing" begin
    sample_text1 = "This is 1 MESSED υπ string!"
    sample_text1_wo_punctuation = "This is 1 MESSED υπ string"
    sample_text1_wo_punctuation_numbers = "This is MESSED υπ string"
    sample_text1_wo_punctuation_numbers_case = "this is messed υπ string"
    sample_text1_wo_punctuation_numbers_case_az = "this is messed string"
    #...
end
I'm not sure this test reflects a real use case. The ability to handle Unicode letters seems more useful.
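To make the trade-off concrete, here is a minimal sketch that contrasts the two behaviours on the lowercased sample string from the test above. It is not the package's code: `strip_sketch` is a hypothetical helper, the whitespace collapsing is an assumption made so that the strings compare cleanly, and `[^\p{L}\s]` is a simplified stand-in for the pattern proposed in this change.

```julia
using Test

# Hypothetical helper, not the package's API: delete every character matched
# by `pattern`, then collapse the runs of whitespace that are left behind.
strip_sketch(s::AbstractString, pattern::Regex) =
    strip(replace(replace(s, pattern => ""), r"\s+" => " "))

@testset "strip_non_letters sketch" begin
    text = "this is messed υπ string"
    # ASCII-only class: the Greek letters are dropped (the existing expectation).
    @test strip_sketch(text, r"[^a-zA-Z\s]") == "this is messed string"
    # Unicode-aware class: the Greek letters survive.
    @test strip_sketch(text, r"[^\p{L}\s]") == "this is messed υπ string"
end
```

If the Unicode-aware behaviour is adopted, the last expected string in the existing test would presumably need to keep "υπ", which is why the current check conflicts with this change.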
Changed the regex for `strip_non_letters` in `src/preprocessing.jl` to `[^…\p{L}\\s]`, because `[^a-zA-Z\\s]` matches non-ASCII letters and therefore removes letters with diacritics, for example.
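The effect on letters with diacritics can be checked directly with `replace`. This is an illustrative sketch only: the sample word is made up, the patterns are written as regex literals (so `\s` needs no double escaping), and `[^\p{L}\s]` is a simplified form of the new pattern, not necessarily the exact one in the diff.

```julia
# Patterns as regex literals; both are negated classes, so whatever they
# match is what gets deleted.
ascii_only      = r"[^a-zA-Z\s]"   # matches everything except ASCII letters and whitespace
unicode_letters = r"[^\p{L}\s]"    # matches everything except Unicode letters and whitespace

# Precomposed characters written as escapes to avoid normalization ambiguity:
# \u00ef is "ï", \u00e9 is "é".
word = "na\u00efve caf\u00e9"

stripped_ascii   = replace(word, ascii_only => "")       # "nave caf"
stripped_unicode = replace(word, unicode_letters => "")  # unchanged: "naïve café"

println(stripped_ascii)
println(stripped_unicode)
```

With the ASCII-only class the accented letters are deleted outright, which silently changes the words, while the `\p{L}` class leaves them intact.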