gagolews / stringi

Fast and portable character string processing in R (with the Unicode ICU)
https://stringi.gagolewski.com/

Not detecting separators #278

Closed koheiw closed 6 years ago

koheiw commented 7 years ago

I found an interesting text on the internet that contains a lot of non-printing characters. I tried to clean it using stri_replace_all_regex, but it did not work. This seems like a bug.

library(stringi)
txt <- "( ͡° ͜ʖ ͡°) Who's the chick in your profile picture, I see her everywhere"

(toks <- stri_split_boundaries(txt, tokens_only = TRUE, skip_word_none = FALSE))

txt2 <- stri_replace_all_regex(txt, "[\\p{Z}\\p{C}]", ' ') # clean

(toks2 <- stri_split_boundaries(txt2, tokens_only = TRUE, skip_word_none = FALSE))

stri_detect_regex(toks2[[1]], "[\\p{C}]")
# [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
stri_detect_regex(toks2[[1]], "[\\p{Z}]")
# [1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE
stri_detect_regex(toks2[[1]], " ")
# [1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE

Probably for the same reason, these do not return anything.

(toks <- stri_split_boundaries(txt, tokens_only = TRUE, skip_word_none = TRUE))
(toks2 <- stri_split_boundaries(txt2, tokens_only = TRUE, skip_word_none = TRUE))

gagolews commented 7 years ago

Interesting, thanks. But again - I wonder if this will be the case with the latest ICU. Perhaps they already updated the Unicode Character Database...
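To see which ICU and Unicode versions a given stringi installation actually uses, something like the following should be enough. It is only a quick check, not a diagnosis of the issue; the variable name `info` is just for illustration.

```r
library(stringi)

# stri_info() reports details of the ICU build that stringi is linked against
info <- stri_info()

info$ICU.version      # version of the ICU library in use
info$Unicode.version  # version of Unicode supported by that ICU build
```

Comparing these values across machines may help tell whether a difference in behavior comes from the Unicode data rather than from stringi itself.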

gagolews commented 6 years ago

I guess stri_replace_all_regex(txt, "[\\p{Z}\\p{C}\\p{S}\\p{P}\\p{M}]", ' ') almost does the trick. The problem is the "nose" element, U+0296 (LATIN LETTER INVERTED GLOTTAL STOP), which is a letter and is therefore not matched by any of these classes: http://www.fileformat.info/info/unicode/char/0296/index.htm
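A minimal sketch of why the pattern above misses the "nose", and one possible workaround. The string `txt` is rebuilt here from escape sequences for the visible characters only, so it is an assumption that it matches the original text (which may contain further non-printing characters). Adding the code point explicitly to the character class is just one option for illustration, not necessarily the recommended fix.

```r
library(stringi)

# U+0296 is the "nose" of the face
nose <- "\u0296"

# It belongs to the letter category, so the separator, control,
# symbol, punctuation and mark classes do not match it
stri_detect_regex(nose, "[\\p{Z}\\p{C}\\p{S}\\p{P}\\p{M}]")  # FALSE
stri_detect_regex(nose, "\\p{L}")                            # TRUE

# The face rebuilt from escapes (assumed to mirror the original text)
txt <- "( \u0361\u00b0 \u035c\u0296 \u0361\u00b0) Who's the chick in your profile picture"

# Workaround sketch: list the offending code point explicitly
pattern <- "[\\p{Z}\\p{C}\\p{S}\\p{P}\\p{M}\\u0296]"
cleaned <- stri_replace_all_regex(txt, pattern, " ")

# The nose is gone; only letters, digits and plain spaces remain
stri_detect_fixed(cleaned, nose)
```

Note that \p{P} also replaces ordinary punctuation such as the apostrophe in "Who's", so this pattern is fairly aggressive for general text.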