Closed: discoleo closed this issue 10 months ago.
I am afraid this is too specific to be included in stringi.
Perhaps an easier solution?
x <- "test 123 ↓ęœß→óęœ©œ©ęπœęπœ©œπą"
x <- unique(unlist(stringi::stri_enc_toutf32(x)))
x <- x[x>127]
stringi::stri_enc_fromutf32(x)
## [1] "↓ęœß→ó©πą"
The function stri_enc_toutf32 does indeed perform the conversion directly. Unfortunately, I am not an expert in the stringi package.
Documentation/Examples with stringi::stri_escape_unicode
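For instance, a short documentation-style example might look like this (the input string is my own illustration, not from the original discussion):

stringi::stri_escape_unicode("café → naïve")
## [1] "caf\\u00e9 \\u2192 na\\u00efve"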
Utility
I have frequently encountered this situation, both when trying to extract information from articles on PubMed and from various reports (e.g. lab reports). There are probably sufficiently large communities involved in both kinds of work who have only a rudimentary understanding of string encodings. If the (manual) cleaning is too time-consuming, then the most common options are:
So, you probably mean:
x <- "test 123 ↓ęœß→óęœ©œ©ęπœęπœ©œπą"
x <- unique(unlist(stringi::stri_enc_toutf32(x)))
x <- as.list(x[x>127])
stringi::stri_escape_unicode(stringi::stri_enc_fromutf32(x))
## [1] "\\u2193" "\\u0119" "\\u0153" "\\u00df" "\\u2192" "\\u00f3" "\\u00a9" "\\u03c0" "\\u0105"
Extract Non-ASCII Characters
This feature request is based on a post on the R-Help list:
The problem arose mainly due to embedded non-ASCII characters. The function stri_escape_unicode can help, but scanning a large corpus or a few thousand reports by hand is impractical. A utility function would be very useful.
R Code
The function splits the text into characters and keeps only the unique ones. It is also possible to apply some filtering and keep only non-ASCII characters.
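A minimal sketch along these lines, assuming that "non-ASCII" means code points above 127 (the helper name extract.nonascii is mine, not part of stringi or the original post):

extract.nonascii <- function(x, only.nonascii = TRUE) {
    # split into single characters and keep the unique ones
    chars <- unique(unlist(strsplit(x, "")))
    if (only.nonascii) {
        # utf8ToInt() gives the code point of a single-character string
        chars <- chars[sapply(chars, utf8ToInt) > 127]
    }
    chars
}

x <- "test 123 ↓ęœß→óęœ©œ©ęπœęπœ©œπą"
extract.nonascii(x)
## [1] "↓" "ę" "œ" "ß" "→" "ó" "©" "π" "ą"

This variant uses only base R; the stringi-based snippets above reach the same result by working on UTF-32 code points instead of single-character strings.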