gagolews / stringi

Fast and portable character string processing in R (with the Unicode ICU)
https://stringi.gagolewski.com/

[Feature] Extract Non-ASCII Characters #491

Closed discoleo closed 10 months ago

discoleo commented 1 year ago

Extract Non-ASCII Characters

This feature request is based on a post on the R-Help list:

Split String in regex while Keeping Delimiter: https://stat.ethz.ch/pipermail/r-help/2023-April/477177.html

The problem arose mainly from embedded non-ASCII characters. The function stri_escape_unicode can help, but scanning a large corpus or a few thousand reports by hand may be impractical; a dedicated utility function would be very practical.

R Code

### Identify non-ASCII Characters
extract.nonLetters = function(x, rm.ch = " ,.", escape=TRUE, sort=TRUE, normalize=TRUE) {
    # normalise to NFC first, so combining sequences count as single characters
    if(normalize) x = stringi::stri_trans_nfc(x);
    # split into single characters and deduplicate
    ch = strsplit(x, "", fixed = TRUE);
    ch = unique(unlist(ch));
    if(sort) ch = sort(ch);
    # drop ASCII letters and the characters listed in rm.ch
    pat = paste0("^[a-zA-Z", rm.ch, "]");
    isLetter = grepl(pat, ch);
    ch = ch[ ! isLetter];
    if(escape) ch = stringi::stri_escape_unicode(ch);
    return(ch);
}

The function splits the text into single characters and keeps the unique ones; optional filtering then retains only the non-(ASCII)-letters.
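Those core steps (split, deduplicate, filter) can be sketched in base R alone; the input string below is made up for illustration:

```r
# Sketch of the split/deduplicate/filter steps (sample input is made up):
x  <- "caf\u00e9, test."
ch <- unique(unlist(strsplit(x, "", fixed = TRUE)))  # unique characters
ch <- ch[!grepl("^[a-zA-Z ,.]", ch)]                 # drop letters and " ,."
ch
## [1] "é"
```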

gagolews commented 1 year ago

I am afraid this is too specific to be included in stringi.

Perhaps an easier solution?

x <- "test 123 ↓ęœß→óęœ©œ©ęπœęπœ©œπą"
x <- unique(unlist(stringi::stri_enc_toutf32(x)))
x <- x[x>127]
stringi::stri_enc_fromutf32(x)
## [1] "↓ęœß→ó©πą"
discoleo commented 1 year ago
  1. Function stri_enc_toutf32 indeed performs the conversion directly. Unfortunately, I am not an expert in the stringi package.

    • it would break if a UTF-64 were ever introduced; but then again, this should be an internal implementation detail inside another function, and would therefore be invisible to the end user;
  2. Documentation/Examples with stringi::stri_escape_unicode

    • for the last solution: I would include an extra example showing how to convert back to Unicode code points, or how to escape the characters with stri_escape_unicode;
    • alternatively, the function could have an option to return either the code points or the characters;
  3. Utility: I have frequently encountered this situation, both when extracting information from articles on PubMed and from various reports (e.g. lab reports). The communities involved in both kinds of work are probably quite large, but often have only a rudimentary understanding of string encodings. If manual cleaning is too time-consuming, the most common options are:

    • to exclude those inputs, or to delete all non-ASCII characters (if the user understands a little more about ASCII);
    • both approaches are quite sub-optimal: better tools would come in handy for many casual users.
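For the deletion option mentioned above, stringi already provides the building blocks. A hedged sketch (the sample string is made up, and the Latin-ASCII transliteration is offered here as an alternative, not something requested in the thread):

```r
library(stringi)

x <- "na\u00efve caf\u00e9 test"   # made-up sample with non-ASCII characters

# Delete every non-ASCII character (one of the options listed above);
# the code-point range \u0000-\u007f is exactly ASCII:
stri_replace_all_regex(x, "[^\\u0000-\\u007f]", "")

# Alternatively, transliterate to ASCII where possible, which is often
# less destructive for Latin-script text:
stri_trans_general(x, "Latin-ASCII")
```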
gagolews commented 10 months ago

So, you probably mean:

x <- "test 123 ↓ęœß→óęœ©œ©ęπœęπœ©œπą"
x <- unique(unlist(stringi::stri_enc_toutf32(x)))
x <- as.list(x[x>127])
stringi::stri_escape_unicode(stringi::stri_enc_fromutf32(x))
## [1] "\\u2193" "\\u0119" "\\u0153" "\\u00df" "\\u2192" "\\u00f3" "\\u00a9" "\\u03c0" "\\u0105"
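Putting the two snippets together, a small self-contained helper could look like this (the name `extract_non_ascii` is made up; this is a sketch, not part of stringi):

```r
library(stringi)

# Hypothetical helper combining the snippets above: returns the unique
# non-ASCII characters of `x`, optionally as \uXXXX escapes.
extract_non_ascii <- function(x, escape = FALSE) {
    cp <- unique(unlist(stri_enc_toutf32(x)))  # Unicode code points
    cp <- cp[cp > 127]                         # keep non-ASCII only
    ch <- stri_enc_fromutf32(as.list(cp))      # one character per element
    if (escape) stri_escape_unicode(ch) else ch
}

extract_non_ascii("test 123 \u2193\u0119", escape = TRUE)
## [1] "\\u2193" "\\u0119"
```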