gagolews / stringi

Fast and portable character string processing in R (with the Unicode ICU)
https://stringi.gagolewski.com/

Unexpected Extended ASCII Translation #388

Closed: billdenney closed this issue 4 years ago

billdenney commented 4 years ago

Related to sfirke/janitor#389

When trying to translate extended ASCII to printable ASCII with the stri_enc_toascii() function, I expected the superscript 2 character to convert to "2" (or perhaps preferably "^2"), but it was converted to "\032" instead.

Is there another function or method in stringi that will translate or transliterate extended ASCII to printable ASCII?

"\xb2"
#> [1] "²"
stringi::stri_enc_toascii("\xb2")
#> [1] "\032"

Created on 2020-07-21 by the reprex package (v0.3.0)

gagolews commented 4 years ago

Sure:

> stringi::stri_trans_general("²", "nfkd;nfc;Latin-ASCII")
[1] "2"
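
The compound identifier is applied left to right, so one way to see what each stage contributes is to run the stages one at a time. The following is only a sketch: the sample string is made up for illustration, and the first two stages use stri_trans_nfkd() and stri_trans_nfc() on the assumption that they perform the same normalizations as the "nfkd" and "nfc" parts of the identifier.

```r
library(stringi)

# Hypothetical sample input: "café 10 m²"
x <- "caf\u00e9 10 m\u00b2"

# Stage 1: compatibility decomposition. The superscript two becomes a plain
# "2", and the accented letter splits into "e" plus a combining acute accent.
s1 <- stri_trans_nfkd(x)

# Stage 2: canonical composition. The "e" and the combining accent are
# recombined into one code point; the "2" stays as it is.
s2 <- stri_trans_nfc(s1)

# Stage 3: Latin-ASCII maps the remaining accented Latin letter to ASCII.
s3 <- stri_trans_general(s2, "Latin-ASCII")

# Compare the staged result with the single compound call
one_shot <- stri_trans_general(x, "nfkd;nfc;Latin-ASCII")
print(identical(s3, one_shot))

# Check that nothing outside ASCII is left
print(stri_enc_isascii(s3))
```

Running the stages separately is mainly useful for debugging; in ordinary code the single compound call is shorter.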

Fun fact: the ASCII \032 (an octal escape, i.e. decimal 26 or hex 0x1A) is the SUBSTITUTE CHARACTER, a kind of NA, but for individual code points.
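
Since the substitute character marks each spot where stri_enc_toascii() could not keep a code point, it can be used to find the strings that lost information. A minimal sketch, assuming the inputs are UTF-8 strings (the example vector is hypothetical):

```r
library(stringi)

# "\032" is an octal escape; utf8ToInt() shows the decimal code point
sub_char <- "\032"
print(utf8ToInt(sub_char))

# Hypothetical inputs: two contain a non-ASCII code point, one is plain ASCII
x <- c("m\u00b2", "plain", "caf\u00e9")

# Every non-ASCII code point is replaced with the substitute character
ascii <- stri_enc_toascii(x)

# Flag the elements in which at least one code point was substituted
lost <- stri_detect_fixed(ascii, sub_char)
print(lost)
```

Flagged elements are candidates for transliteration with stri_trans_general() rather than a plain encoding conversion.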

billdenney commented 4 years ago

Thanks! I read for a bit about nfd, nfc, nfkd, and nfkc, and I'm not sure that I understand more, but I do understand that these appear to be what is needed for this case.

gagolews commented 4 years ago

Unicode Standard Annex #15 (Unicode Normalization Forms) gives a good overview, IMO: https://www.unicode.org/reports/tr15/
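
For a quick feel of how the four forms differ, it can help to compare code point counts on a few characters. This is a rough illustration only, using the stri_trans_nf*() family from stringi; the three sample characters are arbitrary picks.

```r
library(stringi)

# Arbitrary samples: precomposed e-acute, superscript two, "fi" ligature
x <- c("\u00e9", "\u00b2", "\ufb01")

forms <- list(
  NFD  = stri_trans_nfd(x),   # canonical decomposition
  NFC  = stri_trans_nfc(x),   # canonical decomposition, then composition
  NFKD = stri_trans_nfkd(x),  # compatibility decomposition
  NFKC = stri_trans_nfkc(x)   # compatibility decomposition, then composition
)

# Code points per input (rows) under each form (columns)
counts <- sapply(forms, stri_length)
print(counts)

# Only the compatibility (K) forms turn the superscript into a plain digit
print(forms$NFC[2] == "2")
print(forms$NFKC[2] == "2")
```

The canonical forms (NFD, NFC) only split or merge accents, while the compatibility forms (NFKD, NFKC) also replace characters such as superscripts and ligatures with their plain equivalents, which is why the answer above starts with nfkd.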

billdenney commented 4 years ago

Thanks for the pointer to the overview of the normalizers and for the info about character \032.