gaborcsardi / rencfaq

The R Encoding FAQ
Creative Commons Zero v1.0 Universal

Ref. for codepoints #4

Open MichaelChirico opened 2 years ago

MichaelChirico commented 2 years ago

Tried looking into this open question in the README:

Here are some handy ways to find the Unicode code points for an existing string:

    Copy and paste into the Unicode character inspector.
    Do we have other suggestions?

Judging by this Stack Overflow answer, it looks like "no", unless that logic were put into a common package we could reference:

https://stackoverflow.com/a/6240184/3576984

gaborcsardi commented 2 years ago

Base R works with code points, so this works currently:

x <- "\U0001f477\u200d\u2642\ufe0f"
# https://apps.timwhitlock.info/unicode/inspect?s=%F0%9F%91%B7%E2%80%8D%E2%99%82%EF%B8%8F
x
#> [1] "👷‍♂️"

utf8::utf8_print(strsplit(x, "")[[1]], utf8 = FALSE)
#> [1] "\U0001f477" "\u200d"     "\u2642"     "\ufe0f"

Of course, it would be more correct to work with graphemes, so if base R ever switches strsplit() to that, this might not work any more.
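For reference, a base-R-only variant that does not depend on how strsplit() splits: utf8ToInt() is defined in terms of code points, so it returns them directly as integers. This is a minimal sketch and not something proposed in the thread itself; the `U+` formatting is just one possible presentation.

```r
# Same string as above: construction worker + ZWJ + male sign + variation selector
x <- "\U0001f477\u200d\u2642\ufe0f"

# utf8ToInt() takes a single string and returns its code points as integers
cps <- utf8ToInt(x)

# Format them in the usual U+XXXX notation
sprintf("U+%04X", cps)
#> [1] "U+1F477" "U+200D"  "U+2642"  "U+FE0F"
```

Note that utf8ToInt() only accepts a character vector of length one, so a longer vector needs lapply().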

By the way, cli now also has some handy functions for UTF-8 strings; for example, it handles graphemes properly:

cli::utf8_nchar(x)
#> [1] 1

nchar(x)
#> [1] 4

MichaelChirico commented 2 years ago

I think utf8::utf8_print() is what I was after with "putting that logic into a package"; let's add a reference to it in the doc there.

It's a good tool for this use case:

OK, I've copy-pasted "👍" into a string in my package for users. Now it's time to submit to CRAN or otherwise run R CMD check, and I'm getting dinged for the non-ASCII characters -- how do I convert it to a \U string?
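For that use case, the escape string can also be assembled in base R. The helper below is only a sketch: `to_escapes` is a hypothetical name, not a function from any package mentioned here, and it assumes the input is a single string. It leaves ASCII characters untouched, so quotes and backslashes in the input would still need escaping by hand before pasting the result into source code.

```r
# Hypothetical helper: turn the non-ASCII characters of one string into
# \uXXXX / \UXXXXXXXX escapes that can be pasted into an R source file.
to_escapes <- function(s) {
  cps <- utf8ToInt(enc2utf8(s))
  esc <- ifelse(
    cps < 128L,
    intToUtf8(cps, multiple = TRUE),   # keep ASCII characters as they are
    ifelse(
      cps <= 0xFFFF,
      sprintf("\\u%04x", cps),         # code points up to U+FFFF: 4 hex digits
      sprintf("\\U%08x", cps)          # larger code points: 8 hex digits
    )
  )
  paste(esc, collapse = "")
}

cat(to_escapes("\U0001f44d"), "\n")
#> \U0001f44d
cat(to_escapes("\U0001f477\u200d\u2642\ufe0f"), "\n")
#> \U0001f477\u200d\u2642\ufe0f
```

The output is printed with cat() because print() would show the backslashes doubled.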

gaborcsardi commented 2 years ago

Here is a base R solution:

Sys.setlocale("LC_ALL", "C")
#> [1] "C/C/C/C/C/en_US.UTF-8"

x
#> [1] "<U+0001F477><U+200D><U+2642><U+FE0F>"

It will mess up the current session of course...

MichaelChirico commented 2 years ago

Right... still useful to mention. For the use case mentioned, we can just open up a new process & run it there quickly. Nice!

gaborcsardi commented 2 years ago

Yeah, maybe there is a way to restore the locale, but withr::with_locale() refuses to change LC_ALL, and there might be a reason for that:

❯ withr::with_locale(c(LC_ALL = "C"), TRUE)
Error: Setting LC_ALL category not implemented.

callr can run it in another session:

❯ callr::r(function() { Sys.setlocale("LC_ALL", "C"); format("👷‍♂️") })
[1] "<U+0001F477><U+200D><U+2642><U+FE0F>"

Maybe it would be enough to change another category.

gaborcsardi commented 2 years ago

Oh, yeah, here it is:

withr::with_locale(c(LC_CTYPE = "C"), format("👷‍♂️"))
#> [1] "<U+0001F477><U+200D><U+2642><U+FE0F>"
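If withr is not available, roughly the same thing can be done in base R by saving and restoring LC_CTYPE by hand. This is a sketch under the assumption, taken from the result above, that switching only LC_CTYPE is enough; `format_in_c_locale` is a hypothetical name, and the exact `<U+XXXX>` output may depend on the platform and R version.

```r
# Hypothetical helper: format a string with LC_CTYPE temporarily set to "C",
# then restore the previous setting even if format() fails.
format_in_c_locale <- function(s) {
  old <- Sys.getlocale("LC_CTYPE")
  on.exit(Sys.setlocale("LC_CTYPE", old), add = TRUE)
  Sys.setlocale("LC_CTYPE", "C")
  format(s)
}

before <- Sys.getlocale("LC_CTYPE")
format_in_c_locale("\U0001f477\u200d\u2642\ufe0f")

# The session's LC_CTYPE is back to what it was
identical(Sys.getlocale("LC_CTYPE"), before)
```

Unlike Sys.setlocale("LC_ALL", "C"), this leaves the other locale categories alone, so it should not mess up the rest of the session.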