MichaelChirico / r-bugs

A ⚠️read-only⚠️mirror of https://bugs.r-project.org/

[BUGZILLA #17639] parse() doesn't honor unicode in NFD normalization #6813

Open MichaelChirico opened 4 years ago

MichaelChirico commented 4 years ago
r$> cat("\u65\u301", "\ue9")
é é
r$> parse(text = "\ue9 <- 1")
expression(é <- 1)

r$> parse(text = "`\u65\u301` <- 1")
expression(`é` <- 1)

r$> parse(text = "\u65\u301 <- 1")
Error in parse(text = "é <- 1") : <text>:1:2: unexpected input
1: e\xcc
     ^

In the snippet above, \u65\u301 and \ue9 both represent é, in different Unicode normalization forms (NFD and NFC, respectively). However, parse() only accepts the NFC form as an unquoted name.

The NFD form is accepted inside a quoted string, though.

r$> parse(text = "'\u65\u301'")
expression('é')


MichaelChirico commented 4 years ago

The root cause of this is that the "\u65\u301" variant is actually two characters, an "e" and a diacritic "´" (nchar() returns 2). The diacritic is not an alphanumeric character, hence the test in isValidName() (in gram.y) fails.

isValidName() uses iswalnum() and related functions, but the decomposed é is not a single wide character in that sense, so it first checks the "e" ("\u65") and then the "\u301" (the combining acute accent).
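
To make the two-character point concrete, here is a small base-R sketch comparing the code points of the two forms. The variable names nfd and nfc are only illustrative; the functions used (nchar(), utf8ToInt(), identical()) are all base R.

```r
nfd <- "\u65\u301"  # "e" followed by U+0301 (combining acute accent)
nfc <- "\ue9"       # precomposed U+00E9

nchar(nfd)           # 2: the decomposed form is two characters
nchar(nfc)           # 1
utf8ToInt(nfd)       # code points 101 (0x65) and 769 (0x301)
utf8ToInt(nfc)       # code point 233 (0xE9)
identical(nfd, nfc)  # FALSE: the strings render alike but differ
```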

I don't think this is fixable unless we insert code to explicitly change the normalization, and I am not sure we'd want to do that at the parser level.

A workaround is to normalize in user space (package utf8 has code for that).
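
A minimal sketch of that workaround, assuming the utf8 package is installed and the session runs in a UTF-8 locale. parse_nfc() is a hypothetical helper name, not part of R or of utf8; it relies on utf8::utf8_normalize(), which composes text to NFC, before calling parse().

```r
# Hypothetical helper: NFC-normalize source text before handing it to parse().
parse_nfc <- function(text, ...) {
  if (!requireNamespace("utf8", quietly = TRUE)) {
    stop("package 'utf8' is required for normalization")
  }
  parse(text = utf8::utf8_normalize(text), ...)
}

# Usage is guarded so the sketch is a no-op where its assumptions do not hold.
if (requireNamespace("utf8", quietly = TRUE) && isTRUE(l10n_info()[["UTF-8"]])) {
  expr <- parse_nfc("\u65\u301 <- 1")
  print(expr)
}
```

Note that this normalizes the whole input, including the contents of string literals, which may or may not be what a caller wants.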

