[BUGZILLA #17639] parse() doesn't honor unicode in NFD normalization

MichaelChirico / r-bugs

A ⚠️ read-only ⚠️ mirror of https://bugs.r-project.org/


r$> cat("\u65\u301", "\ue9")
é é
r$> parse(text = "\ue9 <- 1")
expression(é <- 1)

r$> parse(text = "`\u65\u301` <- 1")
expression(`é` <- 1)

r$> parse(text = "\u65\u301 <- 1")
Error in parse(text = "é <- 1") : <text>:1:2: unexpected input
1: e\xcc
     ^

See the code snippet above. \u65\u301 and \ue9 are the same character, é, in different normalization forms (NFD and NFC, respectively). However, parse() only honors the NFC form when the name is not quoted with backticks.

The NFD form is accepted inside a quoted string, though.

r$> parse(text = "'\u65\u301'")
expression('é')

METADATA

Bug author - Randy Lai
Creation time - 2019-10-24 06:56:26 UTC
Bugzilla link
Status - UNCONFIRMED
Alias - None
Component - I/O
Version - R 3.5.0
Hardware - Other Mac OS X v10.6
Importance - P5 minor
Assignee - R-core
URL -
Modification time - 2019-10-24 11:25 UTC

The root cause is that the "\u65\u301" variant is actually two characters: an "e" followed by the combining acute accent U+0301 (nchar() returns 2). The combining mark is not an alphanumeric character, hence the test in isValidName() (in gram.y) fails.
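To make the two-character point concrete, here is a small sketch that inspects both forms with base R only. The variable names nfd and nfc are just illustrative labels, not anything from the report.

```r
# Two spellings of "é": decomposed (NFD) and precomposed (NFC).
nfd <- "e\u0301"  # "e" followed by U+0301 COMBINING ACUTE ACCENT
nfc <- "\u00e9"   # single code point U+00E9

# The decomposed form is two characters, the precomposed form is one.
nchar(nfd)
nchar(nfc)

# Code points behind each form: 101 and 769 for NFD, 233 for NFC.
utf8ToInt(nfd)
utf8ToInt(nfc)

# They render the same but are not the same string.
identical(nfd, nfc)
```

Because the two strings differ at the code point level, any check that walks the name one character at a time sees the accent as a separate, non-alphanumeric character.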

isValidName() uses iswalnum() & friends, but the decomposed é is not a single wide character, so it checks first the "e" ("\u65") and then, separately, the "\u301" (combining acute accent).

I don't think this is fixable unless we insert code to explicitly change the normalization, and I am not sure we'd want to do that at the parser level.

A workaround is to normalize the text to NFC in user space before calling parse() (package utf8 has code for that).
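A minimal sketch of that workaround, assuming the utf8 package is installed and using its utf8_normalize() function, which composes text to NFC. The helper name normalize_code() is hypothetical, introduced here only for illustration, and the final parse() call is guarded because parsing non-ASCII names may depend on running in a UTF-8 locale.

```r
# Hypothetical helper: compose source text to NFC before parsing.
# Assumes the 'utf8' package is available.
normalize_code <- function(text) {
  if (!requireNamespace("utf8", quietly = TRUE)) {
    stop("package 'utf8' is required for NFC normalization")
  }
  utf8::utf8_normalize(text)
}

nfd_code <- "e\u0301 <- 1"  # the form that parse() rejects in the report

if (requireNamespace("utf8", quietly = TRUE)) {
  nfc_code <- normalize_code(nfd_code)

  # The identifier is now a single code point, so only a UTF-8 locale
  # is needed for the parser to accept it.
  if (isTRUE(l10n_info()[["UTF-8"]])) {
    expr <- parse(text = nfc_code)
    print(expr)
  }
}
```

Normalizing the whole source text also composes characters inside string literals, which may or may not be wanted; quoting the name with backticks, as in the original report, avoids the problem without changing the text.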

METADATA

Comment author - Peter Dalgaard
Timestamp - 2019-10-24 11:25:36 UTC

[BUGZILLA #17639] parse() doesn't honor unicode in NFD normalization #6813
