facebook / jsx

The JSX specification is an XML-like syntax extension to ECMAScript.
http://facebook.github.io/jsx/

Behavior of `&#0;` and lone surrogates unicode entities #146

Open nicolo-ribaudo opened 2 years ago

nicolo-ribaudo commented 2 years ago

Ref https://github.com/babel/babel/pull/14327#discussion_r820274585 @wooorm

This might be another one to spec in JSX btw, because there’s likely divergence between implementations. There are a bunch of different things not allowed by XML/HTML/markdown (such as \0 or lone surrogates)

It looks like currently Babel and TS behave the same (they translate `&#0;` to `\0` and `&#xD800;` to `\uD800`). I didn't test other parsers.

Huxpro commented 2 years ago

Hmm. Could any of you help me understand this issue better?

My understanding is that the current JSX spec allows `&#0;`, and that both Babel and TS are conforming here. Was the concern that `&#0;` is actually NOT allowed by the XML/HTML/markdown specs, and that implementations such as MDX behave differently at the moment?

wooorm commented 2 years ago

For security reasons, several (numeric) character references don’t turn into their corresponding code point according to HTML. They are replaced with U+FFFD (�) or even a different character. At a high level, it’s described in: https://html.spec.whatwg.org/multipage/syntax.html#character-references.

> The numeric character reference forms described above are allowed to reference any code point excluding U+000D CR, noncharacters, and controls other than ASCII whitespace.

More concretely, see: https://html.spec.whatwg.org/multipage/parsing.html#numeric-character-reference-end-state.

Note that there are even some C1 Unicode whitespace characters (which would thus be disallowed) that have a meaning in the Windows 1252 encoding, and which HTML maps to those characters. E.g., U+0080 is a “padding character”, but a € in Windows 1252. So HTML turns 0x80 into €. I don’t particularly recommend this part. But I definitely see value in prohibiting \0, whitespace, lone surrogates, and noncharacters, just like anything above 0x10FFFF is prohibited.
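
To make the divergence concrete, here is a rough TypeScript sketch that contrasts the two behaviors discussed in this thread. It is an illustration under assumptions, not code taken from Babel, TS, or any HTML parser: the names `htmlNumericReference`, `passThroughReference`, and `C1_REMAP` are hypothetical, the remapping table is only a small excerpt of the one in the HTML spec, and the input is assumed to be a non-negative integer that has already been parsed out of the reference.

```typescript
// Excerpt (not the full table) of the Windows 1252 remapping that the HTML
// spec applies to numeric character references in the C1 range.
const C1_REMAP: ReadonlyMap<number, number> = new Map([
  [0x80, 0x20ac], // EURO SIGN
  [0x85, 0x2026], // HORIZONTAL ELLIPSIS
  [0x99, 0x2122], // TRADE MARK SIGN
]);

// HTML-like handling: NULL, values above 0x10FFFF, and surrogates are
// replaced with U+FFFD; listed C1 values are remapped; the rest pass through.
// (HTML also reports parse errors for noncharacters and other controls
// without changing them; this sketch does not model parse errors.)
function htmlNumericReference(codePoint: number): string {
  if (
    codePoint === 0 ||
    codePoint > 0x10ffff ||
    (codePoint >= 0xd800 && codePoint <= 0xdfff)
  ) {
    return "\uFFFD";
  }
  const remapped = C1_REMAP.get(codePoint);
  return String.fromCodePoint(remapped ?? codePoint);
}

// Pass-through handling, modeling what is reported above for Babel and TS:
// the referenced code point is used as-is, including NULL and lone surrogates.
function passThroughReference(codePoint: number): string {
  return String.fromCodePoint(codePoint);
}

// Compare the two for the references mentioned in this issue.
for (const cp of [0x0, 0xd800, 0x80]) {
  console.log(
    cp.toString(16),
    JSON.stringify(htmlNumericReference(cp)),
    JSON.stringify(passThroughReference(cp)),
  );
}
```

Note that `String.fromCodePoint` accepts surrogate code points but throws a `RangeError` above 0x10FFFF, so the pass-through variant would need its own range check for that case.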