Closed Huxpro closed 2 years ago
@RyanCavanaugh I think we talked about Babel's and TypeScript's handling of HTML entities a few months ago, but I cannot find where that discussion happened.
I also intentionally left the set of supported HTML entities implementation-defined to allow either HTML4 or HTML5 set but this is also open to discuss.
Do we know if there is any tool using HTML5 entities? As far as I know, the main ones use HTML4. It would be good to reduce implementation-definedness rather than to explicitly allowing it, to guarantee better code portability (this is something that ECMAScript itself tries to do as much as possible).
@nicolo-ribaudo it was over at #132
I also intentionally left the set of supported HTML entities implementation-defined to allow either HTML4 or HTML5 set but this is also open to discuss. This may be seen as a breaking change in some regard.
Given that one of the main concerns against TemplateLiteral support was the, shall we say, "conservative" worry that that non-breaking change would be tumultuous for this hyper-stable spec, I think it would be a mistake to kick off the documentation process by introducing a breaking change right out of the gate. If we can confirm that every current major existing implementation respects the same subset of HTML entities, as frustrating and inelegant as that might be, I think we should just responsibly document that and avoid the temptation of adding language that could create new future conflicts between implementations. For all its current faults, I think one thing that most users don't have to worry about is major differences in what their code means parser to parser. Now, If on the other hand we find that they already deviate, then it might create a valid excuse for trying to define some potentially "new" or "undefined" behavior since it can be argued that the behavior is already inconsistent. Even then though, I think it would be better to simply document each implementation in the spec as a proper description of the current reality. To some extent I think the right way to approach this process is like having to document "JSX quirks-mode" and less like documenting "JSX 1.0", since we find ourselves in a somewhat similar situation where we have so much existing code that uses the existing implementations.
@nicolo-ribaudo @tolmasky I definitely share the same concern, but I was intended to open up this to encourage discussion so we can hear each other!
I wanna share one (and probably THE only one) motivating factor that I had on expanding named character reference supports from HTML4 to HTML5: Because of how numeric (decimal or hex) character reference are basically just unicode code points, both Babel and TypeScript already support referencing HTML5 entities in its numeric format.
Taking <div>⊢ ⊨</div>
as an example, even though ⊢
are not recognized and left as-is, ⊨
will be correctly transformed to \u22A8
by both Babel and TS.
In another word, we only need to pay an extremely minor price of breaking super rare code (like having observed that it doesn't work and insist to write HTML5 named character reference in JSX??), to be able to complete the picture of having full support of HTML character reference (or entities), which, by itself, unfortunately, has a much much larger risk to be dropped completely.
This sort of "completing HTML route" way of thinking is also how it may differ from the rationale against template literal which is more of "starting the JS route" way of thinking.
In another word, we only need to pay an extremely minor price of breaking super rare code (like having observed that it doesn't work and insist to write HTML5 named character reference in JSX??), to be able to complete the picture of having full support of HTML character reference (or entities), which, by itself, unfortunately, has a much much larger risk to be dropped completely.
Yes, this is the part that sucks about documenting existing behavior. The reality is we don't really know who is using these sequences. To give a plausible example, someone could have made a page saying "check out the cool new HTML entities in HTML5:" and listed ⊢
and so forth, and all of a sudden they would all break. A key problem here is precisely that it would have not been surprising for these entities to not work when they originally made the page because they wouldn't have expected them to, since they don't work in JS. This is exacerbated by three additional factors:
I also think that specifically mentioning HTML5 entities in the spec, even as "optional", may have the unintended consequence of putting existing implementations in an awkward position where they look like they are "choosing" the clearly worse option of only implementing HTML4 entities, when in reality they may just be stuck with it for legacy reasons. It may thus in effect simply kick the problem over to the implementors who now have to deal with both users who are upset that HTML5 entities don't work, or are upset that such a change was added, while the specification gets to be "correct" either way.
How the tools deal with unrecognized entities? It seems they will become plain "&entity;" string which means add/remove entities are theoretically breaking change.
and all of a sudden they would all break.
Actually not. The page author needs to upgrade their babel/typescript version and re-compile the JSX to get the breaking behavior. It doesn't like making a breaking change on the Web standard itself.
and all of a sudden they would all break.
Actually not. The page author needs to upgrade their babel/typescript version and re-compile the JSX to get the breaking behavior. It doesn't like making a breaking change on the Web standard itself.
Sure, but that's true of all changes in libraries ;) (and I can't help but feel like the bar for "breaking change" was lower in that other thread...) We could remove HTML entities completely, switch angle brackets for square brackets, and require tag names to be capitalized, and it would still be the case that no one would experience any breaking behavior as long as they never upgraded anything.
But I think the implication here is that this change could very well be consumed thinking it is a harmless bug-fix (especially since it is being inserted into the existing "version"), and have a change in behavior that would arguably be significant. Again, this is why I think that it would force Babel for example to implement this in the next semver-major version. Even then, it is quite annoying to have to call out this sort of subtle but significant behavior change in upgrade notes. I could definitely see some teams making the decision that this is a minor
(or even patch
?) change and thus it being taken in "automatically" just by semver resolution. If JSX itself were semantic versioned, we could at least enforce the semver-majorness of the change, but precisely because we are doing the "pre-changes and pre-versioning work", I think we should avoid changes.
I looked at how everyone does its thing, with:
const x = <y z="alpha & bravo charlie ©cat; delta echo © foxtrot ≠ golf 𝌆 hotel india ⊢ juliett kilo &lima; mike & november">
alpha & bravo
charlie ©cat; delta
echo © foxtrot ≠ golf 𝌆 hotel
india ⊢ juliett
kilo &lima; mike & november
</y>
// Note:
// - the above uses the same value in an attribute and as text, so that the algorithm can be observed in both places.
// - `&` without `;` turns into `&` by the HTML algorithm, not by XML
// - `©` turns into `©` in HTML
// - echo..hotel are all normal reference
// - `⊢` is an “HTML 5” named character reference, it’s not in HTML 4.
// - `&lima;` is unknown
// - `&` is just a & according to HTML
alpha & bravo charlie ©cat; delta echo \xA9 foxtrot \u2260 golf \uD834\uDF06 hotel india ⊢ juliett kilo &lima; mike & november
alpha & bravo charlie ©cat; delta echo \u00A9 foxtrot \u2260 golf \uD834\uDF06 hotel india ⊢ juliett kilo &lima; mike & november
alpha & bravo charlie ©cat; delta echo \xa9 foxtrot ≠golf 𝌆 hotel india ⊢ juliett kilo &lima; mike & november
esbuild:
$ npx esbuild --loader=jsx <<< 'const x = <y z="alpha & bravo charlie ©cat; delta echo © foxtrot ≠ golf 𝌆 hotel india ⊢ juliett kilo &lima; mike & november">
alpha & bravo
charlie ©cat; delta
echo © foxtrot ≠ golf 𝌆 hotel
india ⊢ juliett
kilo &lima; mike & november
</y>'
alpha & bravo charlie ©cat; delta echo \xA9 foxtrot \u2260 golf \u{1D306} hotel india ⊢ juliett kilo &lima; mike & november
To recap, that means that everybody works exactly the same. No “HTML 5” character references. No “HTML” algorithm. No errors.
Representing the TypeScript implementation side of this, I'll just echo this from earlier in the thread:
I also think that specifically mentioning HTML5 entities in the spec, even as "optional", may have the unintended consequence of putting existing implementations in an awkward position where they look like they are "choosing" the clearly worse option of only implementing HTML4 entities, when in reality they may just be stuck with it for legacy reasons. It may thus in effect simply kick the problem over to the implementors who now have to deal with both users who are upset that HTML5 entities don't work, or are upset that such a change was added, while the specification gets to be "correct" either way.
Cross-tool compatibility really should be the highest-order bit here. If your app renders properly in esbuild, you should be able to switch to Babel, or switch from tsc to SWC without worrying that users will start seeing unrendered &foo;
in some corner case of your application.
Even if tools add configuration settings to control which entity set is used, it's not immediately obvious which entities are HTML4, HTML5, or even newer, so you might not realize until you get an embarrassing bug report that you had that configuration wrong in the first place.
Thanks everyone for making inputs. Since we've reached a clear consensus I've updated this PR to only allow HTML4 named entities.
I guess the only remaining question is https://github.com/facebook/jsx/pull/136#discussion_r814680661 (i.e. is numeral entity allowed to be greater than 6 hex digits)
I'd recommend going for the strictest restriction: at the parser. Say that they can only be 6 or 7 characters. Because they can. Instead of 100. Which they can't. From markdown experience specifically, if you don't provide maximums, things can go on forever, meaning stuff can be (potentially) DDOS'ed or at least slow down! But I'd give more weight to Babel/TS folks' opinions!
I'd recommend going for the strictest restriction: at the parser. Say that they can only be 6 or 7 characters. Because they can. Instead of 100. Which they can't. From markdown experience specifically, if you don't provide maximums, things can go on forever, meaning stuff can be (potentially) DDOS'ed or at least slow down! But I'd give more weight to Babel/TS folks' opinions!
@wooorm,
🤔 I doubt allowing unlimited length of hex digits here will hurt parser performance. This should be as trivial as lexing a numeric literal which doesn't involve any ambiguity and lookahead.
Since HTML doesn't limit this and TS seems to already does this, and since @nicolo-ribaudo has expressed willingness to make change from the Babel's side, I would propose that we keep this unlimited.
Size in browsers: TL;DR; browsers support giant character references:
code = '&'.charCodeAt(0).toString(16)
checks = [2**4, 2**8, 2**16, 2**24];
checks.forEach(d => {
document.body.innerHTML = '&#x' + code.padStart(d, '0') + ';';
console.log(document.body.innerHTML)
})
&
&
&
&
(tested in chrome, firefox, safari)
To test long character references in JSX implementations, I used:
<>
<div id="&" />
<div id="&" />
<div id="&" />
</>
⚠️ Note: for some reason on Safari the super long character reference does not display. They’re the `24
,
28, and
216` generated references from the first code**
Babel: đź’Ą
TypeScript: Does support long character references
SWC: đź’Ą
esbuild:
npx esbuild --loader=jsx <<< '<>
<div id="&" />
<div id="&" />
<div id="&" />
</>'
⚠️ Note: for some reason on Safari the super long character reference does not display. They’re the `24
,
28, and
216` generated references from the first code**
Yields:
/* @__PURE__ */ React.createElement(React.Fragment, null, /* @__PURE__ */ React.createElement("div", {
id: "&"
}), /* @__PURE__ */ React.createElement("div", {
id: "&"
}), /* @__PURE__ */ React.createElement("div", {
id: "&"
}));
TL;DR: 50%/50%
I also checked lower (x
) and upper (X
) (note that browsers support both):
const x = <>
<y id="&" />
<y id="&" />
</>
Babel, TypeScript, SWC: No X
esbuild:
npx esbuild --loader=jsx <<< 'const x = <>
<y id="&" />
<y id="&" />
</>'
const x = /* @__PURE__ */ React.createElement(React.Fragment, null, /* @__PURE__ */ React.createElement("y", {
id: "&"
}), /* @__PURE__ */ React.createElement("y", {
id: "&"
}));
(no X
either)
TL;DR: Nobody supports X
, that does not align with HTML or XML though!:
document.body.innerHTML = '&'; document.body.innerHTML // "&"
My opinion:
X
. It’s pretty arbitrary and likely won’t matter much which way that goes. In such cases I’d prefer to match XML/HTML.Note: for some reason on Safari the super long character reference does not display. They’re the
2**4
,2**8
, and2**16
generated references from the first code
Speaking as an implementor, if the JSX spec didn't specify a limit, but we capped it at something reasonable like 16 for whatever reason, I'd be super OK with that happening. Spec deviations for implementation performance (or even security) reasons are extremely common in practice, e.g. https://stackoverflow.com/questions/4582012/maximum-number-of-parameters-in-function-declaration
I wanna echo @RyanCavanaugh that specifications tend to be more theoretical and more generalized than implementations. A turning machine has unlimited memory in theory which we all know it's not possible. In this particular case, a implementation may use uint8
for its token payload type and thus hit the limitation of 256 chars and that's fine if it's not an important semantics divergence in practice.
But I do wanna thank @wooorm for this detailed investigation and call out this potential concern ❤️
In summary, I think we are good to merge this. I'll file an issue to Babel regarding this change. Please feel free to file issue to other implementations like MDX.
Thanks for everyone involved in the discussion!
Summary
Let's be faithful to the de-facto and document the HTML entity behaviors to the spec. Note that this is not about whether we should "drop this semantics or not", but about documenting the current behaviors that everyone has been living with for years.
The Proposed Normative Change
I'm not aware of any practices specifying such transpiler/transform semantics in ECMA-262 so this is a really interesting attempt 🙂 So I ended up extending
Static Semantics: SV
which is the smartest way I can find to hack the semantics into the ECMA-262 spec. I think this should work and should be accurate enough. I'm curious on how implementors think about it though.I also intentionally left the set of supported HTML entities implementation-defined to allow either HTML4 or HTML5 set. This may be seen as a breaking change in some regard and this is open to discuss here.We've reached consensus that only HTML4 entities are allowed.This commit also close #133 by using
::
for characters which are supposed to be lexical grammars.Close #126 Close #4
Test Plan
open
index.html
and proof-read the spec ;)Preview at my fork https://huxpro.github.io/jsx