Closed touzoku closed 7 years ago
This is an interesting case, thanks.
Although JSX does not have explicit ASI semantics for its own operators, the normal ECMAScript grammar still applies, which means ASI should kick in only when a syntax error would occur without it.
The case above with < is similar to the / ambiguity at the beginning of a line, and so I believe this behaviour is actually correct. Consider the following example:
var foo = {};
/TodoList/
vs
var foo = {}
/TodoList/
In both cases, whether with < or with /, only an explicit semicolon can turn a character that is normally an operator into the beginning of an expression, which causes the following content to be parsed differently: division vs. regex, or less-than vs. start of an element.
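The difference can be checked in any plain JavaScript engine, with no JSX parser involved. The sketch below is only an illustration and not part of acorn or acorn-jsx: parses and run are hypothetical helpers built on the standard Function constructor, and the variables a and g in the second pair of snippets are made up so that both forms are valid but mean different things.

```javascript
// Hypothetical helper: true if the snippet compiles as a function body.
function parses(source) {
  try {
    new Function(source); // compiles only, never runs the snippet
    return true;
  } catch (err) {
    if (err instanceof SyntaxError) return false;
    throw err;
  }
}

// Hypothetical helper: runs the snippet and returns the final value of foo.
function run(source) {
  return new Function(source + "\nreturn foo;")();
}

// The two snippets from the example above.
const withSemicolon = parses("var foo = {};\n/TodoList/");   // regex literal statement
const withoutSemicolon = parses("var foo = {}\n/TodoList/"); // {} / TodoList / <missing operand>

// A variant where both forms compile but are parsed differently.
const asDivision = run("var a = 1, g = 2, foo = 4\n/a/g");   // foo = 4 / a / g
const asRegex = run("var a = 1, g = 2, foo = 4;\n/a/g");     // foo = 4, then a regex literal

console.log(withSemicolon, withoutSemicolon, asDivision, asRegex);
// prints: true false 2 4
```

In the second pair, no syntax error occurs without the semicolon, so ASI has nothing to repair and the slash is simply read as division.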
Does this make sense?
The aforementioned grammar is only ambiguous for the L0 parser (aka tokenizer). For the G1 parser (aka recognizer) it is an easy job to produce a valid parse tree. The grammar is only ambiguous when the parse could produce two or more parse trees. In both cases above there will be a syntax error, i.e. zero parse trees.
The example below shows a truly ambiguous grammar that even a G1 parser will fail to recognize without additional semantics. This is from C++:
x * y(z);
Here, without knowing the semantics, you don't know whether this declares y as a pointer to type x (initialized with z), or multiplies x by y(z).
The whole situation above would not exist if acorn did not use a simple rec-descent algorithm, but used an Earley algorithm instead. Though V8 engine is not using it either (hence your regexp ambiguity example), there is no limitation for JSX parsers not to use some modern parsing algorithms as V8 does not understand JSX anyways.
In order to fix the issue, acorn-jsx must always parse < as tt.jsxTagStart without switching the context, and then patch the parseExprOp function to perform lookahead over the following tokens, up until the point where the ambiguity is resolved. I'll try to fix it myself, because I have real customer code where it fails, but this is going to be a messy rewrite...
By the way, it would be nice to add the original acorn test fixtures to acorn-jsx, since there is a regression on the following tests of original acorn due to improper opening angle bracket handling:
<!-- HTML comment
<!--\n;
;\n--> HTML comment
I'm not talking about easy vs. hard, but about consistency with the existing ECMAScript grammar, hence the example above with another operator sign vs. expression.
x * y(z);
Yes, I'm well aware of C++'s shortfalls, and ECMAScript intentionally avoided the same mistakes.
The whole situation above would not exist if acorn did not use a simple rec-descent algorithm, but used an Earley algorithm instead. Though V8 engine is not using it either (hence your regexp ambiguity example), there is no limitation for JSX parsers not to use some modern parsing algorithms as V8 does not understand JSX anyways.
This is not about the algorithm used and should never be engine-dependent (hence mentioning V8 doesn't make sense, as all the engines implement it the way ECMAScript is designed), but about the underlying language spec.
The same JSX code parses in exactly the same way in Babel, TypeScript and Flow, and I don't really want to violate the JSX spec and break this implementation's interop with others.
If you want to explicitly parse that code as an expression, add an explicit semicolon at the beginning of the line just like you would before /, (, ` or [.
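The same leading-semicolon habit can be illustrated for ( and [ in plain JavaScript. This is a minimal sketch rather than anything from acorn-jsx: runBody is a hypothetical helper around the standard Function constructor, and f and a are throwaway names used only for the demonstration.

```javascript
// Hypothetical helper: runs a function body and returns its result.
function runBody(source) {
  return new Function(source)();
}

const defineF = "var f = function (x) { return 'called with ' + x; };\n";

// No semicolon: the parenthesis continues the previous line, so f is called.
const continued = runBody(defineF + "var a = f\n(1)\nreturn a;");

// Leading semicolon: the parenthesized expression becomes its own statement.
const separated = runBody(defineF + "var a = f\n;(1)\nreturn typeof a;");

// Same effect with a bracket: element access vs. a separate array literal.
const indexed = runBody("var a = [10, 20, 30]\n[1]\nreturn a;");
const literal = runBody("var a = [10, 20, 30]\n;[1]\nreturn a.length;");

console.log(continued, separated, indexed, literal);
// prints: called with 1 function 20 3
```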
since there is a regression on the following tests of original acorn due to improper opening angle bracket handling
Yeah, thanks, it was earlier reported in #41 and already fixed - make sure you're using the latest acorn-jsx version.
I'll close this for now; if you wish, you can open a discussion on the JSX specification repo instead.
Thanks for the explanation, it makes sense.
My humble opinion is that "one parse tree is better than no tree", but I agree that if the majority of parsers in the JS industry follow the same convention, changing it will do more harm than good.
My original point was that under current ASI rules the semicolon would have been inserted if the tokenizer was producing tt.jsxTagStart instead of tt.relational, because the following condition would be met:
The offending token is separated from the previous token by at least one LineTerminator
However, the tokenizer uses a naive algorithm for < parsing (this is why you had to add one more condition to the readToken function to handle HTML comments properly). My point was that if the tokenizer used a smarter algorithm, i.e. a "Longest Acceptable Token Match", then ASI would kick in and a valid syntax tree would be produced. Right now, the offending token occurs two tokens further down the input stream, which is why ASI cannot be applied even though it is technically possible.
Btw, I'm using your plugin to parse something else (not JSX), so this is why I keep coming up with weird cases.
This parses fine:
This fails:
with the following exception:
Both cases should either fail or parse at the same time. But there does not seem to be an easy fix for this with a recursive-descent parser like acorn, since the grammar is ambiguous without significant lookahead.