getify / literalizer

Specialized heuristic lexer for JS to identify complex literals

Parse HTML comments #19

Closed: ljharb closed this issue 11 years ago

ljharb commented 11 years ago

JS supports HTML comments, which is insane, but there it is.

Specifically, here's the current behavior:

var code = 'foo();\n// bar\n/*baz\n\n\nquux*/\n<!-- html comment\n\n--> another comment\n\nbaz()';
lex(code) => [
  { type: 0, val: 'foo();\n' },
  { type: 1, val: '// bar' },
  { type: 0, val: '\n' },
  { type: 1, val: '/*baz\n\n\nquux*/' },
  { type: 0, val: '\n<!-- html comment\n\n--> another comment\n\nbaz()' }
]

However, it should give:

var code = 'foo();\n// bar\n/*baz\n\n\nquux*/\n<!-- html comment\n\n--> another comment\n\nbaz()';
lex(code) => [
  { type: 0, val: 'foo();\n' },
  { type: 1, val: '// bar' },
  { type: 0, val: '\n' },
  { type: 1, val: '/*baz\n\n\nquux*/' },
  { type: 0, val: '\n' },
  { type: 1, val: '<!-- html comment' },
  { type: 0, val: '\n\n' },
  { type: 1, val: '--> another comment' },
  { type: 0, val: '\n\nbaz()' }
]
getify commented 11 years ago

Hmmmmm...

Despite various theories about HTML comments in JS, I don't think these are actually comments as far as JS is concerned.

I think they are pre-stripped by the HTML parser. According to the ES5.1 grammar, // and /* */ are the only valid comment forms. So I doubt very much that the JS engine is even seeing them by the time it lexes the code.

I decided to ask on twitter to verify. We'll see what comes up.

Also, I tried esprima and traceur, and neither of them accepted HTML comments as valid JS in any way. It's clear they both expect HTML comments to have been stripped out before they get the code.
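
For instance (a quick sketch, assuming esprima is installed; the exact error message varies by version):

var esprima = require('esprima');

try {
  esprima.parse('<!-- html comment');
}
catch (err) {
  // esprima rejects the marker outright, e.g. "Unexpected token <"
  console.log(err.message);
}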

getify commented 11 years ago

This doesn't necessarily mean that literalizer shouldn't identify these special (unexpected) comments. Very interesting question. If you use literalizer to process arbitrary (legacy) JS code, code which might be destined for inclusion in an inline <script> tag in an HTML page, you may very well have those HTML comments strewn throughout it (regardless of whether that's a good idea).

Right now, literalizer just ignores and skips over them. Well, not entirely. It treats those special characters like actual JS operators, which affects the heuristics of where blocks are allowed, etc. So it's quite possible that an HTML comment could throw off literalizer's identification of regex literals that immediately follow blocks. It's also quite possible that an HTML comment in a certain location could throw off other heuristics.
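
To illustrate (a hypothetical input, not a confirmed failure case): in a browser the middle line below is stripped entirely, but if the <, !, and -- characters are treated as plain operators, a heuristic lexer could plausibly tag the /foo/ in the middle as a regex literal the browser never even sees:

var x = 1;
<!-- a browser ignores this whole line, /foo/ and all -->
x = 2;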

One might argue that literalizer should recognize them and call them out as specific element types. One might argue that literalizer should recognize them (that is, not let them muck with heuristics), but should just skip over them as regular JS text.

Or, one might just do as esprima and the other parsers seem to do, and punt on them (assume they are not there, and you're at your own risk if you do put them in).

I am quite undecided at this point. There are very valid arguments on all three sides.

getify commented 11 years ago

Well... here we go: @mathiasbynens responded. It's non-standard, but a de facto part of all JS engines in browsers.

That doesn't change anything about my indecision as to whether literalizer should align its behavior more with this non-standard stuff in browsers, or more like all the other JS-based parsers/tools out there, which seem to universally NOT handle these types of comments.

getify commented 11 years ago

Uh oh... @rwaldron drops the mic.

var a = 1, b = 1; a <!--b;

This is bad, because if JS tooling doesn't recognize the <!-- as a single-line comment, but the browser engine does, then the tooling will come up with different results for the code. That sucks. :(
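
To spell out the two readings of that snippet (my annotations):

var a = 1, b = 1;
a <!--b;
// browser engines (HTML comment): the statement is just `a;`, so b stays 1
// strict ES5 parse: `a < !--b;` decrements b, so b ends up 0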

mathiasbynens commented 11 years ago

> That doesn't change anything about my indecision as to whether literalizer should align its behavior more with this non-standard stuff in browsers, or more like all the other JS-based parsers/tools out there, which seem to universally NOT handle these types of comments.

Maybe this helps: what is your definition of “non-standard”? Not part of ECMAScript? There is a standard that defines these things, so technically it’s not non-standard.

Ideally, JavaScript parsers and other tools would implement JavaScript, as in, Web ECMAScript, instead of just ECMAScript. (Well, ideally, there wouldn’t be a separate JavaScript / Web ECMAScript spec, and everything (incl. compatibility requirements) would be defined by the ECMAScript spec itself. We’re slowly getting there.)

getify commented 11 years ago

@mathiasbynens but does it make sense for only one JS tool to implement a different standard than the other JS tools? How does that help tool interop? Is(n't) tooling interop and predictability every bit as important as predictability of code interpretation at different layers of the stack?

I'm not saying I have an answer yet. I'm quite torn.

mathiasbynens commented 11 years ago

> Is(n't) tooling interop and predictability every bit as important as predictability of code interpretation at different levels?

No.

When tools have bugs (and not accounting for real-world usage is definitely a bug), they can and should be fixed. Behavior in JavaScript engines and in browsers that is there because of compatibility requirements is unlikely to be changed/removed.

So, the only way to improve the situation is to fix the broken tools.

getify commented 11 years ago

@mathiasbynens except that it's unclear to me, if literalizer were the only tool to start recognizing HTML comments, whether this would be seen as a benefit of literalizer which the other tools might aspire to live up to, or whether it would be seen as detrimental (from the tooling perspective) to deviate from all the other tools in some "strange" way.

literalizer wants "market share" just like browsers do. It needs to behave in a way that makes it a peer to other JS tools, or better, but certainly not seen as "non-standard" or, worse, "sub-standard". ;-)

All this is to say that the decision about how to handle this is more complicated than it seemed at first glance.

ljharb commented 11 years ago

Sorry for not providing more context initially. Worth noting that node (because it uses v8) supports this syntax too.

Since one of the main purposes of literalizer is to preparse difficult literals so that other tools can parse the rest - whether for enforcing style, syntax, or anything else - I think it's a good idea for any possibly valid syntax to be supported.

It could definitely be a new segment type - it may not be a "comment" - but it's something.

getify commented 11 years ago

Confirmed that node does in fact allow these. Surprises me, but proves this is a discussion worth having.

But the question of "valid syntax" is a relative one, IMO. "Valid" according to whom? Before this morning, I would have said that ES is the only "whom" that matters. Now I'm not so sure.

And as I said above, it still matters to me how other JS tools are going to address this question. I'm going to file bugs with them and see how those conversations shake out.

ljharb commented 11 years ago

All sorts of JS libraries account for the fact that certain older browsers return 'function' for the typeof of a regex, despite that not being part of the spec. I'd say shooting for ES, but deviating where engines do (as long as that doesn't prevent valid ES syntax from working), is how a tool should work.
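
(For the record, the usual workaround in those libraries is something like this sketch, since typeof can't be trusted for regexes in those engines:)

function isRegExp(x) {
  // reliable even where `typeof /re/` reports "function"
  return Object.prototype.toString.call(x) === '[object RegExp]';
}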

I look forward to hearing what other tools decide!

getify commented 11 years ago

Further info on tooling behavior:

getify commented 11 years ago

I've also posted to the JS-tools list to see what others there think.

getify commented 11 years ago

Here's a discussion thread from a while back on why uglifyjs considered and ultimately rejected them: https://github.com/mishoo/UglifyJS/issues/346

getify commented 11 years ago

It's apparently slated to be standardized in ES6 (Annex B.1.3): http://people.mozilla.org/~jorendorff/es6-draft.html#sec-B.1.3

Given that literalizer is already going to great lengths (other filed issues) to support all the complexities that ES6 foists upon us, if they are in fact going to standardize HTML-style comments, we probably should as well. I don't want to. But we probably should.
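
For reference, my reading of the Annex B draft grammar (hedged, since the draft may still change): <!-- starts a single-line comment anywhere it appears, but --> only counts as a comment when nothing except whitespace or comments precede it on that line:

var n = 2;
<!-- comment to end of line, wherever it appears
--> also a comment, since only whitespace precedes it on this line
n --> 0;  // NOT a comment here: parses as `(n--) > 0`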

ljharb commented 11 years ago

I agree with you on all counts.

getify commented 11 years ago

uglifyjs landed a patch this morning, as did Acorn. esprima has accepted it and will patch soon. JSHint is probably going to have to patch for it as well (since it currently has a bug that crashes on these).

The cards are falling, and so it's clear now that literalizer will need to do so as well.

However, I will probably make this a configurable option, as suggested in #14. Given what we know, the safer thing is probably to have it default to on, but then let someone turn it off if they aren't targeting such an engine and/or don't want to let such things through.

If you turn off literalizer's tokenizing of these and it encounters them, it will treat them as their constituent operators, which might in fact throw off the heuristics of other identifications, but that's a risk you explicitly take when you turn off this config option.

getify commented 11 years ago

Landed this change just now. Recognition is on by default, but can be turned off by setting the overlook_html_comment_markers option to true.
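
A usage sketch (the option name is from the change above, but the exact shape of the options API here is an assumption, so check the README):

var LIT = require('literalizer');
var code = 'foo()\n<!-- html comment\nbar()';

// default: HTML comment markers are recognized as comment segments
var segments = LIT.lex(code);

// opt out (assumed API shape): markers fall through as plain operators
var raw = LIT.lex(code, { overlook_html_comment_markers: true });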