beautifier / js-beautify

Beautifier for javascript
https://beautifier.io
MIT License
8.6k stars 1.38k forks source link

Object division and regex misparsed #1198

Open ghost opened 7 years ago

ghost commented 7 years ago

There are a few issues with JavaScript code parsing. A regex is sometimes interpreted like a comment, comments are interpreted like regexes, objects are interpreted like code blocks, ternary expressions like labels etc.

// Semi-keywords as identifier names (allowed by the specs)
`${async++}//`;
`${await++}//`;
`${let++}//`;

// Undefined after an object
+{1:+{}/undefined};

// Regex is highlighted like a comment while dividing an object in escaped sequence of a template literal string
`${{}/ /\//}`;

// Regex in a class extend
+class extends{1:{}/ /\//}{};

// Regex in a default function parameter
(a={}/ /\//)=>a;

// Regex in a ternary expression
{}/./gym?
{}/ /\//:
{}/ /\//;

// Spread operator before regex
[.../\//];

// Void before regex
void/\//;

// A regex in do...while loop
do/\//
while(0);

// A regex in if..else statement
if(0);else/\//;

// Template literals ecaping sequence
`${1}//`;

// Generator function with empty name as a method of an object (wrongly interpreted)
+{*''(){}};

I have a few more, but I am not sure if it maybe depends on some of the these ones, so I am not posting now all issues.

bitwiseman commented 7 years ago

@guest161373 - Not sure what you're saying. For each of the examples above, please give an example of how it is being misformatted. Your first example `${async++}//;` is unchanged when run through http://jsbeautifier.org/ . That seems right.

ghost commented 7 years ago

@bitwiseman

Some of them are misformatted and some of them are just mishighlighted. For example, in the first test case

`${let++}//`;

two slashes are highlighted like a comment, while it should be part of a string. However, it is just mishighlighted, because if you put braces around it, it become properly indented:

// Input
{`${let++}//`;}

// Output
{
  `${let++}//`;
}

It is an example of wrong highlighting, but correct formatting. However lets take a look at the second example:

+{1:+{}/undefined};

It becomes formatted like this:

+{
  1: +{}
  /undefined};

white it should be like this:

+{
  1: +{} /
    undefined
};

The problem here is not only wrong highlighting (slash is interpreted like start of a regex instead a division operator), but also wrong interpretting. Closed brace is misinterpreted as part of a regex, but there are no regex at all, so closed brace is not placed on a new line.

I hope you already understand what's the issue, but lets see just one more test case:

(a={}/ /\//)=>{return a;};

I just replaced a with {return a;} to make it more obvious where the formatting fails. It is formatted like

(a = {}
  / /\ //)=>{return a;};

while it should be formatted like

(a = {} / /\// ) => {
  return a;
};

JS Beautifier thinks that division is start of a regex and that two slashes are start of a comment, while it is actually an object divided by a regex (also notice that GitHub highlighter misinterpretted a lot of test cases wrongly too).

I think it is now more clar what is the problem. It is not hard to modify all of the test cases (for example put braces around it) to see where formatting fails. You can see here how it looks. Notice that wrongly interpreted parts of the code are circumscribed.

bitwiseman commented 7 years ago

All mis-highlighting is not a js-beautify bug. The beautifier is not involved in highlighting at any level. Depending on what tool you are using, the bug is with them. If you are looking at http://jsbeautifier.org/ , that uses http://codemirror.net/ for highlighting. Sorry, nothing we can do about those.

As to the misformatting/misparsing, those are bugs in the parser, but it would helpful to know where these patterns come from and what their intent is. What does +{} mean? Is valid javascript I guess, but what does it mean? The same with {}/ /\//.

ghost commented 7 years ago

@bitwiseman

As an answer to your question

"Where these patterns come from and what their intent is. What does +{} mean? Is valid javascript I guess, but what does it mean? The same with {}/ /\//."

The code +{} is used to distinguish object from code block. If we want to create an object and immediatelly call its property, the following example is wrong:

{
  func: function(){
    console.log(1);
  }
}.func();

It will throw syntax error. Why? Because there is no single object here. The braces are for code block. If we want to make it object, we need to put some unary operator before it (+, -, !, ~, etc). Simillary is for immediatelly function call:

function(){
  console.log(1);
}();

This is also syntax error (function name is ommited), but if we place unary operator before it, then it will not be a function declaration statement, but instead an anonymous function used in expression and we can call it.

Your second question "What does +{} / /\// mean?" is more abstract. According to the standard, it is an object divided by a regex. However, it might be difficult to find a real situation when someone would use it. For example, I use it in my code obfuscator where I am exploiting EcmaScript standard to make hard-to-disassemble code and prevent stealing and modifying. I thought about it and I actually came up with a real situation it can be used in:

RegExp.prototype.a = 4;
var m = {
  valueOf: () => 12
} / /\//.a;
console.log(m);

Here, we are defining property a of every single regex instance we create. Then, we are declaring and initializing variable m which is object divided by a regex's propery a. So, the object is reduced to number 12 and the second operand is reduced to 4, so 12 / 4 == 3. Of course, it may be used very rare (I actually didn't see anyone is using it instead my obfuscator), but it is still correct ES syntax.

[One paragraph removed due to privacy reasons]

No matter if some code appears in real situations or not, why not cover all cases? Also, code beautifiers are mostly used for minified and obfuscated code, which may inslude a lot of nonsence code structures. There are examples like this one which never throw syntax error but the code is useless. Another example is empty regex character group, which is not allowed in PCRE but allowed in JavaScript. It is correct syntax, but that regex is useless.

bitwiseman commented 7 years ago

If we want to make it object, we need to put some unary operator before it (+, -, !, ~, etc).

That is a fascinating point. I wasn't sure but the beautifier does treat +{} as an object literal for formatting. It seems likely that the problem is that the object literal detection currently occurs after parsing. So, the parser doesn't figure out that the / is an operator rather than a regex literal. Ouch.

You also asked "Where these patterns come from?". Well, my hobby is to create random code generators for different languages in order to check syntax highlighting.

That is pretty awesome. There have been times where I thought, "There aren't many kinds of software development more thankless that writing a code beautifier." You have reminded me that testing is more thankless, and exhaustive/depth tests like what you're describing are even more thankless than that. 😄 I'm impressed and I appreciate you doing this.

No matter if some code appears in real situations or not, why not cover all cases?

Short answer: because resources are limited. The beautifier only recently reached the point of "parsing" the javascript input before formatting. That parsing is approximate, and also attempts to handle a number of templating languages that are not Javascript and other languages similar to Javascript.

I agree that this is a bug in the parsing, and that it should be fixed. This is an important bug highlighting a flaw in one of the assumptions of the current parser implementation. At the same time, I have to balance that against the number of users impacted, the resources needed to fix it, and long term goals of the project. So, this case may take some time before it is fixed.