Improving Closure Compiler variable renaming for JQuery like code

GoogleCodeExporter commented 9 years ago

This was something I pointed out in the external meeting today.

The short story:

We were pretty much aware that our variable renaming algorithm is not so good 
for JQuery. A few of us tried different heuristics that can fix this. All of 
them were deleted because it neither hurt Closure Library style code or would 
make renaming very unstable. We concluded that the existing algorithm still do 
a pretty good job. But something must have happened between JQuery 1.7 and 
JQuery 1.9.1 that might have amplified the issue.

Long story:

Here is how renaming works today for a local scope. (Note that the whole JQuery 
is one huge function with lots of inner functions) Each variable gets renamed 
to L_# where # is the index of where they appear in the scope.

Example:

function global(w) {
  function y(a,b,c) {

  }

  function z(m,n) {

  }

  var j;
  var k;
  var l;
  var o;
  var p;
}

Would become 

function global(L_1) {
  function L_2(L_8,L_9,L_10) {

  }

  function L_3(L_8,L_9) {

  }

  var L_4;
  var L_5;
  var L_6;
  var L_7;
}

The next step is we count the frequencies of each variable names, assign a 
short name based on that information. L_8 gets 'a' if it is used all over the 
place and L_44 gets "zzz" if it only occur once. This heuristic gives us a way 
to make sure that:

1. We would end up with lots of "function(a,b,c.." as substrings in the output. 
These are awesome for gzip's back pointer. Info here: 
https://en.wikipedia.org/wiki/LZ77_and_LZ78

2. Without compromising #1, we still have variables that are more frequently 
used assigned a short name.

However, we could potentially run into odd ball cases. Consider my previous 
example. Suppose 'o' and 'p' is used *very* often. Just because they appeared 
in that order, they ended up being L_6 and L_7. Instead, it would have been 
much better to use L_8 and L_9 for o and p since there are probably lots of L_8 
and L_9's in other part of the code.

Now you might ask.

"Well, what about that variable name shadowing pass you talk about all the 
time?"

That's a very good point. Shadowing would fix this case. L_8 and L_9 are prime 
candidate for shadowing. L_8 would become L_6, L_9 would become L_7. Now that 
those names are merged, they would now have a very high priority in the short 
name queue.

However, it is very easy to stop the compiler from shadowing. Let say I added 
this function to the end of the example:

 function yy(x,y) {
   alert(o); alert(p);
 }

Intuitively this stops shadowing of o and p. IE:
  function yy(o,p) {
   alert(o); alert(p);
 }
 is now a completely different function.

The end result is that as long as you have one function that prevented 
shadowing, all function can no longer shadow L_6 and L_7. Why? Because yy would 
have looked like this at some point.

  function L_32412(L_8,L_9) {
   alert(L_6); alert(L_7);
 }

Because that all first parameters must be L_8 and Shadowing concludes that L_8 
can't shadow L_6. We bail out on shadowing o and p for all the functions.

Finally, I am just pointing out the problem. I don't have a good solution :)

Original issue reported on code.google.com by acle...@gmail.com on 1 Apr 2013 at 8:07

GoogleCodeExporter commented 9 years ago

Original comment by acle...@gmail.com on 1 Apr 2013 at 8:13

Changed state: New

GoogleCodeExporter commented 9 years ago

From your example, it seems that in the first pass we choose globally unique 
variable names for everything except formal parameters of functions.

It might be better to have the pass work on each individual scope separately. 
The signature of the renaming function would be sth like:
void renameVars(Node n, Set<String> inUse);
where n is the root node of a scope, and inUse is a set of variable names we 
aren't allowed to use in that scope. Any other name is fair game.

This way, when we call renameVars for y or z, we know that o and p don't 
interfere with the choice of names, and when we call renameVars for yy we know 
that yy references o and p so we can't shadow them.

Original comment by dim...@google.com on 1 Apr 2013 at 9:08

GoogleCodeExporter commented 9 years ago

As I understand it, the real problem is that "$" gets L1 in one local scope but 
L2 in another local scope.  Right? Could we just fix this by having a better 
algorithm for assigning L variable IDs?

Original comment by Nicholas.J.Santos on 1 Apr 2013 at 9:22

GoogleCodeExporter commented 9 years ago

As in assign the L variable ids by local scope frequency rather than logical 
order?

Original comment by chadkill...@missouristate.edu on 1 Apr 2013 at 9:34

GoogleCodeExporter commented 9 years ago

@dimvar:

That's possible. The problem is that right now we do everything in a single 
order. How do we compute 'inUse'? It is a trade-off problem. You can 
potentially preserve more variables in the 'inUse' to save size but it might 
hurt that really big function.

To illustrate this trade-off more clearly. Lets go back to my example:

Suppose in function yy: We have an algorithm that detects the shadowing 
conflict and wants to not use L_8 and L_9 for parameter. At this point, we have 
no idea if it is worth it. Yes, lowering total number of variable names by 
shadowing is good but by doing that you have one less function that have a 
substring starting with "function(a,b". Will it be good? The only way to find 
out is to run gzip on it.

Hehe. If I have the time and the skill in computational complexity, I think we 
can prove that renaming vars for optimal gzip size is pretty much NP-Complete.

@Nick:

I am not sure. I am mostly unfamiliar with JQuery. Where is $ defined? Can you 
explain why you think $ is the problem? When I look at JQuery 1.9, I see uglify 
able to use 1 char names in lots of occasions.

While I have your attention, do you remember this patch you sent out in 2011 
that address this?

Original comment by acle...@gmail.com on 1 Apr 2013 at 10:00

GoogleCodeExporter commented 9 years ago

@Alan: "$" is only defined in jQuery near the end of the file during the 
exports (window.jQuery = window.$ = jQuery). Internally, references all use the 
"jQuery" name not the "$" alias. As far as I know this is true for quite a few 
jQuery versions (at least as far back as 1.7 - probably farther).

I thought of what changed though. Prior to 1.9, each file (module) had it's own 
function wrapper. In jQuery 1.9 everything was merged into one non-conflicting 
namespace and a single wrapper function surrounds the entire code base.

Original comment by chadkill...@missouristate.edu on 1 Apr 2013 at 10:08

GoogleCodeExporter commented 9 years ago

@Alan: 
> Yes, lowering total number of variable names by shadowing is good but by 
doing that you have one less function that have a substring starting with 
"function(a,b". Will it be good?

That's not necessary. We know which variable names are used in the whole 
program before renaming. If we pick names for formal parameters that don't 
appear in the program, all functions will start with the same substring and 
we'll still have maximum shadowing.

Original comment by dim...@google.com on 1 Apr 2013 at 10:16

GoogleCodeExporter commented 9 years ago

Actually, sorry, it's not possible that all functions start with the same 
prefix, eg,

function f(x) {
  ...
  function g(y) {
    ... references x ...
  }
  ...
}

Clearly, we can't use the same variable for x and y.

But we can use the same names for all functions that at the same nesting level 
at least.

Original comment by dim...@google.com on 1 Apr 2013 at 10:21

GoogleCodeExporter commented 9 years ago

@dimvar:

The problem still lingers in the same nesting level. Same example:

  var o, p;
  x  = function (a,b,c) {}
  y  = function (m,n) {}
  yy = function (x,y) {alert(o); alert(p)}

Suppose you want to shadow o p: (code #1)
  var o, p;
  x  = function (o,p,c) {}
  y  = function (o,p) {}
  yy = function (x,y) {alert(o);alert(p)} <---we can't use o, p here. Therefore we needed up with a function that isn't function(o,p..

You can argue that instead of renaming the parameters, we can just rename, o 
and p?

  var a, b;
  x  = function (a,b,c) {}
  y  = function (a,b) {}
  yy = function (a,b) {alert(a); alert(b)} <--- this is wrong, there is no way to access the original o and p now.

At the end of the day, the only correct naming scheme are (code #1) and this 
(code #2)

  var o, p;
  x  = function (a,b,c) {}
  y  = function (a,b) {}
  yy = function (a,b) {alert(o);alert(p)} 

Again, we don't know which if #1 or #2 is better until we gzip it.

Original comment by acle...@gmail.com on 1 Apr 2013 at 10:47

GoogleCodeExporter commented 9 years ago

And for comments about just a smarter L_# renaming algorithm:

It might not be as simple neither.

consider this:
function()   { ....... var x; var y;.... }
function()   { ....... var m; .... }
function()   { ........var n; }

where x, y are used a lot, we give it L_1 and L_2.

Now when we are at 'm' (another frequently used local), do we give it L_1 or 
L_2? It seems like a greedy algorithm mapping it to L_1 make sense, right? 
That's not always the case. Gzip can use dynamic huffman encoding. If there are 
no more L_1's after the first function and the total count of y, m and n are 
greater than the count of x. It is actually better to use L_2 since gzip can 
use a new encoding that favors L_2 after the first function. (doesn't this 
sound like a http://en.wikipedia.org/wiki/Knapsack_problem?)

Now add back shadowing and the "function(a,b,c" problem....
-What if the 2nd function is actually function(m){? Do we give up a 
"function(a" substring?
-What if not naming 'n' to L_1 or L_2 enables an extra shadowing? Would that be 
worth it?

What a mess! :)

Original comment by acle...@gmail.com on 1 Apr 2013 at 11:35

GoogleCodeExporter commented 9 years ago

can somebody post repro steps of the actual bug being reported, for those of us 
who missed the meeting today? is it just closure-compiler simple mode vs. 
uglify on jquery 1.9.1?

Original comment by Nicholas.J.Santos on 2 Apr 2013 at 4:45

GoogleCodeExporter commented 9 years ago

Simple mode vs uglify on JQuery 1.9.1

Original comment by acle...@gmail.com on 2 Apr 2013 at 9:02

GoogleCodeExporter commented 9 years ago

Thinking of this overnight, I'm wondering how the following renaming heuristic 
would compare to the current one - it's based on function scope depth:

function global() {
   function foo1(arg1, arg2, arg3) {
      function bar(x, y ,z) {
         var barlocal1;
      }
      var foolocal1;
   }
   function foo2() {
      function bar2(x, y ,z) {
         var bar2local1;
      }
      var foo2local1;
   }
   var global_local1;
}

Intermediate renaming step 1 - rename function args based on scope depth:

function global() {
   function foo1(D1_1, D1_2, D1_3) {
      function bar(D2_1, D2_2, D2_3) {
         var barlocal1;
      }
      var foolocal1;
   }
   function foo2() {
      function bar2(D2_1, D2_2, D2_3) {
         var bar2local1;
      }
      var foo2local1;
   }
   var global_local1;
}

Intermediate renaming step 2 - rename local variables based on frequency, 
reusing the best function arg name available:

function global() {
   function foo1(D1_1, D1_2, D1_3) {
      function bar(D2_1, D2_2, D2_3) {
         var L3;
      }
      var L2;
   }
   function foo2() {
      function bar2(D2_1, D2_2, D2_3) {
         var D1_2;
      }
      var D1_1;
   }
   var L1;
}

In this way all functions with the same scope depth will have the same 
signature pattern.

@Nick - here's the current gzip results after compilation:
33641 - closure compiler
32769 - uglifyjs

Original comment by chadkill...@missouristate.edu on 2 Apr 2013 at 1:24

Added labels: Type-Enhancement
Removed labels: Type-Defect

GoogleCodeExporter commented 9 years ago

@chad:

The truth is, I don't know. :) It goes back to my point that we can probably 
never find the most optimal algorithm.

Now for your example, how are you doing step 2? How are the new names for 
bar2local1 and fo2local1 selected?

Original comment by acle...@gmail.com on 2 Apr 2013 at 5:05

GoogleCodeExporter commented 9 years ago

@alan
I assumed at that at the second step, you were traversing the tree again and 
that you knew from the previous traversal which function arg names were used 
and at what frequency. The example isn't complete, but the assumption was that 
D1_1 would be referenced very frequently because every function at scope depth 
1 with at least one argument would have such a reference. In that case, I knew 
in the current branch of the traversal that D1_1 was not in use (no arguments 
to the function) and so assigned the local variable that name.

I came at this idea by trying to look at the problem this way:

1. function signatures will always be longer than local variable names and so 
matches here will produce much better gzip results. Functions are also used a 
lot. Prioritize matching function signatures above local variable names.

2. functions at the same scope depth will never have conflicting arguments.

3. In practice, I'm guessing that function nesting depth is pretty flat. 
Greater than 3 seems unusual. Greater than 5? (lab/generated code excepted).

I agree with you that there is no such thing as the most optimal solution here. 
I'm just looking for better :-)

Original comment by chadkill...@missouristate.edu on 2 Apr 2013 at 6:40

GoogleCodeExporter commented 9 years ago

@chad:

The L_# renaming step should be very trivial Java code if you are interested in 
messing around with it.

I am trying to make sense out of one of your previous comments:

"In jQuery 1.9 everything was merged into one non-conflicting namespace and a 
single wrapper function surrounds the entire code base."

Does this imply the wrapper function holds lots of variables or everything is 
just one variable serving as the namespace?

Original comment by acle...@gmail.com on 3 Apr 2013 at 5:40

GoogleCodeExporter commented 9 years ago

@alan The wrapper now holds a lot of variables. The jQuery team worked to make 
sure differing files used uniquely named variables so that when the files were 
appended, no conflicts occurred.

--In 1.7--
File1:
var jQuery = (function(window, undefined) {
  var local1 = "";
  var local2 = "";
  var jQuery = function() {};
  return jQuery;
})(window)
File2:
(function(window, undefined) {
  var local1 = "";
  var local2 = "";
  jQuery.prototype.prop = function() {};
})(window)
...

--In 1.9--
File1:
  var file1_local1 = "";
  var file1_local2 = "";
  var jQuery = function() {};

File2:
  var file2_local1 = "";
  var file2_local2 = "";
  jQuery.prototype.prop = function() {};
  window.jQuery = jQuery;

Final output (wrapper added as build step):
(function(window, undefined) {
  var file1_local1 = "";
  var file1_local2 = "";
  var jQuery = function() {};

  var file2_local1 = "";
  var file2_local2 = "";
  jQuery.prototype.prop = function() {};
  window.jQuery = jQuery;
})(window)

Original comment by chadkill...@missouristate.edu on 3 Apr 2013 at 6:07

GoogleCodeExporter commented 9 years ago

i think this might be a red herring. I ran some tests with jquery 1.9.1. All 
results are gzipped:

uglify: 33079
closure-compiler: 33551
uglify (no variable renaming): 40782
closure-compiler (no variable renaming): 41214

I think the big differences are in how we choose names, and in how we choose 
line breaks. For names, we choose "a", then "b", then "c", and so on. Uglify 
tries to use commonly-used characters first. I think we've often talked about 
doing this, and you save some bytes this way. Removing line breaks is a bigger 
win.

If I do those 2 things, closure-compiler comes down to 33167.

All of that said, I've never really been interested in competing with YUI or 
Uglify. Small improvements on "simple" JS compression don't really help people 
much. I'll let you all decide how much more time/research to allocate to this.

Original comment by Nicholas.J.Santos on 6 Apr 2013 at 6:54

GoogleCodeExporter commented 9 years ago

Alan is having fun

Original comment by johnl...@google.com on 6 Apr 2013 at 5:25

GoogleCodeExporter commented 9 years ago

Yep. Pretty much. Just scratching an itch.....

Original comment by acle...@gmail.com on 9 Apr 2013 at 12:57

GoogleCodeExporter commented 9 years ago

I'm working on the heuristic I mentioned above. Once I get it to a somewhat 
functional state, I'll let you know how it performs I my code. If it does 
better, someone can test it on Google code.

Original comment by chadkill...@missouristate.edu on 9 Apr 2013 at 12:30

GoogleCodeExporter commented 9 years ago

Sure. I have been testing out a few heuristics when I have some free cycles as 
well.

My approaches has been trying to get as many variables to fall into rule #2 
while completely following rule #1 of my first post.

Original comment by acle...@gmail.com on 9 Apr 2013 at 5:47

GoogleCodeExporter commented 9 years ago

Wow. This is a meta issue on lots of renaming issues. It is hard to keep track 
of the conversation.

I can now confirmed that is *not* uglify using commonly used chars first that's 
the issue. As I stated before, Closure Compiler code was a lot bigger even 
before gzipping. It is the underlying renaming algorithm.

Original comment by acle...@gmail.com on 11 Apr 2013 at 9:27

GoogleCodeExporter commented 9 years ago

Original comment by concavel...@gmail.com on 15 Apr 2013 at 8:02

Changed state: Accepted

GoogleCodeExporter commented 9 years ago

It is amazing that on large code bases a slightly 'smarter' name generation has 
impact in the ~20 *bytes* range.

Original comment by acle...@google.com on 26 Apr 2013 at 7:43

GoogleCodeExporter commented 9 years ago

Issue tracking has been migrated to github. Please make any further comments on 
this issue through https://github.com/google/closure-compiler/issues

Original comment by blic...@google.com on 1 May 2014 at 6:31

Added labels: Restrict-AddIssueComment-Nobody

GoogleCodeExporter commented 9 years ago

Original comment by blic...@google.com on 1 May 2014 at 6:34

Changed state: Duplicate

HoriSun / closure-compiler

Improving Closure Compiler variable renaming for JQuery like code #963