[dart2js] Minify enum names please.

dart-lang / sdk

The Dart SDK, including the VM, JS and Wasm compilers, analysis, core libraries, and more.

https://dart.dev

BSD 3-Clause "New" or "Revised" License

10.11k stars 1.56k forks source link

[dart2js] Minify enum names please. #53988

Closed hamsbrar closed 10 months ago

hamsbrar commented 10 months ago

enum E {
  descriptiveName1,
  descriptiveName2,
  descriptiveName3,
}

void main() {
  for (var e in E.values) {
    print(e.index);
  }
}

Output of dart js compile -O4 contains:

B.b=new A.r(0,"descriptiveName1")
B.c=new A.r(1,"descriptiveName2")
B.d=new A.r(2,"descriptiveName3")
//             ^^^^^^^^^^^^^^^^ these ones
// ...

r:function r(a,b){this.a=a
this.b=b},

// main()
bX(){for(var t=0;t<3;++t)A.bZ(""+B.i[t].a)},

> dart --version
Dart SDK version: 3.1.5 (stable) (Tue Oct 24 04:57:17 2023 +0000) on "linux_x64"

Same on the newest dev release(3.3.0-84.0.dev (dev) (Tue Oct 31 17:06:17 2023 -0700)).

Also, not using Enum.name anywhere but it's part of the output, appears to be required by toString:

// non-minfied output:

B.E_0 = new A.E(0, "descriptiveName1");
B.E_1 = new A.E(1, "descriptiveName2");
B.E_2 = new A.E(2, "descriptiveName3");
B.List_q2S = makeConstList([B.E_0, B.E_1, B.E_2]);

// ...

E: function E(t0, t1) {
  this.index = t0;
  this._name = t1;
},

A._Enum.prototype = {
  toString$0(_) {
    return this._enumToString$0();
  }
};

A.E.prototype = {
  _enumToString$0() {
    return "E." + this._name;
  }
};

lrhn commented 10 months ago

The name is required by toString, stated explicitly on the language specification, and toString is notoriously hard to tree shake.

hamsbrar commented 10 months ago

Can't I just nuke T.toString when T.toString is not in-use and there's no upcast?

Like using E1.descriptiveName.index forces compiler to preserve E2.descriptiveName.index as well while it should've stripped both Enum.index and E2.descriptiveName.index of. Probably trying that already or probably doesn't worth the effort(for fields) or probably that's not how it's done(for fields/methods).

The name seems to be there for debugging purpose. Specification doesn't seem to have any notion of production mode so I'm left with minification as the only option that compiler currently isn't doing.

rakudrama commented 10 months ago

Can't I just nuke T.toString when T.toString is not in-use and there's no upcast?

Unfortunately, .toString() is called from many places that can accept anything, for example, Iterable.join and StringBuffer.write.

It is essentially impossible to efficiently prove that the enum values do not pass to those places. The kind of query 'does X definitely not flow here' generally amounts to 'find all the things that flow here and see that X is not included'. If we have a concise summary for 'all the things', e.g. 'an Object' or 'an Enum', then X is included and the test fails to remove X. If we make a precise summary, it is huge, and its computation and representation are generally superlinear in the size of the program. It just doesn't work for real-world applications. Other ways of answering that kind of query are also expensive.

Like using E1.descriptiveName.index forces compiler to preserve E2.descriptiveName.index as well while it should've stripped both Enum.index and E2.descriptiveName.index of. Probably trying that already or probably doesn't worth the effort(for fields) or probably that's not how it's done(for fields/methods).

A further complication is that the base class of enums, _Enum, is where the fields reside. All it takes is for one enum to need the index or _name, and they all get it. So it is really an illusion that index can be removed in a real program.

The name seems to be there for debugging purpose. Specification doesn't seem to have any notion of production mode so I'm left with minification as the only option that compiler currently isn't doing.

Changing the name would change the result of Enum.compareByName. This is not something that a compiler should do by itself.

The best I can come up with is to rename your enum values, and use the 'enhanced enum' feature to add back the descriptive names in the enum scope, something like:

enum E {
  _0,
  _1,
  _2;
  static const descriptiveName1 = _0;
  static const descriptiveName2 = _1;
  static const descriptiveName3 = _2;
}

hamsbrar commented 10 months ago

Unfortunately, .toString() is called from many places that can accept anything, for example, Iterable.join and StringBuffer.write.

It is essentially impossible to efficiently prove that the enum values do not pass to those places. The kind of query 'does X definitely not flow here' generally amounts to 'find all the things that flow here and see that X is not included'. If we have a concise summary for 'all the things', e.g. 'an Object' or 'an Enum', then X is included and the test fails to remove X. If we make a precise summary, it is huge, and its computation and representation are generally superlinear in the size of the program. It just doesn't work for real-world applications. Other ways of answering that kind of query are also expensive.

Here's a rephrased version of logic that's based on my unfounded assumptions:

IS_REQUIRED T.toStr => CANNOT_CHECK T || CHECK_1 T.toStr || CHECK_2 T.toStr || CHECK_3 T.toStr
CHECK_1 T.toStr => there exists a static expression of type T.toStr
CHECK_2 T.toStr => there exists a cast(implicit/explicit) from T to T0 && IS_REQUIRED T0.toStr
CHECK_3 T.toStr => there exists T1 that inherits T.toStr implementation && IS_REQUIRED T1.toStr
CANNOT_CHECK T => T is dynamic || T is escaping program(things like passed to externals)

(ignore termination issues, if there are any, I'm just trying to explain, the best I can, currently)

class A           { toString() => 'a'; }
class B extends A { toString() => 'b'; }
class C extends B { }
class D extends C { toString() => 'd'; }
class E extends D { toString() => 'e'; }

void main {
  print( C().toString() );
  print( StringBuffer.write( E() ) )
}

Run the above logic againt every toString, probably two times:

class A           { /* toString() => 'a'; */ } // removed, IS_REQUIRED(A.toString) => false
class B extends A {    toString() => 'b';    } // kept,    CHECK_3 passes, B.toString is inherited by C and IS_REQUIRED(C.toStr)
class C extends B { }
class D extends C { /* toString() => 'd'; */ } // removed, IS_REQUIRED(D.toString) => false
class E extends D {    toString() => 'e';    } // kept,    CHECK_2 passes, CANNOT_CHECK(E) => true

Of course this is a simple program and compiler could've resulted in just print('b') by eliminating the unreachable code first but I'm curious about performance implications and how stupid these checks really are on "real-world applications"?

A further complication is that the base class of enums, _Enum, is where the fields reside. All it takes is for one enum to need the index or _name, and they all get it. So it is really an illusion that index can be removed in a real program.

I understand now. I was thinking of moving Enum.index to E1 if that's the only enum class using it. Again, probably doesn't worth the effort for fields(especially when I'm doing for straight-forward ones as doing that for super constructors with complex init logic might be unsafe) or This is not the way.

Changing the name would change the result of Enum.compareByName. This is not something that a compiler should do by itself.

If Enum.compareByName is the only issue then I see two options:

Minify them when Enum.compareByName is not in-use.
Minify them in a way that doesn't change their order(easiest method would be to sort and minify them).

If first option is infeasible(or tree shaking this is hard) and sorting is expensive, then how about something like this:

enum E {
  a, // minify this to: a
  aa, // minify this to: aa
  ab, // minify this to: ab
  b, // minify this to: b
  bb, // minify this to: bb
  descriptiveName1, // minify this to: d1
  descriptiveName2, // minify this to: d2
  descriptiveName3, // minify this to: d3
}

// here's how I might try to do it
// just a try. you/others probably know way better methods to get these things done.

void main() {
  var minifier = OrderPreservingMinifier();

  for (var value in E.values) {
    minifier.considerThisLiteralDuringMinification(value.name);
  }

  for (var value in E.values) {
    print('Minified: ${value.name} to: ${minifier.minifyThisLiteralPls(value.name)}');
  }
}

class OrderPreservingMinifier {
  final root = OrderPreservingMinifierNode(parent: null);

  void considerThisLiteralDuringMinification(String literal) {
    var node = root;
    for (final char in literal.split('')) {
      node = (node.children[char] ??= OrderPreservingMinifierNode(parent: node));
    }

    node.isEndOfLiteral = true;
  }

  String minifyThisLiteralPls(String literal) {
    var result = '';

    var node = root;
    for (final char in literal.split('')) {
      node = node.children[char]!;

      var isEndOfLiteral = node.isEndOfLiteral;
      var siblingsCount = node.parent!.children.length - 1;

      if (isEndOfLiteral || 0 < siblingsCount) {
        result += char;
      }
    }

    return result;
  }
}

class OrderPreservingMinifierNode {
  final OrderPreservingMinifierNode? parent;
  final children = <String, OrderPreservingMinifierNode>{};

  OrderPreservingMinifierNode({required this.parent});

  /// Whether a literal from literals being minified ends at this position.
  ///
  var isEndOfLiteral = false;
}

One thing, this considers every character in identifiers as valid leading character for a identifier:

enum E {
  a1, // minifies to: 1
  a2, // minifies to: 2
  a3, // minifies to: 3
}

I don't want to complicate it but this is something that can be changed to match with what's required, if it works.

lrhn commented 10 months ago

Here's a rephrased version of logic that's based on my unfounded assumptions:

IS_REQUIRED T.toStr => CANNOT_CHECK T || CHECK_1 T.toStr || CHECK_2 T.toStr || CHECK_3 T.toStr

CHECK_1 T.toStr => there exists a static expression of type T.toStr

CHECK_2 T.toStr => there exists a cast(implicit/explicit) from T to T0 && IS_REQUIRED T0.toStr

CHECK_3 T.toStr => there exists T1 that inherits T.toStr implementation && IS_REQUIRED T1.toStr

CANNOT_CHECK T => T is dynamic || T is escaping program(things like passed to externals)

It's not wrong. And while this looks complicated, what's really happening is so much worse :)

Notice that:

CHECK_1 applies to Object.toStr (and Object?.toStr). There are definitely places where those happen
operator== usually has parameter type Object.
Ergo, any v1 == v2 makes the type of v2 require toString.

Same for any value passed to Iterable.contains, Map.removeKey, identical, Object.hash or print. One of those is not a false positive, but the types won't tell you which.

Actual tree-shaking needs to be much more precise than these rules, and try to guess which instances can actually flow from A to B. It's not enough to look a the types on the way, because then nothing would get tree-shaken, because everything is eventually Object?.

Then there are generics.

class ToStringer<T> {
  String toStringer(T value) => value.toString();
}
void main() {
  print(ToStringer<MyEnum>(MyEnum.v1));  
}

There is no occurrence of MyEnum.toString() in the program. There is no up-cast from MyEnum to another type, because T is bound to MyEnum where it's used. So we also need to recognize which types T can possbily be, at the point where we do T.toString().

And tree shaking does some of this well, but toString is used so much, in so many places and at so many types and generic types, that it's basically unshakable, unless you keep your type from ever flowing anywhere.

sigmundch commented 10 months ago

Thanks for the engaging discussion, I hope the answers above provide a bit more clarity into the complexity of this issue! I'm just marking this as "closed as not planned" based on the explanations above.

hamsbrar commented 10 months ago

Hello!

Apologies for putting multiple things in the same issue, my bad.

There are two issues here:

Minifying enum names in the output.
Removing enum.name from output when it's not in-use.

The second issue has been addressed(name is required by toString and tree shaking toString is hard using the method I suggested). I'll open a new issue, if there's anything new related to that.

The first issue, is relatively easy. Sorting and then minifying is the safest option. This could be added behind a compiler flag(probably a generic one for such optimizations). One more alternative that doesn't require sorting(so no flags) is also suggested. That can be further improved if it's found to be correct.

so first issue is still pending

(and I can't re-open this thread, it doesn't seem to have any option for a nobody)

sigmundch commented 10 months ago

Thanks for clarifying, I missed that detail.

rakudrama commented 10 months ago

@hamsbrar Can you say more about the impact of the enum names on your application? Does it prevent you from shipping an application? Does it increase the size by X% where X% clearly a significant overhead? Is there some other negative impact?

hamsbrar commented 10 months ago

Can you say more about the impact of the enum names on your application?

There are 1,223 enum values(239 enum types). Minification removes 15,938 out of 19,530 characters.

These are the numbers I get using the method shared in this thread. An improved version of that method(or using sorting+minification) will remove more. Also this doesn't include many enums that I failed to extract and many that lives in third-party packages, framework(not flutter), in SDK/core libraries.

Does it prevent you from shipping an application?

No.

Does it increase the size by X% where X% clearly a significant overhead?

1-3% (3% when everything is lazy, 1% when most are not).

I'm not bothered by the other 97-99%. There definitely are places where I could've done better, like way better than what I actually did. Sometime I just fail not to make stupid mistakes and then things stop being worthy of any effort. But this 1-3% are the characters that no-one on this planet need and compiler knows that.

hamsbrar commented 10 months ago

@lrhn

Notice that:
* _CHECK_1_ applies to `Object.toStr` (and `Object?.toStr`). There are definitely places where those happen

* `operator==` usually has parameter type `Object`.

* Ergo, any `v1 == v2` makes the type of `v2` require `toString`.
Same for any value passed to Iterable.contains, Map.removeKey, identical, Object.hash or print. One of those is not a false positive, but the types won't tell you which.

Actual tree-shaking needs to be much more precise than these rules, and try to guess which instances can actually flow from A to B. It's not enough to look a the types on the way, because then nothing would get tree-shaken, because everything is eventually Object?.

I understand now.

And tree shaking does some of this well, but toString is used so much, in so many places and at so many types and generic types, that it's basically unshakable, unless you keep your type from ever flowing anywhere.

Before I try making a bigger nuke, I do want to know whether toString is actually being used by the example(in this issue)? and is current mechanism failing to eliminate it or it's not trying at all for toStrings?

Also is this one using toString too:

enum E {
  a,b;
  static E random() => false as dynamic ? a : b;
};

switch(E.random()) {
  case E.a:
  case E.b:
  //..
}

if(E.random() == E.a) print('a');
else                  print('b');

hamsbrar commented 10 months ago

Also,

CHECK_1 T.toStr => there exists a static expression of type T.toStr

CHECK_2 T.toStr => there exists a cast(implicit/explicit) from T to T0 && IS_REQUIRED T0.toStr

I see why these checks fail in so many cases, cases which will be somewhat similar to ==.

But these checks can try taking == more seriously:

REQUIRED T.toStr scope => CANNOT_CHECK T
- || CHK_1 T.toStr scope
- || CHK_2 T.toStr scope
- || CHK_3 T.toStr scope
CHK_1 T.toStr scope => there exists a static expression of type T.toStr in scope
CHK_2 T.toStr scope => CHK_2_A T.toStr scope || CHK_2_B T.toStr scope || CHK_2_C T.toStr scope
CHK_2_A T.toStr scope => there exists a cast T to T0 within scope && REQUIRED T0.toStr scope
CHK_2_B T.toStr scope => there exists a call to method m in scope && REQUIRED T.toStr (scopeof m)
CHK_2_C T.toStr scope => there exists a call to method m in scope taking T as T0 && REQUIRED T0.toStr (scopeof m)
CHK_3 T.toStr scope => there exists T1 that inherits T.toStr implementation && REQUIRED T1.toStr scope
CANNOT_CHECK T => T is dynamic || T is escaping program(things like passed to externals)

If I'm not sure about which m(from type that enclose m) gets call in CHK_2_B || CHK_2_C then I'll check all ms(in hierarchy) that can get called at that point.(CHK_2_B is meant to check methods that are accessing a stored instance of T somehow but if instance can be stored as a super type, accessed and missed by the other checks then this check require more work. bit late, will check it later)

Then there are generics.

Can I see this:

// this
void f(A a, B b) {}
// as
void f<T extends A, V extends B>(T a, V b) {}
// and vice-versa

// ----------------------

// this
void f<T, V>(T a, V b) {}
// as
void f<T extends Object, V extends Object>(T a, V b) {}
// and vice-versa

// ----------------------

// this
void f<T extends Object, V extends Object>(T a, V b) {}
// as
void f(Object a, Object b) {}
// and vice-versa

// ----------------------

// this
class ToStringer<T> {
  String toStringer(T value) => value.toString();
}
// as
class ToStringer {
  String toStringer(Object value) => value.toString();
}
// and vice-versa

Doing this for ToStringer might require something more if I can access T as whatever type it's bound to. And something like void f<T extends A implements B, C>() {} definitely require venturing into one more uncharted territory :/

rakudrama commented 10 months ago

I'm not bothered by the other 97-99%. There definitely are places where I could've done better, like way better than what I actually did. Sometime I just fail not to make stupid mistakes and then things stop being worthy of any effort. But this 1-3% are the characters that no-one on this planet need and compiler knows that.

The compiler doesn't know that because some developers do need the name.

We can't minify the name. Developers have come to expect the current behaviour of Enum.toString, and the name is also available to general code via .name. There before we added .name, we would see folks doing enumValue.toString().split('.').last.

hamsbrar commented 10 months ago

The compiler doesn't know that because some developers do need the name.

We can't minify the name. Developers have come to expect the current behaviour of Enum.toString, and the name is also available to general code via .name. There before we added .name, we would see folks doing enumValue.toString().split('.').last.

How about minifying when there's no expression with static type Enum.toString or Enum.name?

But I do stand corrected, there is a possibility that someone might be using toString.split on Object in this planet but I guess they'll be okay with a little surprise. I did write thousands of lines of code in development mode using '$SomeClassName' at some point and then got surprised when that didn't work in production mode. I definitely don't see who is to blame here, the person who want enumValue as Object).toString.split or the person who give them what they need in a way that screw everyone else up :/

hamsbrar commented 10 months ago

Also, a simple flag is always an option.

hamsbrar commented 10 months ago

Took a deeper look today.

Including a new flag to -04 will probably break more code than I initially thought(Flutter is doing toString.split on dynamic). Not including the flag in -04 or adding -05 just for minifying enum names, is too much. Pragma directives (e.g dart2js:minify, dart2js:no-minify) will work but everyone(including compiler) will be maintaining them for who knows how long. The worst part is that, with these options, users will be forced to keep minified names even when they aren't using them(because names are required by Enum.toString). Or in other words, they'll still be paying for others, less than what they're paying now.

Looking at this from a different angle, there's nothing that prevents the compiler from completely eliminating names if they aren't required by toString. If a user is using names(directly Enum.name or through compareByName), they will appreciate that not only compiler kept names but also prevented them from getting minified.

So what's actually required:

Tree shake toString or Remove Enum.name from toString.
Minify names when user is using them only through compareByName.

These changes are going to break a lot of code so probably isn't a good idea. Maybe do this in future or maybe don't. This is the case where I can afford building SDK with changes I want/can-make.

This issue can be closed. thanks (:

hamsbrar commented 10 months ago

I can share accurate numbers now:

# with names
Compiled 14,721,529 input bytes (9,766,634 characters source) to 1,207,213 characters JavaScript in 8.18 seconds using 450.852 MB of memory
# without names
Compiled 14,721,377 input bytes (9,766,581 characters source) to 1,181,984 characters JavaScript in 8.06 seconds using 467.418 MB of memory

These are results of dart compile js -O4 and there are 238 part files. If you're thinking why so many enums in such a small application, well I'm writing templates in HTML and templates weigh far more than Dart output but templates are compiled to many .JS files(just like part files) and loaded on-demand(just like defer loading in Dart).

I can share more information but I doubt you'll need that(--enable-analytics was on, in both SDKs).

I removed these two lines:

https://github.com/dart-lang/sdk/blob/750050ec4088aae07304dad403b6c354a9920488/sdk/lib/core/enum.dart#L112-L113

Also,

IS_REQUIRED T.toStr => CANNOT_CHECK T || CHECK_1 T.toStr || CHECK_2 T.toStr || CHECK_3 T.toStr

CHECK_1 T.toStr => there exists a static expression of type T.toStr

CHECK_2 T.toStr => there exists a cast(implicit/explicit) from T to T0 && IS_REQUIRED T0.toStr

CHECK_3 T.toStr => there exists T1 that inherits T.toStr implementation && IS_REQUIRED T1.toStr

CANNOT_CHECK T => T is dynamic || T is escaping program(things like passed to externals)

Don't know why but these checks don't look as stupid as I initially thought.

The whole program(or should I say all methods that are declared by user) exists outside Object. Not only that, these simple checks will eliminate Enum.toString from example in this issue, and from my codebase too(along with 1200+ enum names) even if I don't give a proper treatment to generics(just keep T.toStr when T is passed to generic something).

Either I've no idea what I'm saying or it's not hard, even when you're required to do it in the first attempt 😄

hamsbrar commented 10 months ago

Also, feel free to close it anytime. Thought I should at least share the correct information.

hamsbrar commented 10 months ago

This should do the job, precisely.

// look at this

class Object {
    String toString();
    String runtimeType();
}

// as

class Object_toString {
    String toString();
}

class Object_runtimeType {
    String runtimeType();
}

// -------------------------------------------------------

class A extends Object {
    void f1() {}
    void f2() {}
    String toString() {}
}

// as

class A_f1 {
    void f1() {}
}

class A_f2 {
    void f2() {}
}

class A_toString extends Object_toString {
    String toString() {}
}

// -------------------------------------------------------

void useF1(A a){
    a.f1();
}

// as 

void useF1(A_f1 a) {
    a.f1();
}

// -------------------------------------------------------

void useF1F2(A a){
    a.f1();
    a.f2();
}

// as

void useF1F2(A_f1 & A_f2 a){
    a.f1();
    a.f2();
}

// -------------------------------------------------------

void useToString(Object o) {
    print(o.toString());
}

// as 

void useToString(Object_toString o) {
    print(o.toString());
}

// -------------------------------------------------------

void useNested(A a) {
    useToString(a);
}

// as

void useNested(Object_toString a) {
    useToString(a);
}

// -------------------------------------------------------

void useEverything(dynamic o) {
    sendToMars(o);
}

Now run these checks, one more time:

IS_REQUIRED T.toStr => CANNOT_CHECK T || CHECK_1 T.toStr || CHECK_2 T.toStr || CHECK_3 T.toStr

CHECK_1 T.toStr => there exists a static expression of type T.toStr

CHECK_2 T.toStr => there exists a cast(implicit/explicit) from T to T0 && IS_REQUIRED T0.toStr

CHECK_3 T.toStr => there exists T1 that inherits T.toStr implementation && IS_REQUIRED T1.toStr

CANNOT_CHECK T => T is dynamic || T is escaping program(things like passed to externals)

I wish I could explain more precisely but I think you'll understand what I'm trying to say.

Update after issue got closed:

The approach above is effective(took help from a professional to verify that). And now I know that I can implement it efficiently(there will be some additional work in few cases and a simple fallback strategy for parameters that I can't proceed further with). They did show me few missing pieces(but those are obvious ones). I did show them the way I see generics and kind of generics supported in Dart and it turns out that there's some work required in that part(access to type arguments in body etc but it's do-able, they also mentioned some exceptions that require proper handling if I want surgical percision but those exceptions are few core types).

So to anyone who is still confused, the problem(tree shaking toString) is/was solved.

The same professional felt sorry for all of you, they really did, and told me to find some place else.

sigmundch commented 10 months ago

Thanks @hamsbrar for the additional discussion and sharing the numbers. Marking it now as closed :)

hamsbrar commented 10 months ago

Ayi-yi-yi you forgot the label this time (:

hamsbrar commented 10 months ago

Now it's looking good. thanks again.