dart-lang / language

Design of the Dart language
Other
2.65k stars 202 forks source link

Add a `parse` static member to each enumerated type #2348

Open eernstg opened 2 years ago

eernstg commented 2 years ago

Cf. https://github.com/dart-lang/sdk/issues/33244. Thanks to @kevmoo for making this proposal!

We can enable a convenient parse operation on each enumerated type using the following idiom:

enum E {
  one,
  two;
  static final parse = values.byName;
}

// Usage.
void main() => print(E.parse('one'));

This issue is a proposal to implicitly induce such a static parse getter to each enumerated type (or a static method with the same usage, if that turns out to have better performance).

Several points of motivation for doing this are mentioned here. The main point is that this translation from a string to an enum value is a commonly occurring task, and doing it should be obvious and convenient.

It could be argued that E.one.toString() yields E.one, and hence the parse method should expect (or at least tolerate) the input 'E.one' rather than just 'one'. It would be easy to write a wrapper function that strips off E. if provided, or adds it if it is missing, but in this particular setting it is important to establish a good default, because it is all about convenience and brevity.

@lrhn, WDYT?

lrhn commented 2 years ago

If anything, It should be a static method, not a getter, so static E? parse(String name) => values.byName(name);. And it should probably accept both "one" and "E.one" as inputs, to be consistent with other parse functions in being an inverse for the toString method.

A problem with adding any members to enums automatically, is that it blocks that name from being used as an enum element name. That's why I'm very hesitant about adding anything to all enum declarations. (Yes, I think index was a mistake, and would prefer to remove it if I could. It should be Enum.indexOf(value) instead.)

It's not at all clear that all enums need or want to have a parse function. Or have one that is based on the declared source name of the elements (the real name, e.g., used for interop, could be something which is not a Dart identifier). Adding a default parse method will block people from having their real parse method.

We could introduce the parse member only if the class doesn't declare anything else with the same base name. It still feels a little weird to me to add methods to other people's classes that they haven't asked for.

We could allow you to declare an "abstract static" member, like static E parse(String ...);, and if you do, we fill it in with a default implementation. Unprecedented, but not completely weird. That said, we'd probably be asked to allow just static parse; to be enough to trigger the default parse function, and then it starts looking a little more weird. Especially since static E parse; would be a valid static variable.

eernstg commented 2 years ago

Adding a default parse method will block people from having their real parse method.

Indeed. It's a little bit like toString: A default implementation could be convenient, simple, and useful for debugging, or it could be elaborate enough to serve some specific application domain purposes, and any concrete choice would be non-optimal for some usages.

However, a standard static parse method which is present in (essentially) all enumerated types seems to be rather benign (especially if it accepts 'E.one' as well as a plain 'one' in the example). Maybe we should have a public survey about it.

munificent commented 2 years ago

A problem with adding any members to enums automatically, is that it blocks that name from being used as an enum element name. That's why I'm very hesitant about adding anything to all enum declarations. (Yes, I think index was a mistake, and would prefer to remove it if I could. It should be Enum.indexOf(value) instead.)

I agree on both counts. It's weird when a namespace contains unrelated kinds of entities, and the namespace of an enum type exists primarily to expose the enum values.

Could we do:

Enum<SomeEnumType>.parse('valueName');

Which is defined something like:

class Enum {
  static T parse<T extends Enum>(String name) => ...
}

Probably also:

class Enum {
  static T? tryParse<T extends Enum>(String name) => ...
}

Of course, the implementation would have to be magic since there's no way to call the static .values getter on a type parameter. Or maybe we need to push on metaclasses or typeclasses (#2194).

jakemac53 commented 2 years ago

The String name is also not really a good way of serializing an Enum value, usually I would use the index.

It would feel pretty weird to me for us to directly support parsing, but use an inefficient mechanism to do so. Yes it is a bit "safer" but only in the circumstances where each side of the communication is compiled with a different version of the library, so you already are basically asking for trouble in that scenario.

munificent commented 2 years ago

The String name is also not really a good way of serializing an Enum value, usually I would use the index.

I mean, the string representation of a floating point number isn't a good way of serializing it, but we still let you parse those. :)

I agree it's probably not a frequently used operation, but it does come up in real code and would be nice to have built-in support for it instead of having to re-implement it manually for every enum type where you need it.

natebosch commented 2 years ago

instead of having to re-implement it manually for every enum type where you need it.

One could also choose to live with writing .values.byName in place of .parse. We do have built-in support for this, but discoverability is in question.

We don't have a nice tryParse variant though, and that is a more interesting feature in my opinion.

annagrin commented 1 year ago

A question - will parsing using values.byName be fast enough (it iterates over the values and compares the strings), or it is better to generate a switch statement on string values?

Context: we parse enum values in dwds on every object returned from chrome several times, so I wondered what is the fastest way to do so.

Update:

enum E {
  object,
  set,
  list,
  map,
  function,
  record,
  type,
  recordType,
  nativeError,
  nativeObject;

  static E parse1(String value) {
    return switch (value) {
      'object' => object,
      'set' => set,
      'list' => list,
      'map' => map,
      'function' => function,
      'record' => record,
      'type' => type,
      'recordType' => recordType,
      'nativeError' => nativeError,
      'nativeObject' => nativeObject,
      _ => object,
    };
  }

  static final parse2 = values.byName;

  static final _valueMap = {
    for (var v in values) '${v.name}': v,
  };

  static E parse3(String value) => _valueMap[value]!;
}

void main(List<String> args) {
  const N = 1000000;
  if (args[0] == 'one') {
    {
      final stopwatch = Stopwatch()..start();
      for (var i = 0; i < N; i++) {
        for (var v in E.values) {
          var parsed = E.parse1(v.name);
        }
      }
      print('Parse1: ${stopwatch.elapsedMilliseconds}ms');
    }
  } else if (args[0] == 'two') {
    final stopwatch = Stopwatch()..start();
    for (var i = 0; i < N; i++) {
      for (var v in E.values) {
        var parsed = E.parse2(v.name);
      }
    }
    print('Parse2: ${stopwatch.elapsedMilliseconds}ms');
  } else {
    final stopwatch = Stopwatch()..start();
    for (var i = 0; i < N; i++) {
      for (var v in E.values) {
        var parsed = E.parse3(v.name);
      }
    }
    print('Parse3: ${stopwatch.elapsedMilliseconds}ms');
  }
}

Output

➜ dart try.dart one Parse1: 97ms ➜ dart try.dart two Parse2: 214ms ➜ dart try.dart three Parse3: 107ms

lrhn commented 1 year ago

Specialized code for the particular enum will almost certainly be faster than generic code that runs through a list and checks the name of each element.

The switch can be optimized in many ways by the compiler, up to an including generating a perfect hash, because it knows all the string values at compile-time. Unless it inlines the byName function, and gets very clever with optimizing, there is no chance that it can be as fast as that, or even as fast as a hash map. You just don't get that kind of performance without code or a data structure optimized for looking up by name.

If your need for speed is great, writing the switch or map manually is a way to get it.

We could (semi-easily) make byName be backed by a compiler-generated map. Can even be a const map. That just risks adding extra code-size to AoT compiled programs, in case the map isn't needed. Even if we can tree-shake unused maps, it's not always the right choice to choose speed over code size. There may be uses of byName which are fine with doing the linear lookup, but wouldn't be fine with adding extra constant maps. The lazily computed map used here is better, but it still has lazy initialization code for each instance. Computing it all inside byName would reuse all the code, and just require a nullable static field, but we can't abstract over those, so it would most likely be an Expando on the values list.

extension EnumByName<E extends Enum> on List<E> {
  static Expando<Map<String, Enum>> _nameMap = Expando();
  E byName(String name) =>
      ((_nameMap[this] ??= {for (var v in this) v.name: v})[name] ??
          (throw ArgumentError.value(
              name, "name", "No enum value with that name"))) as E;
}

Biggest issue is that it's not correct for mutable lists. The extension applies to any list of enum values, but it's only correct to cache for constant lists. And there is no easy way to detect that something is a constant list. (I also don't know how efficient expandos really are. And the as E will also cost.)

rakudrama commented 1 year ago

A question - will parsing using values.byName be fast enough (it iterates over the values and compares the strings), or it is better to generate a switch statement on string values?

Context: we parse enum values in dwds on every object returned from chrome several times, so I wondered what is the fastest way to do so.

Can you (1) characterize how important this is (e.g. it is 1% or 50% of your parsing) and (2) characterize the properties of the enums (e.g. number of enum members, is it expected to stay ~10 or grow to ~100 or more)

Update:

  • switch expression or a string-to-value map implementations run twice as fast in my experiments (see the code bellow, run using vm on macos).
  • platform-specific compiled run times could of course differ, but platform compilers also might optimize switch statements really well:)

I don't believe that switch-on-string is optimized really well on any platform. It degenerates to a sequence of String.== calls on the VM (JIT and AOT) (there is special-case assembly-level code that probably handles your case) and dart2js compiles switch statements to JavaScript switch statements which is essentially a sequence of === operations. dart2js code is sometimes 5x faster than the VM because of clever tricks in the JavaScript VM.

Since 5x is bigger than the effect you mention, which compiler you are using might be the most important factor for answering your question.

The 'sometimes' part is difficult to characterize, but benchmarks often do things just right to get the 5x when in real life the benefit does not materialize. Benchmarking is hard and benchmarking against an adversary like a JIT is harder.

annagrin commented 1 year ago

@rakudrama

Can you (1) characterize how important this is (e.g. it is 1% or 50% of your parsing) and (2) characterize the properties of the enums (e.g. number of enum members, is it expected to stay ~10 or grow to ~100 or more)

(1) I don't have the data for my code but I presume that the percentage would be small (more like 1%). But that is just for my case, I wonder if other cases would prefer faster enums.

(2) The number of enums will grow in my case with any special cases of an object created by DDC-compiled code that we need to translate to vm service Instances, but I would be surprised if it grows to more than ~20.

To clarify - I was't looking into improving my specific case (I can use a map or a switch if we determine that enum parsing speed matters) but start a general discussion on what implementation is better (if this is important, we could create benchmarks).

I don't believe that switch-on-string is optimized really well on any platform. It degenerates to a sequence of String.== calls on the VM (JIT and AOT)

Yeah I was thinking about const strings and stuff... so I think you are right. In the defense of the argument I could say that string comparison would not be worth going over array elements and performing string comparison:) But it looks like it gives an opportunity for other opts to kick in (like whatever worked in my case).

I was using the VM on macos in my experiments.

annagrin commented 1 year ago

@lrhn

Thanks for the example and the explanation! It seems that we are currently optimizing for size - is it always our default choice for new features?

I tried your example below (case 4) and also modified it to not extend a list (case 5), which should solve the problem for modifiable lists.

They didn't perform as good as a static map on enum itself but are still a bit better than the current list.byName parsing. Case 5 is a bit better performing than case 4 though. (UPDATE: both are actually worse than any previous ones, I had a bug in my code, now fixed:)

Not sure if this adds much value, please feel free to postpone:)

enum E {
  object,
  set,
  list,
  map,
  function,
  record,
  type,
  recordType,
  nativeError,
  nativeObject;

  static E parse1(String value) {
    return switch (value) {
      'object' => object,
      'set' => set,
      'list' => list,
      'map' => map,
      'function' => function,
      'record' => record,
      'type' => type,
      'recordType' => recordType,
      'nativeError' => nativeError,
      'nativeObject' => nativeObject,
      _ => object,
    };
  }

  static final parse2 = values.byName;

  static final _valueMap = {
    for (var v in values) '${v.name}': v,
  };

  static E parse3(String value) => _valueMap[value]!;

  static E parse4(String value) => E.values.byName2(value);

  static E parse5(String value) => EnumByName2.byName3(value, E.values);

  static E parse6(int i) => E.values[i];
}

extension EnumByName<N extends Enum> on List<N> {
  static Expando<Map<String, dynamic>> _nameMap = new Expando();
  N byName2(String name) =>
      (_nameMap[this] ??= {for (var v in this) v.name: v})[name] ??
      (throw ArgumentError.value(name, "name", "No enum value with that name"));
}

class EnumByName2 {
  static Expando<Map<String, dynamic>> _nameMap = new Expando();
  static N byName3<N extends Enum>(String name, List<N> values) =>
      (_nameMap[N] ??= {for (var v in values) v.name: v})[name] ??
      (throw ArgumentError.value(name, "name", "No enum value with that name"));
}

void main(List<String> args) {
  const N = 1000000;
  if (args[0] == 'one') {
    {
      final stopwatch = Stopwatch()..start();
      for (var i = 0; i < N; i++) {
        for (var v in E.values) {
          var parsed = E.parse1(v.name);
        }
      }
      print('Parse1: ${stopwatch.elapsedMilliseconds}ms');
    }
  } else if (args[0] == 'two') {
    final stopwatch = Stopwatch()..start();
    for (var i = 0; i < N; i++) {
      for (var v in E.values) {
        var parsed = E.parse2(v.name);
      }
    }
    print('Parse2: ${stopwatch.elapsedMilliseconds}ms');
  } else if (args[0] == 'three') {
    final stopwatch = Stopwatch()..start();
    for (var i = 0; i < N; i++) {
      for (var v in E.values) {
        var parsed = E.parse3(v.name);
      }
    }
    print('Parse3: ${stopwatch.elapsedMilliseconds}ms');
  } else if (args[0] == 'four') {
    final stopwatch = Stopwatch()..start();
    for (var i = 0; i < N; i++) {
      for (var v in E.values) {
        var parsed = E.parse4(v.name);
      }
    }
    print('Parse4: ${stopwatch.elapsedMilliseconds}ms');
  } else if (args[0] == 'five') {
    final stopwatch = Stopwatch()..start();
    for (var i = 0; i < N; i++) {
      for (var v in E.values) {
        var parsed = E.parse5(v.name);
      }
    }
    print('Parse5: ${stopwatch.elapsedMilliseconds}ms');
  } else if (args[0] == 'six') {
    final stopwatch = Stopwatch()..start();
    for (var i = 0; i < N; i++) {
      for (var i = 0; i < E.values.length; i++) {
        var parsed = E.parse6(i);
      }
    }
    print('Parse6: ${stopwatch.elapsedMilliseconds}ms');
  }
}

Results:

➜  dart try.dart one
Parse1: 101ms
➜  dart try.dart two
Parse2: 213ms
➜  dart try.dart three
Parse3: 107ms
➜  dart try.dart four
Parse4: 264ms
➜  dart try.dart five
Parse5: 239ms
➜  dart try.dart six // this one parses by index:)
Parse6: 8ms
jakemac53 commented 1 year ago

@annagrin have you tried just parsing by index? Unless you really care about human readable traffic over the network it should be much better.

annagrin commented 1 year ago

@jakemac53

@annagrin have you tried just parsing by index? Unless you really care about human readable traffic over the network it should be much better.

very true:)

I was hoping that human readable form could be fast as well but it does not look like we have a clear winner now...

lrhn commented 1 year ago

Whether we optimize for speed or size depends on a few things (which mostly means gut-feeling and feedback from the web compiler teams).

We always try to optimize for speed, within the constraints that we have, and the constraint is usually size related, like any of deployment code size, runtime memory usage or allocation churn. All of these are nice to reduce.

Size is a bigger issue if the code is hard to tree-shake. If not using the feature gives you zero code-size in the generated code, it's better than if it's a constant overhead whether you use the feature or not. (In this case, a statc parse function would be a static function, which is perfectly tree-shakable.)

A toString function is notoriously hard to tree-shake, because doing o.toString() on something typed as Object?, or as T where T is a type variable with no bound, makes it a data flow analysis problem to detect which actual toString implementations can reach there, not just type analysis. And since Iterable.toString calls toString on some of its elements (and that code is shared by List and Set too), you can pretty much assume that a toString method is not tree-shaken unless it's on a private class whose instances are never up-cast at all.

Even if code can be tree-shaken, a big piece of code that is almost always included, or smaller code that is included repeatedly like a parse function on each enum, can still increase the deployable code size more than reasonable for its effect. That's where we have to consider trade-offs.

I added the List<T extends Enum>.byName function assuming it would not be time-critical. Going from string to value is something you're most likely to do during parsing, and parsing is one of the slowest thing a computer can do — it's almost all unpredictable branches and lots of independent memory accesses. It is a short, reusable code that works for all enums, without attempting to optimize for anything but size. One short static method, which is optimal for space optimization.

The List<T extends Enum>.asNameMap(), on the other hand, is intended to give you a map from name to value which should be more efficient for repeated lookups. That's what I'd recommend using if you are doing more than a few string-to-enum computations on the same enum, since building the map has a non-trivial cost too.

We could build this map eagerly at compile-time, as a static const map per enum, and it should still be tree-shakable if not used, but it is not something we can use in general code which abstracts over enums (or if we do, we lose tree-shaking).

Or we could add a static parse function per enum, at least those which does not declare another member named parse, doing:

static MyEnum? tryParse(String source) {
  if (source.startsWith('MyEnum.')) source = source.substring('MyEnum.'.length);
  return switch (source) {
    "foo" => MyEnum.foo,
    "theBar" => MyEnum.theBar,
    "quxxius" => MyEnum.quxxius,
    _ => null;
  };
}

Not a lot of code, and a string switch, which is something we can optimize to be very efficient if we want to.

Which brings us back to the original proposal, adding a parse function automatically.

My biggest gripe with a parse/tryParse function is not size or speed, it's being against adding members implicitly in general, and to enums in particular (because we want the name-space open for enum element names), without the author explicitly asking for it. If we add a parse function automatically, without the author asking for it, and someone else uses it, and the author then makes a change which prevents the parse function from being added, that change is breaking, through no fault of the author. We did that to them! They might not even know, and the source gives no warning. (That's why you have to opt-in to things being usable as const, even when the current code could easily just allow it. The compiler should not add guarantees that future APIs have to satisfy, when the author hasn't asked for it.)

If the author can aske for the function, then it's different. Then I start worrying about having a single default implementation, which might not be what the enum author wants. (Do they want to allow MyEnum. as prefix, or are we just wasting time checking for it?) If we start allowing the user to configure which variant of a parse function they want, then we should just let them use macros. And if not, we might risk them using macros anyway, because our default isn't "just right".

You can override the toString of an enum, and if you do, the parse function would just be wrong. So we should not generate a parse function if the enum overrides toString? (We can't assume that the compiler can determine the actual toString result of each enum value, with a user-written toString method, at compile-time, so we can't generate a parse that matches.)

I'm generally against automatic language-introduced parse, toString, toJson, fromJson, and even ==/hashCode except in very restricted cases. Because any default implementation can be wrong for a particular case, and might get more wrong over time if usage patterns shift, but if it's baked into the languge, it's hard/impossible to change. Methods like those are semantic, and words mean things, but if we generate them based on just the syntax, we don't actually know whether we're doing it correctly. Not without extra annotations to guide the generation, and then we're really into macro territory again.

(The toString() of enum declarations, which was only ever intended for debugging, has proven to be completely useless, and a cost on implementations, they can't just return the _name, but have to prefix with EnumName., which nobody wants anyway. Experience has shown us that if there is any string users want, it's the "name" of the enum element. But sometimes there is no name that is relevant, and storing the name only matters for toString.

The public index getter was probably also a mistake.

The evolution path for auto-generated code when usage changes, is to add more configurability, or more different versions of the auto-generated code. Just updating to a new version of a macro package is much more flexible.

So, my suggestion for enum parse methods is that the analyzer introduces a "quick-fix"/template which introduces a static tryParse function into an enum declaration, probably using a switch and just the element names. Users can then modify that function if it's not what they want.

annagrin commented 1 year ago

So, my suggestion for enum parse methods is that the analyzer introduces a "quick-fix"/template which introduces a static tryParse function into an enum declaration, probably using a switch and just the element names. Users can then modify that function if it's not what they want.

That sounds useful!

rakudrama commented 1 year ago

So, my suggestion for enum parse methods is that the analyzer introduces a "quick-fix"/template which introduces a static tryParse function into an enum declaration, probably using a switch and just the element names. Users can then modify that function if it's not what they want.

I would not add a method containing a switch as it is a maintenance burden. It can become inconsistent with the enum declaration.

A switch expression or statement is bloated since it is essentially redundant with values in a way that is impractical to optimize. I would also want to avoid a map literal like {for (final value in values) value.name: value} since it will generate code with a loop for each instance. (If this exact abstract form can be const then it might be acceptable as we might be able to arrange that it shares some structure with values).

So we add a "quick-fix" I would suggest that it use asNameMap(), e.g.

enum MyEnum {
  one,
  two
}

enum MyEnum {
  one,
  two;

  static MyEnum? tryParse(String source) {
    // Comment out these lines to also accept strings with the enum name prefix:
    // const prefix = 'MyEnum.';
    // if (source.startsWith(prefix)) source = source.substring(prefix.length);
    return _nameMap[source];
  }

  static final _nameMap = values.asNameMap();
}