DRT_InstanceOf uses 49% of total CPU

scheglov commented 3 years ago

Interestingly, I observe this only when I'm running benchmarks, i.e. run Dart from Dart and drive Dart Analysis Server with requests. Maybe too intensive, and VM does not get time to do some background optimization work?

With var extensions = exportElements.whereType<ExtensionElement>().toList();

With

      var extensions = <ExtensionElement>[];
      for (var element in exportElements) {
        if (element is ExtensionElement) {
          extensions.add(element);
        }
      }

a-siva commented 3 years ago

@scheglov can you provide us the command line to reproduce this situation. //cc @alexmarkov

scheglov commented 3 years ago

You will need https://dart-review.googlesource.com/c/sdk/+/220070

I run it with time dart /Users/scheglov/Source/Dart/sdk.git/sdk/pkg/analysis_server/benchmark/benchmarks.dart run das-flutter --flutter-repository=/Users/scheglov/Source/flutter.

FYI

scheglov@scheglov-macbookpro2:~/Source/Dart/sdk.git/sdk (ss-completion-benchmark-output)$ dart --version
Dart SDK version: 2.16.0-edge.b34f39985971d099994e93012cb3b2319712e6b6 (be) (Wed Nov 10 20:38:34 2021 +0000) on "macos_x64"

alexmarkov commented 3 years ago

In this code whereType<ExtensionElement>().toList() is not fully inlined, so WhereTypeIterator.moveNext ends up testing _source.current is T for an unknown type argument T. This kind of type test currently involves subtype test cache and a slow call into runtime. I'm not sure why cache doesn't work in this case, but it is possible that it simply overflows as cache is shared among all invocations of WhereTypeIterator.moveNext and all whereType with different types.

Manual loop performs element is ExtensionElement for a known class ExtensionElement which is generated much more efficiently.

alexmarkov commented 3 years ago

I tried passing --max_subtype_cache_entries=1000 VM option to the analysis server, and it eliminates DRT_InstanceOf from profile. This means that previously runtime was called due to an overflow of subtype test cache.

alexmarkov commented 3 years ago

I see the following ways of improving performance in this case:

Make sure whereType<Foo>().toList() is fully inlined and type test is T is optimized if type argument T becomes known after inlining. Note that it may increase code size if we always inline all involved methods.
Improve performance of generic is T type tests by generating specialized type testing stubs for is tests (currently they are used only for as tests and parameter type checks).
Improve performance of generic is T type tests by using a more smart caching strategy: we could use a hash table for large caches to avoid linear lookup and make cache size unlimited.

@mraleph @mkustermann @sstrickl WDYT?

mkustermann commented 3 years ago

To maybe rephrase @alexmarkov 's suggestions (and add one) for improving general is checks:

Pay with code size: Similar to TTS we emit specialized machine code per type (possibly composable, which they are currently not) => @sstrickl has made some prototype of making TTS work for is checks, we could evaluate the code size cost of that on some of our apps (and the analyzer in this case)
Pay with memory: Similar to STC use a faster caching mechanism => Using hashing over linear search would improve performance. => Allowing unlimited cache size risk having hard-to-debug/reproduce "memory leaks" in real world apps
Increase speed of doing the actual type checks: Would be good to understand why they are so slow => Possibly by faster calls to our runtime type system (e.g. use FFI calls to avoid heavy native transitions) => Possibly by having a runtime type system implementation in generated code (e.g. to avoid the handle overhead we pay in the C++ implementation)

sstrickl commented 1 year ago

I expect that https://dart-review.googlesource.com/c/sdk/+/308941 and its recent predecessors have probably improved this situation, since STCs are now larger and also are hash-based at most sizes for quick lookup, though a TTS-like solution to is checks to avoid hitting the STC in the first place (versus the currently generated inline checks, which are limited due to being per-call site) would probably finish it off if we find in the future that InstanceOf is still a chokepoint.

mkustermann commented 1 year ago

Thanks for the update, @sstrickl !

The one issue we have with <obj> is <type> checks is that - compared to as checks, the STCs will fill up with positive & negative answers. So there can be many different <obj> classes flowing into the check, making the caches very large.

A concrete example that makes the CFE slow (/cc @jensjoha @johnniwinther ) is a stack implementation that's on the hot path (see here):

  Object? pop(NullValue? nullValue) {
    final Object? value = array[--arrayLength];
    ...
    if (value is! NullValue) {
      return value;
    }
    ...
  }

There's many different kinds of objects being pop()ed from the stack (subtypes of NullValue may even be the rare case). So the STC caches would fill up considerably.

Though for this case (and other similar situations) we could do an exhaustive inline cid-range check for the NullValue class - as there's only two types implementing it (the NullValue class itself and enum NullValues implements NullValue<Object>). Currently this is NullValue requires 2 cid-ranges (due to implements) which makes our compiler not optimize it atm - but we can change the compiler to emit exhaustive cid-range checks even if we have more than 1 cid range.

sstrickl commented 1 year ago

It actually should do an exhaustive cid-range check for the NullValue class, if I'm reading the block at https://github.com/dart-lang/sdk/blob/85de9c1d7ee818c75d165877ce1a97b0cddca71f/runtime/vm/compiler/backend/flow_graph_compiler.cc#L2984 correctly, unless the number of cid ranges are greater than kMaxNumberOfCidRangesToTest (currently, 4).

(The fact that we saw it hit the STC, since we needed 03048322028cde44d70a34b6a0a70c160e88fa7b to reduce the overhead until stubs handled hash-based caches, makes it clear that it doesn't and now I'm curious as to why as I'd only expect 2 cid ranges at most here unless I'm forgetting something... maybe there is an additional range for the bottom types as well, but those should be consecutive and thus a single additional range, making it 3 ranges.)

mkustermann commented 1 year ago

It actually should do an exhaustive cid-range check for the NullValue class, if I'm reading the block at

I think the code you refer to is for FlowGraphCompiler::GenerateCallerChecksForAssertAssignable aka as-checks but the CFE example is using is checks against NullValue

sstrickl commented 1 year ago

... right. Yes. Of course. 🤦‍♀️

dart-lang / sdk

DRT_InstanceOf uses 49% of total CPU #47680