dart-lang / sdk

The Dart SDK, including the VM, JS and Wasm compilers, analysis, core libraries, and more.
https://dart.dev
BSD 3-Clause "New" or "Revised" License

Isolates scaling problem #55713

Open jensjoha opened 4 months ago

jensjoha commented 4 months ago

On the CFE team we have quite a few test "suites". To speed things up (and to integrate with the approval system and so on) we have a "unit test suites" runner that launches the suites in isolates (using #cores-1 isolates).

My machine has 12 cores (with hyperthreading), so it launches 11 isolates. According to htop, though, it only uses 400-500% CPU. Prefixing the run with time gives this:

(Runs on f774ec53e7b107664baf65a50395a717c3de26a2)

$ time out/ReleaseX64/dart pkg/front_end/test/unit_test_suites.dart --skipTestsThatRequireGit
[...]
real    10m34.861s
user    41m52.658s
sys     10m22.388s

Using (user+sys)/real this becomes ~494% CPU.
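The CPU-utilization figures quoted throughout this thread can be reproduced with a small helper (my own sketch; `cpuPercent` is a hypothetical name, not part of the runner):

```dart
// Compute CPU utilization from `time` output as (user + sys) / real.
// All values are in seconds.
double cpuPercent({required double real, required double user, required double sys}) {
  return (user + sys) / real * 100;
}

void main() {
  // Numbers from the single-process, 11-isolate run above.
  final real = 10 * 60 + 34.861; // 10m34.861s
  final user = 41 * 60 + 52.658; // 41m52.658s
  final sys = 10 * 60 + 22.388; // 10m22.388s
  print(cpuPercent(real: real, user: user, sys: sys).round());
}
```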

If I instead launch two processes, asking each to do half the work using only 5 isolates each:

time out/ReleaseX64/dart pkg/front_end/test/unit_test_suites.dart --skipTestsThatRequireGit -j5 --shards 2 --shard 1
time out/ReleaseX64/dart pkg/front_end/test/unit_test_suites.dart --skipTestsThatRequireGit -j5 --shards 2 --shard 2

(run in two different terminals at the same time)

I get these numbers:

[...]
real    5m31.369s
user    18m8.244s
sys     2m56.053s

Using (user+sys)/real this becomes ~381% CPU.

and

[...]
real    5m39.273s
user    18m16.937s
sys     2m54.928s

Using (user+sys)/real this becomes ~374% CPU.

Adding up these numbers gives ~755% CPU (or ~747% if computed as (user1+sys1+user2+sys2)/max(real1, real2)) --- in both cases quite a bit more than the <500% above (more than 50% more, in fact). Looking at the (real) runtime it's almost twice as fast (~87% faster, taking ~53.5% of the time).

So: are isolates expected to be slower, are we holding it wrong, or is something else going on?

@mraleph suggested that "the answer is probably our GC does not scale well for this usecase" and suggested I file a bug, so here we are.

rmacnak-google commented 3 months ago

While the GC could do better here, I think this is mostly from uneven work distribution across the isolates. I noticed the timeline often had fewer open HandleMessage events than the number of workers, especially toward the end. Here I've grouped the events by isolate instead of thread. (-j12, also --shards 4 --shard 1 because otherwise the trace is too big for Catapult to handle).

(Two screenshots from 2024-05-28: timeline HandleMessage events grouped by isolate.)

--- a/pkg/front_end/test/unit_test_suites.dart
+++ b/pkg/front_end/test/unit_test_suites.dart
@@ -6,6 +6,7 @@ import 'dart:async' show Timer;
 import 'dart:convert' show jsonEncode;
 import 'dart:io' show File, Platform, exitCode;
 import 'dart:isolate' show Isolate, ReceivePort, SendPort;
+import 'dart:developer';

 import 'package:args/args.dart' show ArgParser;
 import 'package:testing/src/chain.dart' show CreateContext, Result, Step;
@@ -502,9 +503,11 @@ class SuiteConfiguration {
 }

 Future<void> runSuite(SuiteConfiguration configuration) async {
+  var t = new TimelineTask();
   Suite suite = configuration.suite;
   String name = suite.prefix;
   String fullSuiteName = "$suiteNamePrefix/$name";
+  t.start(fullSuiteName);
   Uri suiteUri = Platform.script.resolve(suite.path ?? "${name}_suite.dart");
   if (!new File.fromUri(suiteUri).existsSync()) {
     throw "File doesn't exist: $suiteUri";
@@ -528,6 +531,7 @@ Future<void> runSuite(SuiteConfiguration configuration) async {
     shards: suite.shardCount,
     shard: configuration.shard,
   );
+  t.finish();
   if (logger.gotFrameworkError) {
     throw "Got framework error!";
   }
jensjoha commented 3 months ago

Let's try something slightly different to make it more apples to apples:

import "dart:isolate";

import "unit_test_suites.dart" as suite;

void entry(List<String> args) {
  suite.main(args);
}

Future<void> main() async {
  ReceivePort exitPort1 = new ReceivePort();
  await Isolate.spawn(entry,
      ["--skipTestsThatRequireGit", "-j5", "--shards", "2", "--shard", "1"],
      onExit: exitPort1.sendPort);
  ReceivePort exitPort2 = new ReceivePort();
  await Isolate.spawn(entry,
      ["--skipTestsThatRequireGit", "-j5", "--shards", "2", "--shard", "2"],
      onExit: exitPort2.sendPort);
  await exitPort1.first;
  await exitPort2.first;
  print("done");
}

vs

time out/ReleaseX64/dart pkg/front_end/test/unit_test_suites.dart --skipTestsThatRequireGit -j5 --shards 2 --shard 1
time out/ReleaseX64/dart pkg/front_end/test/unit_test_suites.dart --skipTestsThatRequireGit -j5 --shards 2 --shard 2

(run in two different terminals at the same time)

This should do the exact same thing, with the exact same work distribution, just using either isolates or processes.
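For completeness, the two-terminal process setup can be sketched from a single Dart entry point (my own sketch, not the script actually used in this thread; the binary path and arguments are the ones from the commands above):

```dart
import 'dart:io' show Process, ProcessStartMode;

// Sketch: launch the two shard runs as separate OS processes and wait for
// both, mirroring the two-terminal setup above. Assumes a dart-lang/sdk
// checkout with out/ReleaseX64 built.
Future<void> main() async {
  const dart = 'out/ReleaseX64/dart';
  const script = 'pkg/front_end/test/unit_test_suites.dart';
  Future<int> run(String shard) async {
    final p = await Process.start(
        dart,
        [script, '--skipTestsThatRequireGit', '-j5', '--shards', '2', '--shard', shard],
        mode: ProcessStartMode.inheritStdio);
    return p.exitCode;
  }

  // Start both before awaiting either, so they run concurrently.
  final results = await Future.wait([run('1'), run('2')]);
  print('done (exit codes: $results)');
}
```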

Using the isolates approach I get

[...]
Entire run took 0:07:25.244488.
[...]
Entire run took 0:07:25.573479.
done

real    7m38.368s
user    27m46.395s
sys     8m43.946s

So it finished in 7 minutes 38 seconds real runtime, using ~478% CPU.

Running in processes I get

[...]
Entire run took 0:03:50.552587.

real    4m4.012s
user    12m31.571s
sys     2m36.091s

and

[...]
Entire run took 0:04:09.658705.

real    4m23.415s
user    13m25.618s
sys     2m41.306s

So it finished in 4 minutes 23 seconds real runtime, using ~712% CPU. That's ~48% more CPU usage and finishing in <58% of the time (more than 74% faster).

jensjoha commented 2 months ago

With a few things landed I thought I'd check the status --- here as of f3e1b210de5ba5f99980daef19603bf11caf13a3:

Using the isolates approach I get

Entire run took 0:08:01.846790.
[...]
Entire run took 0:08:06.291553.
[...]

real    8m19.025s
user    29m12.886s
sys     8m45.307s

So it finished in 8 minutes 19 seconds real runtime, using ~456% CPU.

Running in processes I get

Entire run took 0:04:24.835708.

real    4m38.342s
user    13m33.095s
sys     2m40.488s

and

Entire run took 0:04:14.214064.

real    4m27.796s
user    12m56.087s
sys     2m35.223s

So it finished in 4 minutes 38 seconds real runtime, using ~685% CPU. That's ~50% more CPU usage and finishing in <56% of the time (more than 79% faster).

mkustermann commented 1 month ago

We (mainly @mraleph, actually) were recently looking at the performance of isolates, and one outcome was that the migration of isolates across OS threads, together with dynamic CPU frequency scaling, can make certain workloads 50%+ slower. Every time an isolate becomes idle due to pending async work, it may be resumed on another thread (which may be assigned to a different core by the OS and may need to ramp up its CPU frequency, ...).

One fix for this would be to ensure that when new work arrives we pick the idle thread that was most recently put back into the thread pool's idle list, instead of possibly waking up the thread that has been idle longest (whose core may have been put into a low-frequency state, ...).
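The suggested policy can be illustrated with a toy model (my own sketch, not VM code): treating the idle list as a stack (most-recently-idle first) rather than a queue keeps reusing "hot" workers.

```dart
// Toy model of an idle-worker list. Picking from the back (LIFO) reuses the
// most recently idled worker, whose core is likely still at high frequency;
// picking from the front (FIFO) wakes the longest-idle worker, whose core
// may have been clocked down by the governor.
String pickLifo(List<String> idle) => idle.removeLast();
String pickFifo(List<String> idle) => idle.removeAt(0);

void main() {
  // Workers became idle in this order; "t3" idled most recently.
  final idle = ['t1', 't2', 't3'];
  print(pickFifo([...idle])); // longest idle
  print(pickLifo([...idle])); // most recently idle
}
```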

This may be a better fix than https://github.com/dart-lang/sdk/commit/ad33dff4e9cd613defe294935beef1e296ed2975 (which keeps a thread in the limited-size thread pool busy, and thus unusable by others, while it is actually idle).

jensjoha commented 1 month ago

ad33dff certainly seems to fix the clash with the Linux CPU governor in the example I'm looking at (which I previously had to work around by setting the governor to "performance"):

At 50939f3d53cd546831792d4a3a458c3b9515927d: With governor "schedutil" (the default on my machine):

$ out/ReleaseX64/dart-sdk/bin/dart --deterministic pkg/front_end/test/isolates_v_processes.dart --processes -j1
Using 1 processes...
Process #1 (checking startup cost) finished after 0:00:09.037829
Process #0 finished after 0:01:58.637536
All done after 0:01:58.638838
$ out/ReleaseX64/dart-sdk/bin/dart --deterministic pkg/front_end/test/isolates_v_processes.dart --isolates -j1
Using 1 isolates...
Isolate #1 (checking startup cost) finished after 0:00:00.708356
Isolate #0 finished after 0:02:53.582366
All done after 0:02:53.584488

With governor "performance":

$ out/ReleaseX64/dart-sdk/bin/dart --deterministic pkg/front_end/test/isolates_v_processes.dart --processes -j1
Using 1 processes...
Process #1 (checking startup cost) finished after 0:00:08.173280
Process #0 finished after 0:01:44.757958
All done after 0:01:44.758897
$ out/ReleaseX64/dart-sdk/bin/dart --deterministic pkg/front_end/test/isolates_v_processes.dart --isolates -j1
Using 1 isolates...
Isolate #1 (checking startup cost) finished after 0:00:00.656622
Isolate #0 finished after 0:01:46.301763
All done after 0:01:46.303148

At ad33dff4e9cd613defe294935beef1e296ed2975, with governor "schedutil":

$ out/ReleaseX64/dart-sdk/bin/dart --deterministic pkg/front_end/test/isolates_v_processes.dart --processes -j1
Using 1 processes...
Process #1 (checking startup cost) finished after 0:00:09.167323
Process #0 finished after 0:01:47.758234
All done after 0:01:47.759746
$ out/ReleaseX64/dart-sdk/bin/dart --deterministic pkg/front_end/test/isolates_v_processes.dart --isolates -j1
Using 1 isolates...
Isolate #1 (checking startup cost) finished after 0:00:00.683944
Isolate #0 finished after 0:01:45.084126
All done after 0:01:45.086155

Looking at the scaling with the number of isolates, though, we're still in trouble. The below is at current head (744a20355b79190727b12b46c215ed563af818f1).

Here I've run e.g. out/ReleaseX64/dart-sdk/bin/dart --deterministic pkg/front_end/test/isolates_v_processes.dart --isolates -j1 and out/ReleaseX64/dart-sdk/bin/dart --deterministic pkg/front_end/test/isolates_v_processes.dart --processes -j1 (with different j configs). The runtimes, rounded down to the nearest second (and from only 1 run), are presented in this table:

| j setting | Isolates | Processes |
| --- | --- | --- |
| 1 | 103 | 106 |
| 2 | 70 | 69 |
| 4 | 93 | 47 |
| 6 | 116 | 43 |
| 8 | 127 | 43 |

and here attempted in a "mermaid xychart" where the bar is processes and the line is isolates.

xychart-beta
    title "Isolates v processes (JIT)"
    x-axis [j = 1, j = 2, j = 4, j = 6, j = 8]
    y-axis "seconds" 0 --> 150
    bar [106, 69, 47, 43, 43]
    line [103, 70, 93, 116, 127]

and for completion the same thing with an aot compile:

| j setting | Isolates | Processes |
| --- | --- | --- |
| 1 | 71 | 76 |
| 2 | 52 | 43 |
| 4 | 79 | 25 |
| 6 | 94 | 20 |
| 8 | 105 | 17 |

and again attempted in a "mermaid xychart" where the bar is processes and the line is isolates.

xychart-beta
    title "Isolates v processes (AOT)"
    x-axis [j = 1, j = 2, j = 4, j = 6, j = 8]
    y-axis "seconds" 0 --> 110
    bar [76, 43, 25, 20, 17]
    line [71, 52, 79, 94, 105]
mkustermann commented 1 month ago

https://github.com/dart-lang/sdk/commit/ad33dff4e9cd613defe294935beef1e296ed2975 certainly seems to fix the issue about the clash with the linux cpu governor in the example I'm looking at (which before I had to hack around by setting the governor to performance):

That may be so, but it's a hack and can regress other workloads (as it blocks limited-size thread pool workers). So it may be better to solve the root cause instead of adding a workaround. I believe @mraleph is working on a fix (or already has one).

Looking at the scaling with the number of isolates, though, we're still in trouble.

I think this is mainly due to spending too much time in GC and not enough time running the isolates. For example, --isolates -j8 is already ~4x faster when using a much larger new space (--new_gen_semi_max_size=500).

mkustermann commented 1 month ago

That may be so, but it's a hack and can regress other workloads ...

Checking go/lem: ad33dff4e9cd613defe294935beef1e296ed2975 regressed e.g. IsolateFibonacci by 100% in AOT mode (i.e. it became twice as slow).

jensjoha commented 3 weeks ago

Updated numbers as of today (run on 6873fcbff05468b599857550097694247d4b13ac) (aot compile, linux with governor "performance"):

| j setting | Isolates | Processes |
| --- | --- | --- |
| 1 | 71 | 71 |
| 2 | 45 | 40 |
| 4 | 34 | 23 |
| 8 | 33 | 16 |

and here attempted in a "mermaid xychart" where the bar is processes and the line is isolates.

xychart-beta
    title "Isolates v processes (AOT)"
    x-axis [j = 1, j = 2, j = 4, j = 8]
    y-axis "seconds" 0 --> 80
    bar [71, 40, 23, 16]
    line [71, 45, 34, 33]

All in all quite a bit better, but processes are ~twice as fast as isolates for 8 cores.