dart-lang / sdk

The Dart SDK, including the VM, JS and Wasm compilers, analysis, core libraries, and more.
https://dart.dev

Performance degradation after upgrading to newer Flutter with Isolate Groups enabled by default #47672

Open maxim-saplin opened 2 years ago

maxim-saplin commented 2 years ago

A dictionary app with lots of long-living objects (millions of strings, hundreds of megabytes in the heap) and multiple isolates (7+) sees 10-second freezes (on a Galaxy Note 10) when minimizing the app and returning to it. Investigating the issue, I've identified the following cause:

  1. When returning to the app (moving it to the foreground), garbage collection is triggered on all isolates.
  2. Now that isolate groups are enabled by default (https://github.com/dart-lang/sdk/issues/46754) and Flutter's main isolate shares a heap with all isolates created via Isolate.spawn(), GC in child isolates blocks the UI (see the sketch after this list).
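For illustration, a minimal sketch (hypothetical, not the reported app) of the setup being described: isolates created with Isolate.spawn() join the spawner's isolate group and therefore share its heap, so a stop-the-world collection triggered in any of them pauses work on the same heap the UI isolate uses:

```dart
import 'dart:isolate';

// Workers spawned with Isolate.spawn() join the main isolate's group,
// so the group heap holding the big string data is shared with them.
void worker(SendPort ready) {
  // The worker does nothing; a full GC still has to trace the shared heap.
  ready.send('ready');
}

Future<void> main() async {
  // Stand-in for the dictionary data: lots of long-lived strings.
  final bigData = List<String>.generate(1 << 20, (i) => 'entry #$i');

  for (var i = 0; i < 7; i++) {
    final port = ReceivePort();
    await Isolate.spawn(worker, port.sendPort);
    await port.first;
  }
  print('7 workers share a heap holding ${bigData.length} strings');
}
```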
flutter doctor
[✓] Flutter (Channel stable, 2.5.3, on macOS 12.0.1 21A559 darwin-x64, locale en-GB)
    • Flutter version 2.5.3 at /Users/maxim_saplin/flutter
    • Upstream repository https://github.com/flutter/flutter.git
    • Framework revision 18116933e7 (4 weeks ago), 2021-10-15 10:46:35 -0700
    • Engine revision d3ea636dc5
    • Dart version 2.14.4
[!] Android toolchain - develop for Android devices (Android SDK version 31.0.0)
    • Android SDK at /Users/maxim_saplin/Library/Android/sdk
    ✗ cmdline-tools component is missing
      Run `path/to/sdkmanager --install "cmdline-tools;latest"`
      See https://developer.android.com/studio/command-line for more details.
    ✗ Android license status unknown.
      Run `flutter doctor --android-licenses` to accept the SDK licenses.
      See https://flutter.dev/docs/get-started/install/macos#android-setup for more details.
[✓] Xcode - develop for iOS and macOS
    • Xcode at /Applications/Xcode.app/Contents/Developer
    • Xcode 13.1, Build version 13A1030d
    • CocoaPods version 1.11.2
[✓] Chrome - develop for the web
    • Chrome at /Applications/Google Chrome.app/Contents/MacOS/Google Chrome
[✓] Android Studio (version 2020.3)
    • Android Studio at /Applications/Android Studio.app/Contents
    • Flutter plugin can be installed from: 🔨 https://plugins.jetbrains.com/plugin/9212-flutter
    • Dart plugin can be installed from: 🔨 https://plugins.jetbrains.com/plugin/6351-dart
    • Java version OpenJDK Runtime Environment (build 11.0.10+0-b96-7281165)
[✓] VS Code (version 1.62.0)
    • VS Code at /Applications/Visual Studio Code.app/Contents
    • Flutter extension version 3.28.0
[✓] Connected device (2 available)
    • macOS (desktop) • macos • darwin-x64 • macOS 12.0.1 21A559 darwin-x64
    • Chrome (web) • chrome • web-javascript • Google Chrome 95.0.4638.69

What is concerning about that issue:

  1. GC happens every time the app is resumed, yet the objects in the isolates are created at app launch and used in read-only mode throughout the app's lifetime. Running a heavy GC on every resume frees little memory yet consumes a lot of CPU.
  2. With the same number of objects and the same heap size, GC takes longer the more isolates are running. In my app I can control how many isolates are created to load dictionaries and do lookups: with a single parameter it is possible to spawn a given number of isolates and spread the data loaded from files between them (see the sketch after this list). With exactly the same data and the only difference being the number of isolates, here are the measured UI freezes (tested on an Android 12 x86 simulator): 1 isolate - 4 seconds, 10 isolates - 24 seconds.
  3. A long UI stall on the newer Flutter version with no change to the source code.
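Here is a sketch of what that single-parameter setup could look like (hypothetical names, not the actual ikvpack code): poolSize controls how many workers are spawned, and the same list of dictionary files is dealt out round-robin across them:

```dart
import 'dart:isolate';

// Entry point for each worker: report a command port, then load whatever
// file paths are sent to it and keep the data alive in this isolate.
void dictionaryWorker(SendPort ready) {
  final commands = ReceivePort();
  ready.send(commands.sendPort);
  final loaded = <String, List<String>>{};
  commands.listen((message) {
    final path = message as String;
    loaded[path] = <String>[]; // placeholder for real file loading
  });
}

// Spawn `poolSize` workers and spread the files between them round-robin,
// so the only thing that changes between runs is how many isolates share
// the same total amount of data.
Future<List<SendPort>> spawnPool(int poolSize, List<String> files) async {
  final workers = <SendPort>[];
  for (var i = 0; i < poolSize; i++) {
    final ready = ReceivePort();
    await Isolate.spawn(dictionaryWorker, ready.sendPort);
    workers.add(await ready.first as SendPort);
  }
  for (var i = 0; i < files.length; i++) {
    workers[i % poolSize].send(files[i]);
  }
  return workers;
}
```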

Below are screenshots from Observatory comparing the app running 1 and 10 isolates, with the same data loaded in both cases.

Observatory, home pages, 1 isolate vs 10 isolates (screenshots attached)

Observatory, allocation profile, 1 isolate vs 10 isolates (screenshots attached)

The attached file contains 2 screen recordings showing the Memory/Performance tab of Flutter DevTools. To the left is a piece of the app: when the app is opened and the input field is focused, the keyboard is shown. When minimizing the app and returning to it, the UI is frozen and the keyboard stays open until the UI thread is freed. In the memory profiler (though with some delay) it is clear that 10 isolates generate far more GC events (blue circles).

DevTools, memory, 1 isolates vs 10.zip

App details
The app seeing the performance issues is this one: https://github.com/maxim-saplin/dikt It is a simple 'lookup word -> display article' kind of app, basically a dictionary. With performance being the top priority, a number of non-functional requirements were set (such as loading 10 million words from 90 dictionary files in 2-3 seconds on a Galaxy Note 10 and conducting a case-insensitive lookup within 30 ms), and a separate package was created (https://github.com/maxim-saplin/ikvpack) which uses a custom binary format to store key/value pairs (word/compressed article). The package allows spawning a number of isolates (an isolate pool) which host the dictionary handling logic (loading files, searching words) and surfaces its API via a sort of proxy/RPC allowing cross-isolate calls (a sketch follows below). When launching the app, N-1 isolates are spawned (N being the number of CPU cores), the .ikv files are loaded in parallel and stay in the isolates, and all lookups happen in the isolates, with the main thread just consolidating the lookup results received from them. The general idea is to benefit from multi-core CPUs, offload all heavy operations to isolates, and never allow dictionary operations to block the UI thread for longer than 10 ms.
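A rough sketch of that proxy/RPC idea (hypothetical message shape, not the actual ikvpack API): each lookup is fanned out to all workers over SendPorts and the main isolate only merges the partial results:

```dart
import 'dart:async';
import 'dart:isolate';

/// Fans a lookup out to every worker isolate and consolidates the partial
/// results on the main isolate, keeping the heavy matching work off the UI
/// thread. `workerPorts` are the command ports collected when the pool was
/// spawned; each worker is assumed to reply with a List<String> of matches.
Future<List<String>> lookup(String word, List<SendPort> workerPorts) async {
  final replies = workerPorts.map((worker) async {
    final reply = ReceivePort();
    worker.send([word, reply.sendPort]);
    return ((await reply.first) as List).cast<String>();
  });
  final partial = await Future.wait(replies);
  // Consolidate: flatten the per-isolate match lists and sort them.
  return partial.expand((matches) => matches).toList()..sort();
}
```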

Following up with @aam and @mkustermann as mentioned in issue #46754.

mraleph commented 2 years ago

I think we need timelines to see what triggers these GCs.

/cc @rmacnak-google @dnfield

mraleph commented 2 years ago

/cc @mit-mit

maxim-saplin commented 2 years ago

I think we need timelines to see what triggers these GCs.

/cc @rmacnak-google @dnfield

Here are the timelines from Flutter DevTools:

Timeline screenshot and export attached: dart_devtools_2021_11_11-1636627915085000.json.zip

mraleph commented 2 years ago

Thanks for the timeline. There is one obvious problem: it seems that we receive a low memory notification from Flutter (I am not entirely sure why we would receive one - do we just get it by default when coming from the background? @dnfield might know) - and then we tell all individual isolates to do low memory GCs (see Isolate::NotifyLowMemory). In reality we should only do a low memory GC once per isolate group - doing it in each isolate seems to contribute to the stalling.

(timeline screenshot attached)

One of the GCs is disproportionately long as well - it spends 3 seconds of CPU time to chew through a ~500 MB heap where other GCs chew through it in ~200 ms. I think that is related to CPU scheduling - we probably get sent to a slow / down-clocked core.

@rmacnak-google Do you feel that maybe we should land https://dart-review.googlesource.com/c/sdk/+/203304 and apply it also to LowMemory GCs?

maxim-saplin commented 2 years ago

One of the GCs is disproportionately long as well - it spends 3 seconds of CPU time to chew through a ~500 MB heap where other GCs chew through it in ~200 ms. I think that is related to CPU scheduling - we probably get sent to a slow / down-clocked core.

JIC, data is not evenly distributed between isolates, with 80+ dictionary files they are being loaded to isolates one-by-one in whatever order FS enums file names, some files can be bigger than others, it is possible that one isolate occasionally chews more than others.

mraleph commented 2 years ago

@maxim-saplin the heap is shared between isolates (the whole point of lightweight isolates was to share the heap) so all of these GCs collect exactly the same heap.

maxim-saplin commented 2 years ago

@maxim-saplin the heap is shared between isolates (the whole point of lightweight isolates was to share the heap) so all of these GCs collect exactly the same heap.

I suspected that, yet decided to mention it anyway. After all, GC is supposed to sweep the entire isolate group in one pass, yet that does not seem to be the case.

dnfield commented 2 years ago

Low memory notifications are forwarded from the OS. Making them work per heap rather than per isolate seems like a good improvement.

maxim-saplin commented 2 years ago

Low memory notifications are forwarded from the OS. Making them work per heap rather than per isolate seems like a good improvement.

In my case GC is triggered by a mere minimize/resume cycle with nothing happening in the app. Even with one GC per isolate group, it might be reasonable to optimize further and avoid an unnecessary GC that wastes CPU cycles without freeing any bytes. Just thinking out loud.
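As an aside, here is a small app-side sketch (not code from the app in question) of how this minimize/resume cycle can be observed from Dart, which helps correlate backgrounding with the GC pauses in the timeline:

```dart
import 'package:flutter/widgets.dart';

/// Logs app lifecycle transitions: `paused` roughly corresponds to the app
/// being minimized and `resumed` to it coming back to the foreground.
class LifecycleLogger with WidgetsBindingObserver {
  void attach() {
    // `!` is needed on Flutter 2.x, where `instance` is still nullable.
    WidgetsBinding.instance!.addObserver(this);
  }

  @override
  void didChangeAppLifecycleState(AppLifecycleState state) {
    debugPrint('lifecycle: $state at ${DateTime.now()}');
  }
}
```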

mraleph commented 2 years ago

@dnfield based on a cursory glance at the source I am starting to suspect that these low memory notifications might be related to onTrimMemory, potentially with TRIM_MEMORY_UI_HIDDEN. Maybe we should not be doing a heavy synchronous full GC (with compaction) if the system is just telling us that it's a good opportunity to trim memory because the UI went into the background.

(Potentially, when the UI goes to the background the OS might also be moving the process to the LITTLE cores - which would explain the slowness of the GC, though this is pure speculation on my part.)

dnfield commented 2 years ago

TRIM_MEMORY_UI_HIDDEN is forwarded on as a Dart_NotifyLowMemory call. This seems like desirable behavior to me though - when we go to the background, we should try to shrink our memory usage as much as possible to avoid the OS killing us for using too much memory as a background process. This is also likely a good time to take our time doing a GC because it won't interrupt the user experience of the app.

However, the assumption here is that a synchronous GC will only take somewhere in the 200ms range (maybe 400ms...?). Obviously if GC is going to take 10s that will be a problem when the user comes back. But the fact that this wasn't a problem before isolate groups were on by default is suggesting to me that if we just GC the heap once on this request we should be in better shape again. Would need tracing to see though.

I'm wondering if part of what we're seeing here is scheduling contention where you have a large number of parallel GC threads fighting for core time and not making much progress due to all the scheduling interruptions... It might be easier to see that if we had a full systrace/perfetto trace though.

aam commented 2 years ago

But the fact that this wasn't a problem before isolate groups were on by default is suggesting to me that if we just GC the heap once on this request we should be in better shape again

If I understand correctly, all worker isolates keep a large amount of (the same?) data in memory: the dictionary. With the heap consolidated between those isolates (via enable-isolate-groups), the heap got several times bigger, so it's reasonable to expect that GC operations got slower. There should be fewer of them though (well, with the low memory notification fixed).

I'm not sure whether having the same dictionary data loaded into every worker isolate is the ideal solution. Perhaps sharding the dictionary data by first letter and distributing work across isolates responsible for particular shards might work better.
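A tiny sketch of that sharding idea (hypothetical names, not a drop-in for the existing pool): route each lookup to the single isolate that owns the shard for the word's first letter, so only that isolate's data is touched:

```dart
import 'dart:isolate';

/// Maps a word to a shard index based on its first (lower-cased) code unit.
int shardFor(String word, int shardCount) {
  if (word.isEmpty) return 0;
  return word.toLowerCase().codeUnitAt(0) % shardCount;
}

/// Sends the lookup request only to the isolate responsible for the shard,
/// instead of fanning it out to the whole pool.
void routeLookup(String word, List<SendPort> shardPorts, SendPort replyTo) {
  shardPorts[shardFor(word, shardPorts.length)].send([word, replyTo]);
}
```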

rmacnak-google commented 2 years ago

@mraleph I think we should not take marking speed into account during Dart_NotifyLowMemory. If an embedder wants to respond to a low memory signal by doing a GC only if it would fit within a certain time limit, that is what Dart_NotifyIdle already does.

maxim-saplin commented 2 years ago

But the fact that this wasn't a problem before isolate groups were on by default is suggesting to me that if we just GC the heap once on this request we should be in better shape again

If I understand correctly, all worker isolates keep a large amount of (the same?) data in memory: the dictionary. With the heap consolidated between those isolates (via enable-isolate-groups), the heap got several times bigger, so it's reasonable to expect that GC operations got slower. There should be fewer of them though (well, with the low memory notification fixed).

I'm not sure whether having the same dictionary data loaded into every worker isolate is the ideal solution. Perhaps sharding the dictionary data by first letter and distributing work across isolates responsible for particular shards might work better.

Each isolate keeps only a portion of the data loaded by the app. The data loaded by the app is the same, though each isolate only gets a portion of it (there are 90 files in total and they are spread across the isolate pool). In the above screenshots you can see the Observatory home pages: the first screenshot comes from the build with the main isolate + 1 worker isolate, the second one has main + 10 isolates. Notice that regardless of the number of isolates the heap size stays the same at around 514 MB.

Btw, VS Code reports an insane amount of memory allocated when there are 10 isolates (same configuration as in the Observatory screenshots above). Screenshot attached.

mkustermann commented 2 years ago

Thanks @rmacnak-google for fixing the low-memory signal issue. I've filed https://github.com/dart-lang/sdk/issues/47701 to get this cherry-picked into the beta (soon to be stable)

mraleph commented 2 years ago

@dnfield

TRIM_MEMORY_UI_HIDDEN is forwarded on as a Dart_NotifyLowMemory call. This seems like desirable behavior to me though - when we go to the background, we should try to shrink our memory usage as much as possible to avoid the OS killing us for using too much memory as a background process. This is also likely a good time to take our time doing a GC because it won't interrupt the user experience of the app.

I don't fully agree with this reasoning. I think we should be following the "When in Rome, do as the Romans do" mantra here. I have reviewed the Android code associated with dispatching and handling trim-memory and low-memory requests, specifically AppProfiler.updateLowMemStateLSP as well as ActivityThread.handleLowMemory and ActivityThread.handleTrimMemory. Based on the source code I have reviewed, I believe that Android is not going to request a GC on TRIM_MEMORY_UI_HIDDEN (or any other trim level, as far as I can see): when trimming, Android is mainly concerned with detaching and synchronously releasing native resources and potentially clearing some Java-side caches - the actual memory reclamation then happens later in GC. GCs are only requested when handling the low-memory notification, and even then they are not synchronous, but concurrent.

This reinforces my belief that our current handling of trim notifications (and potentially low memory notifications) is overly conservative and rigid. I suggest we stop doing synchronous GCs at least for trimming and instead use these messages simply as a signal to drive our concurrent GC. Furthermore, Flutter Engine should probably react to trim notifications by purging various non-essential caches and releasing native resources, similar to what the Android framework does (unless the engine already does this).

dnfield commented 2 years ago

Flutter does free caches in the framework and on the engine side on these memory events.
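For reference, the same memory-pressure events also reach application code, so an app can drop its own caches on top of what the framework and engine free. A minimal sketch (hypothetical cache, assuming WidgetsBindingObserver):

```dart
import 'package:flutter/widgets.dart';

/// Drops an app-level cache when the engine reports memory pressure
/// (the Dart-side counterpart of the cache trimming described above).
class AppCachePurger with WidgetsBindingObserver {
  final Map<String, String> articleCache = {}; // hypothetical in-app cache

  void attach() {
    // `!` is needed on Flutter 2.x, where `instance` is still nullable.
    WidgetsBinding.instance!.addObserver(this);
  }

  @override
  void didHaveMemoryPressure() {
    articleCache.clear(); // hand memory back before the OS gets aggressive
  }
}
```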

I don't have any strong opinion on what kind of GC gets run, but we do need to run a GC when this happens or the OS won't see us actually freeing memory and will be more likely to kill us. So if Dart_NotifyLowMemory chose to do an async GC that is probably fine.

mit-mit commented 2 years ago

@maxim-saplin we landed the fix so that the low-memory GC only interrupts one member of each isolate group (https://github.com/dart-lang/sdk/issues/47701), so could you please try out your app with an updated Flutter master channel build? We're cutting a new release fairly soon, so it would be greatly appreciated if you could try this out promptly.

dnfield commented 2 years ago

One thing we could try is to use the NotifyIdle mechanism when we're below TRIM_MEMORY_COMPLETE. The question is what deadline to give it.

Alternatively, we could try to add a level parameter to Dart_NotifyLowMemory that we could set higher when Android tells us it is about to kill us (TRIM_MEMORY_COMPLETE), and lower in other cases.

dnfield commented 2 years ago

It would also help to know if this issue still reproduces on master. In general, going to the background is a good time to free resources, and if we can still get away with it we'd like to.

maxim-saplin commented 2 years ago

@maxim-saplin we landed the fix so that the low-memory GC only interrupts one member of each isolate group (#47701), so could you please try out your app with an updated Flutter master channel build? We're cutting a new release fairly soon, so it would be greatly appreciated if you could try this out promptly.

Just checked a profile build on the emulator; all seems good - 1-2 seconds of delay on master compared to ±17 seconds on stable version 2.5.3. Thanks a lot for the quick fix!

Here's how master looks on my end:

Flutter 2.6.0-12.0.pre.759 • channel master • https://github.com/flutter/flutter.git
Framework • revision 9bff08b79b (2 hours ago) • 2021-11-19 00:12:58 -0500
Engine • revision a3ef1d5351
Tools • Dart 2.16.0 (build 2.16.0-20.0.dev) • DevTools 2.8.0

Also checked the release build on the Galaxy Note 10; now there's hardly any visible flicker/stutter when returning to the app. Feels way better.

dnfield commented 2 years ago

Can you by any chance share an updated timeline of this behavior?

maxim-saplin commented 2 years ago

Can you by any chance share an updated timeline of this behavior?

Here we go dart_devtools_2021_11_19-1637307652822000.json.zip

dnfield commented 2 years ago

It seems like that timeline doesn't have very much in it...

maxim-saplin commented 2 years ago

It seems like that timeline doesn't have very much in it...

Tried once again with more events toggled in the DevTools config, though it seems there are no isolate events displayed in the UI anymore - maybe something is broken in the Flutter tooling on master. Screenshot attached.

Here's another timeline: dart_devtools_2021_11_19-1637308513596000.json.zip

dnfield commented 2 years ago

That one has a bit more, great. I'm not quite sure what would be up with the tooling though.

It looks like the GC is still taking 3.76s of wall time (2.5 of CPU). It's also not actually freeing much memory. I think this overall makes sense - it's a big heap to scan and there isn't much it can free so it takes a little while. Most of that time is taken by parallel marking, and a good chunk of the parallel marking is where we're losing the wall vs. CPU time.

I'm still not super inclined to change the current behavior of doing a big GC when we go to background - although it would help to look at more apps, since I suspect this trace is a bit unusual in terms of its heap size and the amount of memory it wants to retain.

mraleph commented 2 years ago

It looks like the GC is still taking 3.76s of wall time (2.5 of CPU).

I have read a bit more code and done some experiments, and my initial hypothesis that an app going into the background also changes how its threads are scheduled was indeed correct. Once you switch away from the app it gets put into a special cgroup which restricts the CPU cores the app can be scheduled on to the slower cores (among other things - there are also some scheduler policy changes which I suspect instruct the scheduler to allocate smaller time slices to the app) - which means you are basically guaranteed to get a much, much longer GC pause if you attempt to perform a full GC while in the background.

This explains the 3.7s GC. Collecting the same heap in the foreground would take ~200-300ms.

I'm still not super inclined to change the current behavior of doing a big GC when we go to background

To me this strongly indicates that the current approach needs to change - we should use Dart_NotifyIdle instead and also land this CL which would prevent Dart_NotifyIdle from stalling the main thread.

dnfield commented 2 years ago

What value are we supposed to pass to Dart_NotifyIdle in the case of going to background though?

dnfield commented 2 years ago

I'm not opposed to making a different request of the VM when we go to the background than Dart_NotifyLowMemory, but I'm not sure how much sense Dart_NotifyIdle makes beyond the fact that some values passed to it currently trigger a concurrent GC as an implementation detail.

mraleph commented 2 years ago

@dnfield to make things cleaner we could add Dart_NotifyTrimMemory

dnfield commented 2 years ago

An alternative would be to add some kind of level parameter to NotifyLowMemory.

mkustermann commented 2 years ago

... Maybe we should not be doing a heavy synchronous full GC (with compaction) if the system is just telling us that it's a good opportunity to trim memory because the UI went into the background.

Update on this: recently 81143e16c4865cc593a239272cdde37f1d1bc8a1 landed, which removes the long synchronous GC run in response to Dart_NotifyLowMemory().