dart-lang / sdk

The Dart SDK, including the VM, JS and Wasm compilers, analysis, core libraries, and more.
https://dart.dev
BSD 3-Clause "New" or "Revised" License

VM GC can be slow with large old-space lists #30433

Closed rakudrama closed 5 years ago

rakudrama commented 7 years ago

On my workstation, the following program runs in ~30s. If I reduce the list size by removing the `* 8`, it takes ~8s.

Now, I don't know whether this program is in any way realistic. How would I tell if my app has this behaviour (i.e. the combination of large, long-lived lists and mutation)?

I instrumented the builtin hash map and hash set in runtime/lib to print list allocation sizes of 1M and over. dart2js compiles of a certain large app do create at least two hash maps with lists containing 2M elements. These are likely to be long-lived. (I thought of breaking the hash map into a multi-level hash map, but this is unlikely to help since there will be multiple smaller lists with random access that add up to 2M elements. A segmented approach would work better with lists that have exploitable access patterns, e.g. appending.)

const N = 128 * 1024 * 8;

// A large list, allocated up front, that survives into old space.
var a = new List(N);

main() {
  int ix = 0;

  var root = new Link(null);

  var sw = new Stopwatch()..start();

  for (int i = 0; i < 1000; i++) {
    for (int j = 0; j < 1000; j++) {
      // Store a freshly allocated (new-space) object into the old-space
      // list, which puts the list in the remembered set.
      a[ix] = new Object();
      var p = root;
      for (int k = 0; k < 1000; k++) {
        p = new Link(p); // churn allocations to trigger frequent scavenges
      }
      a[ix] = p.length;
      // Stride through the list so stores touch widely scattered slots.
      ix = (ix + 14999981).remainder(a.length);
    }
  }

  print(sw.elapsed);
}

class Link {
  Link next;
  Link(this.next);

  int get length {
    var i = 0;
    for (Link p = this; p != null; p = p.next) i++;
    return i;
  }
}
rmacnak-google commented 7 years ago

This might be because the remembered set tracks objects instead of slots.
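A minimal sketch (Python, not VM code) of why that granularity matters for this benchmark: if the remembered set records whole objects, one pointer store into the big list forces the scavenger to rescan every slot of the list; if it recorded individual slots, only the written slots would be revisited.

```python
# Sketch (not VM code): remembered-set scan cost at object granularity
# vs. slot granularity. One write into a large old-space array makes an
# object-granularity remembered set rescan the whole array; a
# slot-granularity set would rescan only the written slots.

ARRAY_LEN = 1024 * 1024  # slots in one large old-space backing store
DIRTY_WRITES = 10        # slots actually mutated between scavenges

def slots_scanned_object_granularity(array_len, dirty_writes):
    # The array appears in the remembered set once; the scavenger must
    # visit all of its slots to find any new-space pointers.
    return array_len if dirty_writes > 0 else 0

def slots_scanned_slot_granularity(array_len, dirty_writes):
    # Only the individually written slots are remembered and revisited.
    return dirty_writes

print(slots_scanned_object_granularity(ARRAY_LEN, DIRTY_WRITES))  # 1048576
print(slots_scanned_slot_granularity(ARRAY_LEN, DIRTY_WRITES))    # 10
```

The benchmark above maximizes this gap: it dirties one slot of a multi-megabyte list per scavenge, yet each scavenge pays for the whole list.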

rakudrama commented 6 years ago

vm folks: can you provide instructions on how to determine whether this issue actually affects my app (dart2js on huge programs)? What options do I provide to the VM, and how do I interpret the output?

rmacnak-google commented 6 years ago
--- a/runtime/vm/scavenger.cc
+++ b/runtime/vm/scavenger.cc
@@ -527,6 +527,7 @@ bool Scavenger::ShouldPerformIdleScavenge(int64_t deadline) {

 void Scavenger::IterateStoreBuffers(Isolate* isolate,
                                     ScavengerVisitor* visitor) {
+  TIMELINE_FUNCTION_GC_DURATION(Thread::Current(), "IterateStoreBuffers");
   // Iterating through the store buffers.
   // Grab the deduplication sets out of the isolate's consolidated store buffer.
   StoreBufferBlock* pending = isolate->store_buffer()->Blocks();

Run:

dart --observe --complete-timeline hello.dart

Look for IterateStoreBuffers events in Observatory's timeline and check whether they dominate the time spent in their enclosing CollectNewGeneration events.

rakudrama commented 6 years ago

Ok, I tried that, and in the middle of my 300s dart2js run it shows most GCs are 60-80% IterateStoreBuffers, e.g. 10.5ms out of a 15.1ms GC, with the next GC 195ms later.

From that I can estimate dart2js spends ~5% of its time (10.5/195) doing IterateStoreBuffers. Is there a way to read off the total (rather than estimate it)?

How can I find the large array? Things improve in the last 10% of the compile: either we GC-ed the array or we just stopped touching it.
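One way to read off a total rather than estimate it: export the timeline and sum the event durations offline. The sketch below assumes the export is in Chrome trace-event JSON (the format Observatory's timeline viewer consumes), where complete events ("ph": "X") carry a "dur" field in microseconds; the file name and exact export path are illustrative.

```python
# Sketch: total time spent in a named GC phase, summed from a timeline
# exported in Chrome trace-event JSON format. Complete events
# ("ph": "X") carry their duration in the "dur" field (microseconds).
import json

def total_event_us(trace, name):
    # The export may be a dict with a "traceEvents" list or a bare list.
    events = trace["traceEvents"] if isinstance(trace, dict) else trace
    return sum(e.get("dur", 0)
               for e in events
               if e.get("name") == name and e.get("ph") == "X")

# Inline example; a real run would use json.load(open("timeline.json")).
trace = {"traceEvents": [
    {"name": "IterateStoreBuffers", "ph": "X", "dur": 10500},
    {"name": "CollectNewGeneration", "ph": "X", "dur": 15100},
    {"name": "IterateStoreBuffers", "ph": "X", "dur": 9800},
]}
print(total_event_us(trace, "IterateStoreBuffers"))  # 20300
```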

rakudrama commented 6 years ago

@rmacnak-google How can I find the large arrays causing this problem?

rmacnak-google commented 6 years ago

It would be difficult to identify the source of an array from prints in the VM, so I'd do something like:

rakudrama commented 6 years ago

I looked into this again.

What would it take to improve this, e.g. card-marking large _Lists? It looks like it would speed up our large dart2js compilations by ~5%.
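The card-marking idea can be sketched as follows (Python, illustrative only, not the VM's implementation): the backing store is divided into fixed-size "cards"; a pointer store dirties only the containing card, and the scavenger rescans only dirty cards instead of the whole array. The card size of 128 slots here is an assumption for illustration.

```python
# Sketch of card marking for a large array: stores dirty only the card
# containing the written slot, and the scavenger scans only dirty cards
# rather than every slot of the array.
CARD_SLOTS = 128  # illustrative card size

class CardMarkedArray:
    def __init__(self, length):
        self.slots = [None] * length
        self.dirty = [False] * ((length + CARD_SLOTS - 1) // CARD_SLOTS)

    def store(self, index, value):
        # Write barrier: record the write at card granularity.
        self.slots[index] = value
        self.dirty[index // CARD_SLOTS] = True

    def scavenge_scan(self):
        # Visit only slots in dirty cards; return how many were scanned.
        scanned = 0
        for card, is_dirty in enumerate(self.dirty):
            if is_dirty:
                start = card * CARD_SLOTS
                scanned += len(self.slots[start:start + CARD_SLOTS])
                self.dirty[card] = False  # clean the card after scanning
        return scanned

a = CardMarkedArray(1024 * 1024)
a.store(5, object())
a.store(500_000, object())
print(a.scavenge_scan())  # 256: two dirty cards of 128 slots each
```

With this scheme the benchmark above would pay per scavenge for a couple of 128-slot cards instead of all 1M slots.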

/cc @vsmenon

mraleph commented 5 years ago

The change actually seems to regress dart2js compile time by 3%. Any insight @rmacnak-google @rakudrama?

rakudrama commented 5 years ago

I don't notice any improvement on the bigger compiles either, which is where I would expect a win given the huge arrays. acx_gallery is the best golem benchmark to look at.

rmacnak-google commented 5 years ago

Oh... 04941b5507666eabc1ac23027b124939a3794a03 accidentally reverts a fix in 80317f1c096308aca95b2fec98aa424ea08d81df, effectively disabling concurrent marking.

rmacnak-google commented 5 years ago

For CompileAcxGallery, Golem reports a 6.520% improvement with e15e8609aa8610f8c432f1caf2ab89358e2fce50, compared to a 3.533% regression from 04941b5507666eabc1ac23027b124939a3794a03, so card marking was worth the difference.