Agoric / agoric-sdk

monorepo for the Agoric Javascript smart contract platform

Increase XS `incrementalHeapCount` to reduce rate of GC #7276

Open ivanlei opened 1 year ago

ivanlei commented 1 year ago

What is the Problem Being Solved?

Organic GC along allocation paths in vats has high CPU consumption. Increase `incrementalHeapCount` in `xsCreation` to reduce the rate of GC, at the cost of memory overhead.

Description of the Design

Once JS heap memory usage is flat (Hypothesis 1), we may have a stable rate of `fxAllocateSlot` calls to base the tuning on.
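
For context, a rough sketch of the kind of configuration being tuned. The field names follow the `xsCreation` structure referenced above, but the baseline numbers are hypothetical placeholders, not the values actually compiled into `packages/xsnap`, and the real structure contains additional fields (stack, keys, parser buffers) that are omitted here.

```c
#include <stddef.h>

/* Sketch only: a reduced view of an xsCreation-style configuration.
 * All numbers below are hypothetical placeholders for illustration. */
typedef struct {
  size_t initialChunkSize;      /* bytes of chunk storage allocated up front */
  size_t incrementalChunkSize;  /* bytes added when chunk storage runs out */
  size_t initialHeapCount;      /* slots allocated up front */
  size_t incrementalHeapCount;  /* slots added when fxAllocateSlot finds the heap full */
} creation_sketch;

static const creation_sketch baseline = {
  .initialChunkSize     = 1u << 20,  /* 1 MiB (hypothetical) */
  .incrementalChunkSize = 1u << 20,  /* 1 MiB (hypothetical) */
  .initialHeapCount     = 64u << 10, /* 64 Ki slots (hypothetical) */
  .incrementalHeapCount = 64u << 10, /* 64 Ki slots (hypothetical) */
};

/* Raising incrementalHeapCount (and the initial counts) means that when
 * allocation pressure forces a collection and the heap still needs to grow,
 * it grows in larger steps, so organic GC fires less often -- at the cost
 * of a larger resident set per worker. */
```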

Security Considerations

Scaling Considerations

We need to tune the configuration to reflect our baselines.

Test Plan

raphdev commented 1 year ago

I've been experimenting with some configurations, mainly under vat 18, which is one of the ones most affected by #6661. The default configuration slows down quite quickly: comparing a worker that is never restarted against one force-reloaded from snapshot, the old worker quickly degrades until it consistently runs about twice as slow as the reloaded one.

[screenshot]

Profiling indicates a lot of the time is spent in GC: execution quickly hits the maximums, requiring the allocator to make room often. [screenshot]

Increasing the initial amounts, and the incremental amounts in particular, seemed to improve these contributors to the slowdown:

2x heap amounts and 16x slot amounts (both initial and incremental): [screenshot]

Initial tweaks were promising; it turns out that with larger amounts the original worker actually starts out faster than the reloaded ones. Afterwards I used the sampled profile to guide the tuning a bit. As suggested by @warner, I looked to reduce time in GC relative to the sampled population. I also noticed how often requesting new slots or chunks resulted in GC, and in particular in thrashing, which requires allocating more memory. I only collected profiles of the original worker; the few I took of the reloaded ones showed similar results, just with smaller GC percentages.
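
One back-of-the-envelope way to frame that tuning (my own framing, not a measurement from this issue): ignoring survivors, a free pool of N slots absorbs roughly N / A deliveries when each delivery allocates about A slots, so multiplying the slot counts by k cuts the organic-GC rate by roughly the same factor while growing the slot-pool memory proportionally. A minimal sketch with hypothetical numbers:

```c
#include <stdio.h>

/* Back-of-the-envelope estimate, not a measurement from this issue:
 * ignoring survivors, a free pool of `freeSlots` slots absorbs about
 * freeSlots / slotsPerDelivery deliveries before the allocator has to
 * collect (and possibly grow), so scaling the slot counts by k cuts
 * the organic-GC rate by roughly k. */
static double deliveriesPerGC(double freeSlots, double slotsPerDelivery) {
  return freeSlots / slotsPerDelivery;
}

int main(void) {
  const double slotsPerDelivery = 50e3; /* hypothetical allocation rate */
  const double baselineSlots = 64e3;    /* hypothetical baseline pool */
  printf("baseline:  %.1f deliveries per GC\n",
         deliveriesPerGC(baselineSlots, slotsPerDelivery));
  printf("16x slots: %.1f deliveries per GC\n",
         deliveriesPerGC(16 * baselineSlots, slotsPerDelivery));
  return 0;
}
```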

With 8x heap and 32x slot amounts: [screenshot]

Note: the spike in the middle was due to attaching a debugger and pausing to find a good entry point for profiling, which threw off the timer while execution was paused. However, the rest of the graph shows timings returning to more normal amounts, and it still shows a smaller gap between the original and reloaded workers, even though the recording is short.

At this point we're spending less time in collection: [screenshot]

With 16x heap and 32x slots: [screenshot]

This configuration had an RSS of around 700M, so it may not be ideal, especially given that the GC-time improvement was marginal:

[screenshot]

The last experiment used reduced chunk amounts with the larger slot amounts, given the small improvement relative to memory usage.

6x heap and 32x slot amounts (mostly incremental, 4x initial chunk size, and 2x initial slot size):

[screenshot]

This was the entire vat replay. Sampling around 12k deliveries:

[screenshot]

Memory usage (RSS) was around 400-500M.