Agoric / agoric-sdk

monorepo for the Agoric Javascript smart contract platform
Apache License 2.0
326 stars 206 forks source link

Performance impact of BOYD #4160

Closed mhofman closed 2 years ago

mhofman commented 2 years ago

Describe the bug

The bring out your dead PR (https://github.com/Agoric/agoric-sdk/pull/3801) caused a jump in swingset time. There also seem to be more time spent in swingset during a block over the life of the chain, which is usually indicative of a memory leak.

To Reproduce

Steps to reproduce the behavior:

  1. Follow instructions to setup loadgen
  2. Start a loadgen test with the following parameters: --stage.duration=10 --stage.loadgen.vault.interval=12 --stage.loadgen.amm.interval=12 --stage.loadgen.amm.wait=6 --stage.loadgen.vault.limit=10 --stage.loadgen.amm.limit=10

Expected behavior

Not such a significant increase in swingset time. No further increase over the life of the chain

Platform Environment

Additional context

The PR doesn't change the GC schedule so the impact is likely from the different accounting logic that BYOD introduces.

Screenshots

"sdkRevision","sdkCommitTime","success","walletDeployDuration","loadgenDeployDuration","cycleSuccessRate","cycleAvgDuration","stage1_cycleSuccessRate","stage1_cycleAvgDuration","stage1_avgSwingsetBlockTime","stage1_avgSwingsetPercentage","stage2_cycleSuccessRate","stage2_cycleAvgDuration","stage2_avgSwingsetBlockTime","stage2_avgSwingsetPercentage","stage3_cycleSuccessRate","stage3_cycleAvgDuration","stage3_avgSwingsetBlockTime","stage3_avgSwingsetPercentage","stage4_cycleSuccessRate","stage4_cycleAvgDuration","stage4_avgSwingsetBlockTime","stage4_avgSwingsetPercentage"
"da7c924aa","2021-11-01 23:00:26",true,238.067808,94.07785,100,219.902715,100,214.911984,1.254112,24.89,100,218.745497,1.391519,27.61,100,226.627465,1.402919,27.84,100,220.151802,1.425838,28.14
"9a47516a3","2021-11-03 01:10:09",false,238.950638,98.910524,100,234.284053,100,235.138661,1.713511,34.03,100,228.431566,1.821268,36.14,100,239.761833,2.357044,47.26,,,,

Swingset percentage over time

NB: Stage 4 is missing due to an unrelated chain stall on restart

warner commented 2 years ago

Maybe a good ticket for our new hire.

Tartuffo commented 2 years ago

FYI @arirubinstein

Tartuffo commented 2 years ago

We could probably get a lot of benefit from some tuning of parameters, probably worth doing that for MN-1, and further fine tuning controls for MN-1.1.

warner commented 2 years ago

@FUDCo and I figure that we should set this to like 1 BOYD per 1000 deliveries for now.

The risk is allowing one vat (which doesn't get constant traffic) to hold onto objects that keep other vats from dropping things. Chip points out that this is only really a problem if it impedes operations. We think we could build a follower node that periodically makes a separate copy of the kernel state, then detaches that copy from the chain, and sends a BOYD to all vats, and measures the change in the kernel object table size. This would let us keep track of how much garbage is "floating" on the chain, waiting for a BOYD, and that might tell us 1: it's not that much, we don't need to worry about it, or 2: we should do something at the next convenient point of upgrade.

Tartuffo commented 2 years ago

We still need to choose the number and test it out.

Tartuffo commented 2 years ago

@arirubinstein has done some benchmarking and has a proposed number based on that benchmarking.

There are two places it can get set - default value in the code itself (currently set to 1), and the config file for the swingset you are actually running.

We need to decide where to make the change. My preference is to change the default.

Tartuffo commented 2 years ago

@FUDCo Brian and I just discussed, can you please start working on this?