Agoric / agoric-sdk

monorepo for the Agoric Javascript smart contract platform
Apache License 2.0

Zoe appears to have a memory leak #5910

Closed arirubinstein closed 2 years ago

arirubinstein commented 2 years ago

While debugging the bank vat memory leak, I noticed that Zoe's vat also appears to have a leak.

[Image: Screen Shot 2022-08-06 at 11 26 39 PM]

I can provide full slogs, logs, etc. as requested for this network. I'll let it run over the weekend to generate additional data.

https://ui.honeycomb.io/agoric/datasets/instagoric-loadtest/result/G24qAncnkwR

arirubinstein commented 2 years ago

Preliminarily assigning to warner, but please reassign as needed.

arirubinstein commented 2 years ago

This usage may not actually be as bad as initially thought: over time it seems to decrease, which could simply reflect offers in flight.

[Image: Screen Shot 2022-08-07 at 2 28 43 PM]

dckc commented 2 years ago

I can try making the workers restart periodically. Every time we take a snapshot is the easiest; might as well start there.

warner commented 2 years ago

Or restart the validator every hour or two, if the cosmos/golang side doesn't have any problem with that.

erights commented 2 years ago

Yeah, that second set of graphs is interesting. Looking forward to the longer trace which should tell us whether this equilibrates.

warner commented 2 years ago

I figured out that this is unrelated to Zoe: xsnap is leaking memory when writing snapshots. Zoe just happens to get a lot of deliveries, so it writes snapshots more often.

When I looked at the vats whose xsnap workers were consuming the most memory, I built out the following table:

| vatID        | RSS at crash (GB) | VmSize (GB) | raw snapshot (MB) | deliveries | metering.allocate |
|--------------+-------------------+-------------+-------------------+------------+-------------------|
| v7-zoe       |               4.5 |         5.5 |              30   |     116402 | 84_017_184        |
| v16-bank     |               2.2 |        3.25 |              19   |     125002 | 54_657_056        |
| v28-zcf      |               2.0 |        3.07 |               8.9 |      77202 | 67_239_968        |
| v4-vattp     |               1.0 |        2.06 |               3.9 |     500202 | 42_074_144        |
| v1-bootstrap |               0.8 |        1.81 |              23.1 |      39402 | 71_434_304        |
| v15-zcf      |               0.2 |        1.26 |               6.9 |      39602 | 46_268_448        |
| v8-board     |              0.14 |        1.18 |               3.9 |      40602 | 42_074_144        |

v4-vattp has roughly 4x the deliveries of v7-zoe, so it will have written about 4x the snapshots, but the snapshots themselves are roughly 8x smaller (3.9 MB vs. 30 MB). I'm guessing the leak is proportional to the size of the snapshot, or the number of objects in the graph, so v7-zoe was growing VmSize faster.
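As a rough sanity check of that guess, the sketch below multiplies each vat's delivery count by its raw snapshot size and ranks the vats by the product. It rests on two assumptions that are not established in this thread: that the number of snapshots written scales linearly with deliveries, and that the final snapshot size is a fair stand-in for every earlier snapshot. The `leak_proxy` and `ranked_by_proxy` helpers are hypothetical names used only for illustration; the figures are copied from the table.

```python
# Figures copied from the table above.
# Tuple layout: (vatID, rss_gb, vmsize_gb, snapshot_mb, deliveries)
VATS = [
    ("v7-zoe", 4.5, 5.5, 30.0, 116402),
    ("v16-bank", 2.2, 3.25, 19.0, 125002),
    ("v28-zcf", 2.0, 3.07, 8.9, 77202),
    ("v4-vattp", 1.0, 2.06, 3.9, 500202),
    ("v1-bootstrap", 0.8, 1.81, 23.1, 39402),
    ("v15-zcf", 0.2, 1.26, 6.9, 39602),
    ("v8-board", 0.14, 1.18, 3.9, 40602),
]


def leak_proxy(snapshot_mb, deliveries):
    """Hypothetical proxy for leaked memory (arbitrary units).

    Assumes snapshot count is proportional to deliveries and that the
    leak per snapshot is proportional to the snapshot size.
    """
    return snapshot_mb * deliveries


def ranked_by_proxy(vats):
    """Return the vats sorted from largest to smallest proxy value."""
    return sorted(vats, key=lambda v: leak_proxy(v[3], v[4]), reverse=True)


if __name__ == "__main__":
    for vat_id, rss_gb, vmsize_gb, snapshot_mb, deliveries in ranked_by_proxy(VATS):
        proxy = leak_proxy(snapshot_mb, deliveries)
        print(f"{vat_id:13s} proxy={proxy:12.0f}  VmSize={vmsize_gb} GB")
```

Under these assumptions the proxy puts v7-zoe and v16-bank at the top and v8-board at the bottom, which agrees with the VmSize column. The middle of the ranking does not match exactly (v4-vattp ranks above v28-zcf by the proxy but below it in VmSize), so this is only a loose consistency check, not evidence for the exact form of the leak.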

warner commented 2 years ago

I opened #5975 to track the xsnap issue, since Zoe was an innocent bystander. It was fixed temporarily by 9e2c1da92d865ce02dc766b1072c8c3209b0cfe9 , and will be fixed better (using the latest XS release) by https://github.com/Agoric/agoric-sdk/pull/6011 . So I'll close this now, and I'll close #5975 when #6011 lands.