Closed arirubinstein closed 2 years ago
Preliminarily assigned to warner, but please reassign as needed.
This memory usage may not actually be as bad as initially thought: over time it seems to decrease, which could just reflect offers in flight.
I can try making the workers restart periodically. Every time we take a snapshot is the easiest; might as well start there.
or restart the validator every hour or two, if the cosmos/golang side doesn't have any problem with that
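To make the "restart on every snapshot" idea concrete, here is a minimal simulation sketch. Everything in it is hypothetical: `FakeWorker`, `Supervisor`, the snapshot interval, and the per-snapshot leak size are illustrative stand-ins, not the real xsnap or SwingSet APIs. It only shows why replacing the process right after a snapshot would cap a leak that accrues per snapshot write.

```python
from dataclasses import dataclass, field


@dataclass
class FakeWorker:
    """Hypothetical stand-in for a worker process (not the real xsnap API)."""
    state: int = 0         # stands in for the vat's heap contents
    leaked_bytes: int = 0  # memory the process has lost so far

    def deliver(self) -> None:
        self.state += 1

    def write_snapshot(self, leak_per_snapshot: int) -> int:
        # Model a leak that occurs on every snapshot write.
        self.leaked_bytes += leak_per_snapshot
        return self.state


@dataclass
class Supervisor:
    snapshot_interval: int   # deliveries between snapshots (illustrative value)
    leak_per_snapshot: int   # bytes leaked per write (illustrative value)
    restart_on_snapshot: bool = True
    worker: FakeWorker = field(default_factory=FakeWorker)
    deliveries: int = 0
    restarts: int = 0

    def deliver(self) -> None:
        self.worker.deliver()
        self.deliveries += 1
        if self.deliveries % self.snapshot_interval == 0:
            saved = self.worker.write_snapshot(self.leak_per_snapshot)
            if self.restart_on_snapshot:
                # Replace the process with a fresh one loaded from the
                # snapshot; the new process starts with no leaked memory.
                self.worker = FakeWorker(state=saved)
                self.restarts += 1


def simulate(n_deliveries: int, restart: bool) -> Supervisor:
    sup = Supervisor(
        snapshot_interval=200,
        leak_per_snapshot=1_000_000,
        restart_on_snapshot=restart,
    )
    for _ in range(n_deliveries):
        sup.deliver()
    return sup


if __name__ == "__main__":
    for restart in (False, True):
        sup = simulate(1000, restart)
        print(restart, sup.worker.leaked_bytes, sup.restarts)
```

In this toy model the vat state survives each restart while the leaked memory does not, at the cost of one process restart per snapshot.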
Yeah, that second set of graphs is interesting. Looking forward to the longer trace which should tell us whether this equilibrates.
I figured out that this is unrelated to Zoe: xsnap is leaking memory when writing snapshots. Zoe just happens to get a lot of deliveries, so it writes snapshots more often.
When I looked at the vats whose xsnap workers were consuming the most memory, I built out the following table:
| vatID | RSS at crash (GB) | VmSize (GB) | raw snapshot (MB) | deliveries | metering.allocate |
|-------|-------------------|-------------|-------------------|------------|-------------------|
| v7-zoe | 4.5 | 5.5 | 30 | 116402 | 84_017_184 |
| v16-bank | 2.2 | 3.25 | 19 | 125002 | 54_657_056 |
| v28-zcf | 2.0 | 3.07 | 8.9 | 77202 | 67_239_968 |
| v4-vattp | 1.0 | 2.06 | 3.9 | 500202 | 42_074_144 |
| v1-bootstrap | 0.8 | 1.81 | 23.1 | 39402 | 71_434_304 |
| v15-zcf | 0.2 | 1.26 | 6.9 | 39602 | 46_268_448 |
| v8-board | 0.14 | 1.18 | 3.9 | 40602 | 42_074_144 |
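For anyone wanting to collect comparable RSS and VmSize figures, one option on Linux is to read `/proc/<pid>/status`, which reports `VmRSS` and `VmSize` in kB. This is only a sketch: the thread does not say how the numbers in the table were gathered, and the helper names and the sample text below are made up for illustration.

```python
from typing import Dict


def parse_proc_status(text: str) -> Dict[str, int]:
    """Extract VmRSS and VmSize (in kB) from /proc/<pid>/status text."""
    wanted = {"VmRSS", "VmSize"}
    result: Dict[str, int] = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if sep and key in wanted:
            parts = rest.split()
            # The kernel formats these lines like "VmRSS:\t 4718592 kB".
            if len(parts) == 2 and parts[1] == "kB":
                result[key] = int(parts[0])
    return result


def read_memory_kb(pid: int) -> Dict[str, int]:
    """Read the live values for a process (Linux only)."""
    with open(f"/proc/{pid}/status") as f:
        return parse_proc_status(f.read())


# Made-up sample input, not real data from the network discussed here.
SAMPLE_STATUS = "Name:\tworker\nVmSize:\t 5767168 kB\nVmRSS:\t 4718592 kB\nThreads:\t1\n"

if __name__ == "__main__":
    print(parse_proc_status(SAMPLE_STATUS))
```

Sampling these values per worker over time would show whether growth tracks snapshot writes.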
v4-vattp has about 4x the deliveries of v7-zoe, so it will have written about 4x the snapshots, but the snapshots themselves are roughly 8x smaller (3.9 MB vs 30 MB). I'm guessing the leak is proportional to the size of the snapshot, or the number of objects in the graph, so v7-zoe was growing VmSize faster.
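As a rough sanity check of that guess, the table values can be combined into a crude proxy: deliveries times snapshot size. This assumes snapshots are written at a fixed delivery interval and that each vat's snapshot size stayed near its final value, neither of which is confirmed here, so the proxy is only meaningful as a relative ranking.

```python
# (vatID, RSS at crash in GB, raw snapshot in MB, deliveries), copied from the table.
ROWS = [
    ("v7-zoe", 4.5, 30.0, 116402),
    ("v16-bank", 2.2, 19.0, 125002),
    ("v28-zcf", 2.0, 8.9, 77202),
    ("v4-vattp", 1.0, 3.9, 500202),
    ("v1-bootstrap", 0.8, 23.1, 39402),
    ("v15-zcf", 0.2, 6.9, 39602),
    ("v8-board", 0.14, 3.9, 40602),
]


def snapshot_volume(snapshot_mb: float, deliveries: int) -> float:
    """Relative proxy for total snapshot bytes written (arbitrary units).

    Assumes one snapshot per fixed number of deliveries, so the snapshot
    count is proportional to the delivery count.
    """
    return snapshot_mb * deliveries


by_vat = {vat: (rss, mb, n) for vat, rss, mb, n in ROWS}

# Ratios quoted in the comment above.
delivery_ratio = by_vat["v4-vattp"][2] / by_vat["v7-zoe"][2]
size_ratio = by_vat["v7-zoe"][1] / by_vat["v4-vattp"][1]

# Vats ordered by the proxy, largest first.
ranking = [
    vat
    for vat, _rss, mb, n in sorted(
        ROWS, key=lambda row: snapshot_volume(row[2], row[3]), reverse=True
    )
]

if __name__ == "__main__":
    print(f"deliveries vattp/zoe: {delivery_ratio:.1f}x")
    print(f"snapshot size zoe/vattp: {size_ratio:.1f}x")
    for vat in ranking:
        rss, mb, n = by_vat[vat]
        print(f"{vat:13s} proxy={snapshot_volume(mb, n):12.0f} rss={rss} GB")
```

The proxy puts v7-zoe and v16-bank at the top and v8-board at the bottom, matching the RSS ordering at both ends, although the middle rows do not line up exactly.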
I opened #5975 to track the xsnap issue, since Zoe was an innocent bystander. It was fixed temporarily by 9e2c1da92d865ce02dc766b1072c8c3209b0cfe9, and a better fix (using the latest XS release) is coming in https://github.com/Agoric/agoric-sdk/pull/6011. So I'll close this now, and I'll close #5975 when #6011 lands.
While debugging the bank vat memory leak, I noticed that Zoe's vat also appears to have a leak.
I can provide full slogs, logs, etc. as requested for this network. I'll let it run over the weekend to generate additional data.
https://ui.honeycomb.io/agoric/datasets/instagoric-loadtest/result/G24qAncnkwR