Open warner opened 1 year ago
Here's the output of packages/SwingSet/misc-tools/monitor-slog-block-time.py
on the blocks immediately before and after the problem:
- block blockTime lag -> cranks(avg) computrons swingset(avg) + cosmos = proc% (avg) [sigs2/3/tot] rounds proposer
7709155 6( 6.5) 6.494 -> 6( 3.0) 188192 0.047( 0.071) + 0.007 = 0% ( 1.0) [ 0/ 0/ 0] r0 ()
7709156 6( 6.5) 6.829 -> 0( 2.6) 0 0.003( 0.069) + 0.015 = 0% ( 1.0) [ 0/ 0/ 0] r0 ()
7709157 6( 6.5) 6.730 -> 0( 2.6) 0 0.003( 0.069) + 0.005 = 0% ( 1.0) [ 0/ 0/ 0] r0 ()
7709158 6( 6.5) 6.707 -> 0( 2.6) 0 0.002( 0.069) + 0.003 = 0% ( 1.0) [ 0/ 0/ 0] r0 ()
7709159 6( 6.5) 6.698 -> 0( 2.4) 0 0.003( 0.067) + 0.007 = 0% ( 1.0) [ 0/ 0/ 0] r0 ()
7709160 6( 6.2) 6.680 -> 6( 2.6) 175502 0.047( 0.069) + 0.007 = 0% ( 1.0) [ 0/ 0/ 0] r0 ()
7709161 6( 6.2) 6.706 -> 6( 3.0) 188192 0.045( 0.071) + 0.007 = 0% ( 1.0) [ 0/ 0/ 0] r0 ()
7709162 6( 6.2) 6.697 -> 0( 3.0) 0 0.003( 0.071) + 0.003 = 0% ( 1.0) [ 0/ 0/ 0] r0 ()
7709163 6( 6.2) 11.264 -> 0( 2.6) 0 0.002( 0.052) + 0.003 = 0% ( 0.9) [ 0/ 0/ 0] r0 ()
7709164 10( 6.2) 7.137 -> 0( 2.6) 0 0.002( 0.052) + 0.006 = 0% ( 0.9) [ 0/ 0/ 0] r0 ()
7709165 6( 6.2) 7.069 -> 950( 50.1) 239910012 20.942( 1.099) + 2.592 = 79% ( 4.9) [ 0/ 0/ 0] r0 ()
7709166 6( 6.2) 26.562 -> 1171(108.7) 65050750 31.495( 2.674) + 0.021 = 94% ( 9.6) [ 0/ 0/ 0] r0 ()
7709167 25( 7.2) 37.407 -> 1022(159.4) 65101294 19.384( 3.641) + 0.006 = 81% (13.6) [ 0/ 0/ 0] r0 ()
7709168 36( 8.7) 22.958 -> 1321(225.5) 65111031 28.176( 5.050) + 0.003 = 92% (18.2) [ 0/ 0/ 0] r0 ()
7709169 22( 9.5) 35.419 -> 1235(287.2) 65027410 122.143(11.157) + 0.033 = 95% (23.0) [ 0/ 0/ 0] r0 ()
7709170 34(10.9) 143.230 -> 652(319.9) 65181729 11.763(11.745) + 0.197 = 37% (24.9) [ 0/ 0/ 0] r0 ()
7709171 142(17.7) 17.811 -> 654(351.1) 65342884 74.501(15.428) + 0.046 = 94% (29.0) [ 0/ 0/ 0] r0 ()
7709172 17(18.2) 80.691 -> 654(383.9) 65342884 24.169(16.637) + 0.023 = 81% (33.1) [ 0/ 0/ 0] r0 ()
7709173 80(21.9) 29.061 -> 684(418.1) 65276501 29.048(18.089) + 0.006 = 87% (37.4) [ 0/ 0/ 0] r0 ()
7709174 28(23.0) 31.720 -> 710(453.6) 65053869 54.125(20.795) + 0.005 = 97% (42.3) [ 0/ 0/ 0] r0 ()
- block blockTime lag -> cranks(avg) computrons swingset(avg) + cosmos = proc% (avg) [sigs2/3/tot] rounds proposer
7709175 31(24.2) 66.002 -> 710(488.8) 65024668 29.351(22.260) + 0.075 = 72% (45.9) [ 0/ 0/ 0] r0 ()
7709176 65(27.2) 34.031 -> 710(524.2) 65335032 55.021(25.011) + 0.006 = 93% (50.6) [ 0/ 0/ 0] r0 ()
7709177 33(28.6) 66.113 -> 740(561.2) 65120336 14.889(25.756) + 0.008 = 59% (53.5) [ 0/ 0/ 0] r0 ()
7709178 65(31.5) 22.919 -> 734(598.0) 65324306 20.057(26.758) + 0.021 = 74% (57.2) [ 0/ 0/ 0] r0 ()
7709179 22(32.3) 25.699 -> 736(634.8) 65077924 30.291(28.273) + 0.011 = 86% (61.6) [ 0/ 0/ 0] r0 ()
7709180 25(33.2) 34.337 -> 651(667.0) 65132172 25.147(29.528) + 0.004 = 88% (65.9) [ 0/ 0/ 0] r0 ()
7709181 33(34.6) 33.710 -> 603(696.9) 65353411 27.246(30.888) + 0.005 = 79% (69.8) [ 0/ 0/ 0] r0 ()
7709182 33(36.0) 45.166 -> 603(727.0) 65348874 25.336(32.154) + 0.057 = 59% (72.8) [ 0/ 0/ 0] r0 ()
7709183 44(37.9) 38.478 -> 603(757.1) 65351449 27.391(33.524) + 0.029 = 69% (76.3) [ 0/ 0/ 0] r0 ()
7709184 37(39.2) 33.069 -> 603(787.3) 65358283 29.154(34.981) + 0.009 = 87% (80.7) [ 0/ 0/ 0] r0 ()
7709185 32(40.5) 36.877 -> 603(770.0) 65361309 28.131(35.341) + 0.012 = 80% (80.7) [ 0/ 0/ 0] r0 ()
7709186 36(42.0) 32.898 -> 1070(764.9) 65035483 24.044(34.968) + 0.005 = 86% (80.3) [ 0/ 0/ 0] r0 ()
7709187 32(42.4) 26.881 -> 1671(797.4) 65080484 32.178(35.608) + 0.016 = 94% (81.0) [ 0/ 0/ 0] r0 ()
7709188 26(41.9) 39.221 -> 1671(814.9) 65080484 18.501(35.124) + 0.050 = 75% (80.1) [ 0/ 0/ 0] r0 ()
7709189 38(42.6) 46.131 -> 499(778.0) 65642797 36.767(30.855) + 0.011 = 58% (78.2) [ 0/ 0/ 0] r0 ()
7709190 45(43.2) 56.552 -> 485(769.7) 65598691 18.171(31.176) + 0.005 = 49% (78.8) [ 0/ 0/ 0] r0 ()
7709191 55(38.9) 31.413 -> 484(761.2) 65591264 14.048(28.153) + 0.022 = 54% (76.8) [ 0/ 0/ 0] r0 ()
7709192 30(39.5) 21.235 -> 484(752.7) 65587795 12.058(27.548) + 0.006 = 67% (76.1) [ 0/ 0/ 0] r0 ()
7709193 20(36.5) 14.711 -> 484(742.7) 65589535 12.031(26.697) + 0.003 = 89% (76.2) [ 0/ 0/ 0] r0 ()
7709194 14(35.8) 18.415 -> 484(731.4) 65590781 10.169(24.499) + 0.059 = 64% (74.6) [ 0/ 0/ 0] r0 ()
- block blockTime lag -> cranks(avg) computrons swingset(avg) + cosmos = proc% (avg) [sigs2/3/tot] rounds proposer
7709195 12(34.9) 27.489 -> 484(720.1) 65592283 20.771(24.070) + 0.028 = 65% (74.2) [ 0/ 0/ 0] r0 ()
7709196 14(32.3) 45.895 -> 485(708.9) 65600037 11.029(21.870) + 0.022 = 48% (72.0) [ 0/ 0/ 0] r0 ()
blocks continue to be full (about 65M computrons each) until block 7709257, where the queue is finally drained and we're able to catch up:
- block blockTime lag -> cranks(avg) computrons swingset(avg) + cosmos = proc% (avg) [sigs2/3/tot] rounds proposer
7709197 39(19.8) 22.831 -> 484(363.2) 65596431 9.008(12.744) + 0.005 = 64% (60.9) [ 68/ 0/ 68] r3 (BB5E2C)
7709198 22(20.2) 15.073 -> 561(402.8) 65563866 21.843(14.564) + 0.003 = 80% (64.9) [ 72/ 0/ 72] r2 (BB5E2C)
7709199 14(19.2) 24.778 -> 619(438.8) 65005698 8.583(13.567) + 0.003 = 82% (67.8) [ 65/ 0/ 65] r1 (BB5E2C)
7709200 19(19.1) 16.110 -> 617(464.3) 65285790 8.785(12.884) + 0.006 = 83% (70.0) [ 73/ 0/ 73] r0 (2A3AC9)
7709201 10(18.0) 20.463 -> 619(483.6) 65005545 8.849(12.380) + 0.027 = 61% (68.9) [ 71/ 0/ 71] r0 (CFC45F)
7709202 15(17.7) 15.666 -> 618(498.6) 65290598 8.964(12.000) + 0.003 = 87% (71.0) [ 71/ 0/ 71] r1 (CFC45F)
7709203 10(16.9) 15.298 -> 619(510.6) 65005683 8.846(11.685) + 0.022 = 93% (73.2) [ 73/ 0/ 73] r0 (E29E61)
7709204 10(16.3) 20.500 -> 617(520.3) 65284344 8.575(11.402) + 0.015 = 57% (71.7) [ 71/ 0/ 71] r0 (23E1B0)
7709205 14(16.1) 15.608 -> 619(528.5) 65006237 9.291(11.226) + 0.003 = 94% (73.7) [ 71/ 0/ 71] r1 (23E1B0)
7709206 10(15.6) 29.852 -> 818(550.8) 65119238 25.071(12.291) + 0.015 = 62% (72.8) [ 72/ 0/ 72] r0 (47C2F8)
7709207 15(15.6) 49.541 -> 1002(583.0) 65108717 28.765(13.468) + 0.049 = 75% (73.0) [ 67/ 0/ 67] r1 (BF9BF3)
7709208 32(16.7) 46.817 -> 1005(611.1) 65016753 18.499(13.803) + 0.047 = 97% (74.6) [ 65/ 0/ 65] r2 (BF9BF3)
7709209 33(17.7) 48.654 -> 1006(635.8) 65012879 35.034(15.130) + 0.011 = 68% (74.2) [ 72/ 0/ 72] r1 (996453)
7709210 28(18.3) 56.596 -> 966(655.2) 65178791 27.960(15.885) + 0.020 = 96% (75.5) [ 70/ 0/ 70] r2 (996453)
7709211 40(19.5) 45.159 -> 924(670.2) 65002751 10.241(15.571) + 0.004 = 94% (76.6) [ 70/ 0/ 70] r1 (BB5E2C)
7709212 35(20.3) 21.085 -> 924(683.5) 65004071 10.268(15.292) + 0.003 = 93% (77.5) [ 71/ 0/ 71] r1 (BB5E2C)
7709213 12(19.9) 40.445 -> 542(676.5) 250284328 44.655(16.760) + 0.006 = 67% (77.0) [ 73/ 0/ 73] r0 (9A4CCB)
- block blockTime lag -> cranks(avg) computrons swingset(avg) + cosmos = proc% (avg) [sigs2/3/tot] rounds proposer
7709214 15(19.9) 71.403 -> 416(697.2) 65365580 8.509(16.677) + 0.059 = 86% (78.2) [ 71/ 0/ 71] r1 (026653)
7709215 49(21.8) 39.396 -> 415(693.8) 65358871 7.060(15.992) + 0.003 = 45% (77.2) [ 76/ 0/ 76] r1 (026653)
7709216 9(21.6) 37.957 -> 415(690.3) 65357798 7.450(15.813) + 0.004 = 93% (79.4) [ 72/ 0/ 72] r0 (3B6773)
7709217 8(20.0) 37.968 -> 415(686.9) 65358636 20.962(16.410) + 0.045 = 97% (81.0) [ 76/ 0/ 76] r0 (CFC45F)
7709218 9(19.4) 52.596 -> 408(679.2) 65059232 5.153(15.576) + 0.008 = 66% (80.3) [ 67/ 0/ 67] r0 (E34C31)
7709219 24(19.9) 42.313 -> 405(668.5) 65404941 4.542(15.374) + 0.006 = 34% (77.9) [ 75/ 0/ 75] r1 (E34C31)
7709220 6(19.2) 41.316 -> 405(657.9) 65404691 4.611(15.165) + 0.009 = 91% (78.3) [ 77/ 0/ 77] r0 (E14F81)
7709221 7(19.1) 43.266 -> 405(647.2) 65403452 4.622(14.954) + 0.005 = 51% (77.8) [ 77/ 0/ 77] r0 (27C9BF)
7709222 6(18.6) 42.273 -> 405(636.5) 65404398 4.466(14.729) + 0.005 = 92% (78.1) [ 76/ 0/ 76] r0 (E29E61)
7709223 6(18.4) 41.145 -> 940(652.6) 65072377 19.573(15.265) + 0.003 = 97% (78.3) [ 77/ 0/ 77] r0 (2A3AC9)
7709224 6(18.2) 57.550 -> 1238(683.6) 65004024 12.381(15.456) + 0.003 = 81% (79.5) [ 71/ 0/ 71] r0 (BB5E2C)
7709225 18(18.4) 56.115 -> 1405(723.0) 65005301 11.064(15.544) + 0.021 = 72% (78.4) [ 70/ 0/ 70] r0 (D97CBE)
7709226 18(18.8) 52.699 -> 551(709.6) 65016601 6.953(14.638) + 0.006 = 66% (78.6) [ 78/ 0/ 78] r1 (D97CBE)
7709227 10(18.6) 50.483 -> 336(676.3) 65316156 5.489(13.475) + 0.017 = 86% (79.2) [ 74/ 0/ 74] r0 (A21688)
7709228 7(17.3) 49.547 -> 336(642.9) 65315525 15.434(13.321) + 0.005 = 96% (79.2) [ 77/ 0/ 77] r0 (A177B0)
7709229 12(16.2) 55.343 -> 337(609.4) 65321205 5.984(11.869) + 0.015 = 71% (79.3) [ 61/ 0/ 61] r1 (765A78)
7709230 19(15.8) 42.847 -> 336(577.9) 65315624 5.864(10.764) + 0.004 = 92% (79.1) [ 69/ 0/ 69] r1 (765A78)
7709231 8(14.2) 41.466 -> 385(551.0) 65191168 4.897(10.497) + 0.009 = 86% (78.7) [ 82/ 0/ 82] r0 (BB89FD)
7709232 6(12.8) 43.377 -> 390(524.2) 65708481 4.795(10.223) + 0.005 = 61% (77.1) [ 79/ 0/ 79] r0 (0DB33A)
7709233 6(12.4) 55.678 -> 391(516.7) 65714431 5.134( 8.247) + 0.029 = 27% (75.1) [ 80/ 0/ 80] r0 (CFC45F)
- block blockTime lag -> cranks(avg) computrons swingset(avg) + cosmos = proc% (avg) [sigs2/3/tot] rounds proposer
7709234 7(12.1) 54.316 -> 391(515.5) 65713231 4.657( 8.055) + 0.003 = 90% (75.3) [ 71/ 0/ 71] r0 (5F329F)
7709235 6( 9.9) 53.643 -> 390(514.2) 65708086 4.610( 7.932) + 0.012 = 87% (77.3) [ 80/ 1/ 81] r0 (843872)
7709236 11(10.0) 47.731 -> 943(540.6) 65126968 10.045( 8.062) + 0.006 = 95% (77.4) [ 67/ 0/ 67] r1 (843872)
7709237 8(10.0) 50.380 -> 1287(584.2) 65398089 22.792( 8.153) + 0.004 = 97% (77.4) [ 69/ 0/ 69] r0 (BB5E2C)
7709238 8( 9.9) 73.646 -> 1642(645.9) 65050770 10.633( 8.427) + 0.016 = 55% (76.9) [ 75/ 0/ 75] r0 (C22CB0)
7709239 20( 9.8) 65.480 -> 336(642.5) 65445662 6.789( 8.540) + 0.057 = 85% (79.4) [ 73/ 0/ 73] r0 (06AB35)
7709240 15(10.2) 58.000 -> 336(639.0) 65314320 5.479( 8.583) + 0.004 = 89% (79.3) [ 75/ 0/ 75] r1 (06AB35)
7709241 7(10.2) 57.175 -> 336(635.5) 65313635 6.510( 8.678) + 0.030 = 90% (81.3) [ 75/ 0/ 75] r0 (9E6A8A)
7709242 6(10.2) 58.190 -> 336(632.1) 65313202 6.688( 8.789) + 0.007 = 93% (81.3) [ 75/ 0/ 75] r0 (E29E61)
7709243 6(10.2) 61.644 -> 342(602.2) 278147375 21.960( 8.908) + 0.005 = 88% (80.9) [ 75/ 0/ 75] r0 (A9F209)
7709244 7(10.2) 77.019 -> 386(559.6) 65570782 4.479( 8.513) + 0.029 = 91% (81.4) [ 74/ 0/ 74] r0 (2A3AC9)
7709245 21(10.4) 71.150 -> 390(508.9) 65708256 4.443( 8.182) + 0.023 = 29% (79.2) [ 74/ 0/ 74] r1 (2A3AC9)
7709246 6( 9.8) 70.097 -> 390(500.8) 65707574 5.156( 8.092) + 0.016 = 91% (80.5) [ 75/ 1/ 76] r0 (BCF291)
7709247 7( 9.7) 73.149 -> 390(503.5) 65707249 4.486( 8.042) + 0.005 = 47% (78.5) [ 76/ 0/ 76] r0 (CFC45F)
7709248 11( 9.8) 67.156 -> 391(506.2) 65714344 4.719( 7.506) + 0.005 = 90% (78.2) [ 79/ 0/ 79] r1 (CFC45F)
7709249 6( 9.6) 78.223 -> 1131(546.0) 65078323 10.731( 7.743) + 0.003 = 46% (77.0) [ 78/ 0/ 78] r0 (BB5E2C)
7709250 6( 8.9) 86.226 -> 1339(596.1) 65038299 11.131( 8.007) + 0.003 = 77% (76.2) [ 70/ 0/ 70] r0 (8BA38C)
7709251 11( 9.1) 87.041 -> 1554(654.5) 65029239 22.834( 8.904) + 0.005 = 97% (76.7) [ 73/ 0/ 73] r0 (0DB33A)
7709252 10( 9.2) 100.462 -> 1399(705.0) 65051279 12.786( 9.303) + 0.034 = 95% (78.4) [ 73/ 0/ 73] r0 (0FED4E)
7709253 22(10.1) 91.695 -> 1114(741.1) 65069571 20.106(10.052) + 0.017 = 97% (82.0) [ 73/ 0/ 73] r1 (DC4CCA)
- block blockTime lag -> cranks(avg) computrons swingset(avg) + cosmos = proc% (avg) [sigs2/3/tot] rounds proposer
7709254 16(10.5) 96.330 -> 1409(792.0) 65086102 11.202(10.379) + 0.017 = 95% (82.2) [ 73/ 0/ 73] r1 (1CE894)
7709255 21(11.2) 87.210 -> 1485(846.8) 65039781 12.390(10.768) + 0.027 = 94% (82.6) [ 74/ 0/ 74] r1 (1CE894)
7709256 9(11.2) 91.243 -> 941(846.7) 61174777 22.553(11.393) + 0.003 = 97% (82.7) [ 74/ 0/ 74] r0 (4EBC59)
7709257 11(11.3) 103.351 -> 6(782.6) 219917 0.040(10.256) + 0.025 = 10% (78.3) [ 73/ 0/ 73] r0 (E29E61)
7709258 39(12.8) 64.821 -> 0(700.5) 0 0.002( 9.724) + 0.005 = 1% (75.6) [ 80/ 2/ 82] r3 (32B4F1)
7709259 6(12.2) 59.245 -> 0(683.8) 0 0.002( 9.385) + 0.005 = 1% (71.4) [ 82/ 0/ 82] r0 (E29E61)
7709260 11(11.9) 48.725 -> 6(667.2) 188192 0.037( 9.113) + 0.008 = 8% (67.4) [ 82/ 0/ 82] r1 (E29E61)
7709261 6(11.9) 43.447 -> 0(650.5) 0 0.002( 8.787) + 0.003 = 0% (62.9) [ 82/ 1/ 83] r0 (BB5E2C)
7709262 11(12.2) 32.892 -> 0(633.6) 0 0.002( 8.453) + 0.003 = 1% (58.3) [ 83/ 1/ 84] r1 (BB5E2C)
7709263 6(12.2) 27.301 -> 0(616.5) 0 0.002( 7.355) + 0.009 = 2% (54.0) [ 84/ 0/ 84] r0 (CFC45F)
7709264 6(12.1) 21.721 -> 0(597.2) 0 0.002( 7.131) + 0.003 = 1% (49.5) [ 85/ 0/ 85] r0 (843872)
7709265 6(11.3) 16.118 -> 6(578.0) 175502 0.034( 6.911) + 0.015 = 11% (48.6) [ 85/ 0/ 85] r0 (A21688)
7709266 6(11.3) 10.625 -> 0(558.5) 0 0.002( 6.653) + 0.003 = 1% (44.1) [ 85/ 1/ 86] r0 (026653)
7709267 7(11.3) 11.318 -> 6(539.4) 194537 0.337( 6.446) + 0.019 = 4% (41.9) [ 86/ 0/ 86] r0 (2A3AC9)
7709268 10(11.3) 7.278 -> 6(520.1) 175502 0.042( 6.212) + 0.004 = 0% (37.4) [ 86/ 0/ 86] r1 (2A3AC9)
7709269 6(11.3) 11.994 -> 6(463.9) 188192 0.043( 5.678) + 0.023 = 0% (35.1) [ 87/ 0/ 87] r0 (0CCD5C)
7709270 11(11.6) 6.819 -> 6(397.2) 175502 0.040( 5.123) + 0.007 = 0% (31.3) [ 87/ 0/ 87] r1 (0CCD5C)
7709271 6(11.3) 6.994 -> 6(319.8) 175502 0.043( 3.983) + 0.005 = 0% (26.5) [ 87/ 0/ 87] r0 (0DB33A)
7709272 6(11.1) 7.102 -> 0(249.8) 0 0.003( 3.344) + 0.004 = 0% (21.7) [ 87/ 0/ 87] r0 (5F329F)
7709273 6(10.3) 6.982 -> 0(194.2) 0 0.003( 2.339) + 0.006 = 0% (16.8) [ 87/ 1/ 88] r0 (23E1B0)
- block blockTime lag -> cranks(avg) computrons swingset(avg) + cosmos = proc% (avg) [sigs2/3/tot] rounds proposer
7709274 6( 9.8) 7.113 -> 0(123.7) 0 0.006( 1.779) + 0.003 = 0% (12.0) [ 88/ 1/ 89] r0 (BB5E2C)
7709275 6( 9.1) 7.395 -> 0( 49.5) 0 0.002( 1.160) + 0.003 = 0% ( 7.3) [ 89/ 0/ 89] r0 (3B4946)
7709276 6( 8.9) 7.408 -> 6( 2.7) 181847 0.045( 0.035) + 0.006 = 0% ( 2.5) [ 89/ 0/ 89] r0 (A177B0)
7709277 6( 8.7) 12.279 -> 6( 2.7) 175502 0.076( 0.036) + 0.023 = 0% ( 2.0) [ 89/ 0/ 89] r0 (CFC45F)
We'll start by examining block 7709165 (which recorded 240M computrons), and 7709213 (250Mc), and 7709243 (280Mc). It's possible that the Big JS Hammer evaluation included a single delivery that took a very long time to complete (not yielding it's time slice until well beyond the per-block limit of 65Mc), followed by a long run-out of additional work (which was properly segmented into separate blocks). And maybe this happened three times.
We've identified three potential issues:
runPolicy
(which limits the amount of work we do in each block, by ending the block after a "bean counter" has grown beyind the configured limit) is apparently not counting the create-vat
/ startVat
events properly (#6639)
agd query swingset params
and look for beans_per_unit.blockComputeLimit
)beans_per_unit.vatCreation
)startVat
delivery message)startVat
spends its 92Mc, then the runPolicy says "ok time to end that block", then the second startVat
runs in the second block, then all the remaining work happens in subsequent blocks
- I'm worried that we're somehow iterating through all registered/provisioned vat-wallet accounts and notifying everybody that they have new zero-balance Purses available
Yes... each smart wallet is waiting for new vbank assets such as DAI_axl and DAI_grv:
notifiers / subscriptions adds a new wrinkle to the "only do work proportional to the size of the incoming message" design constraint.
cc @michaelfig @turadg @erights
proposal 17 on NG has a pretty clean view of the code that started running.
Or try agd --node=https://main.rpc.agoric.net:443 query gov proposal 17
.
startPSM.makeAnchorAsset(noMinting, { options: { anchorOptions } })
starts a mintHolder
startPSM.startPSM(permittedPowers, config)
starts a contractGovernor
psm
analytics.inter.trade shows over 400 smart wallets provisioned. Each of them was waiting to hear about new vbank assets:
nifty chart of block times and missing validators (ack: polkachu)
The Munin graphs on a follower node show the state-size growth when the proposal was executed:
6 new XS heap snapshots, one per new vat:
but the new vats are small, so the total size consumed by heap snapshots only grew by about 20MB (compressed, stored as plain files):
The initial/startup transcripts for those new vats consumed about 150MB (uncompressed, stored in SQLite) (note: the transcript space normally grows by about 40MB per day)
The kvStore
size grew by 12MB:
(for comparison, on a node that prunes rather aggressively, the cosmos-side state directory normally grows by about 780MB/day)
The blocks that included a startVat
consumed a lot of computrons (which the runPolicy
didn't correctly account for), but they didn't actually require a lot of runtime.
The first block was 7709165, which handled the proposal execution. It consumed 240M computrons ("240Mc"), and 20.942s of swingset execution time. The two startVat
s inside it each took 88Mc and 1.8s. The runPolicy
bug meant the startVat
s only counted 300kc towards the block's 65Mc limit, so the remaining 64.4Mc were spent on other deliveries (which required 17.34s). That's not great but not too bad, leading to a block time of maybe 25s-27s (which, because of the way blockTime
is assigned to a later block, appears in my output two blocks later, as the 7709167 25
line)
- block blockTime lag -> cranks(avg) computrons swingset(avg) + cosmos = proc% (avg) [sigs2/3/tot] rounds proposer
7709165 6( 6.2) 7.069 -> 950( 50.1) 239910012 20.942( 1.099) + 2.592 = 79% ( 4.9) [ 0/ 0/ 0] r0 ()
7709166 6( 6.2) 26.562 -> 1171(108.7) 65050750 31.495( 2.674) + 0.021 = 94% ( 9.6) [ 0/ 0/ 0] r0 ()
7709167 25( 7.2) 37.407 -> 1022(159.4) 65101294 19.384( 3.641) + 0.006 = 81% (13.6) [ 0/ 0/ 0] r0 ()
This is followed by a series of blocks which each use the full 65Mc. The amount of wallclock time they take varies, however, from as little as 4s, to a lot of 20-30s blocks, a pair of 55s, and a single 122s outlier.
7709168 36( 8.7) 22.958 -> 1321(225.5) 65111031 28.176( 5.050) + 0.003 = 92% (18.2) [ 0/ 0/ 0] r0 ()
7709169 22( 9.5) 35.419 -> 1235(287.2) 65027410 122.143(11.157) + 0.033 = 95% (23.0) [ 0/ 0/ 0] r0 ()
Each block is using the expected 65M-ish computrons, but some kinds of deliveries take more wallclock time than others (we always hoped the computrons/second ratio would be roughly constant; this is what happens when it is not, and when our mix of delivery types is not random enough to even things out). So some blocks take a lot more time than others, and of course we can't just end a block after five seconds without losing determinism.
I'm still digging into the longer blocks. I think we're seeing the same "long time to serialize" problem that we observed in an earlier testnet, which we thought we fixed by updating balances less frequently. But if we're updating a whole bunch of wallet balances at the same time, we're going to be performing the same kind of serialization in a row, which exacerbates the uneven computrons/second ratio problem.
p.s. see
BTW, the total proposal execution (which introduced two denoms) required:
We're guessing this is proportional to (number of denoms introduced) x (number of provisioned wallet accounts)
. So one (slight) mitigation is to only add one denom at a time, which would give the chain more time to recover before taking on the second half.
With 410 wallets provisioned, the execution requirements for a single denom will probably be:
per provisioned wallet.
FYI, I've confirmed a slowdown of XS workers over time, which probably contributed to this issue. This specific degradation should not occur and we're investigating it. That said it's not a cause in this case, and probably just exacerbated the core problems identified above.
We've identified three potential issues:
I recall another from our discussions:
We had a postmortem meeting today, some of the conclusions:
The slowdown was caused by a multiplication of several factors, all of which are worth investigating and fixing:
To address the per-wallet work, we need to make things more lazy, and/or look for ways to just do less work per vat.
To address the XS metering precision, we should sort the cranks by their wallclock / computrons
ratio, then look carefully at the biggest vs the smallest, and try to figure out which low-level JS operations were involved in each. Perhaps we could run a lot of the same-sized deliveries in a loop, while we point a profiler at the worker, and build up a flamegraph of which XS .c
functions get called during each. If we're lucky there may be a few specific XS functions that account for lots of wallclock but not a lot of computrons, and we can change their metering behavior to increase the computron consumption. Likewise there might be code that increments the meter a lot but actually runs very quickly, which could use adjusting in the opposite direction.
The breadth-first scheduling doesn't have an obvious fix. When we have a more sophisticated kernel scheduler, we might be able to do better, but the small number of vats involved means the kernel wouldn't have a lot of criteria with which it could distinguish between work done for user 1 vs user 2, etc.
We need to replay these vats and add GC instrumentation measure how frequently organic GC is happening, how long it takes, and whether this time was a significant contributor to the wallclock/computron
variance. We might address this through XS changes, but my hunch is that finishing virtualization of the vats involved will make the most difference, to shrink the JS heaps and keep GC fast.
Anything we do to speed things up or do less work will help with problem 1 (the chain being unavailable for 35 minutes). Users care more about that.
Anything we can do to make execution more uniform (and make the per-block computron limit more effective) will help with problem 2 (the handful of long blocks, causing voting/consensus/stability problems). Validators care more about that.
The breadth-first scheduling doesn't have an obvious fix. When we have a more sophisticated kernel scheduler, we might be able to do better, but the small number of vats involved means the kernel wouldn't have a lot of criteria with which it could distinguish between work done for user 1 vs user 2, etc.
I believe @dtribble's suggestion to push new work generated by a vat on the top of the runqueue instead of the bottom would have effectively prevented this breadth-first issue.
historical note: it looks like we knew about #6652 as far back as May 2022
We just didn't schedule that work for the release.
We're investigating a slowdown of the mainnet chain that happend around block 7709170 , about 10:30am PST on monday 05-dec-2022. Several validators fell behind during this time, with one block being delayed by over two minutes.
Preliminary investigating suggests that a governance proposal passed and became active about this time, causing a large chunk of JS to be executed via the "Big JS Hammer" (core-bootstrap
eval
) upgrade pathway. This can be expected to take non-trivial time to complete, however the work should have been broken up into multiple blocks. It's possible that the computron limit was not being enforced correctly, allowing the block to take longer than it should have.We'll collect information and findings in this ticket.