Agoric / agoric-sdk

monorepo for the Agoric Javascript smart contract platform
Apache License 2.0
327 stars 208 forks source link

investigating mainnet slowdown around block 7709170 (05-dec-2022) #6625

Open warner opened 1 year ago

warner commented 1 year ago

We're investigating a slowdown of the mainnet chain that happend around block 7709170 , about 10:30am PST on monday 05-dec-2022. Several validators fell behind during this time, with one block being delayed by over two minutes.

Preliminary investigating suggests that a governance proposal passed and became active about this time, causing a large chunk of JS to be executed via the "Big JS Hammer" (core-bootstrap eval) upgrade pathway. This can be expected to take non-trivial time to complete, however the work should have been broken up into multiple blocks. It's possible that the computron limit was not being enforced correctly, allowing the block to take longer than it should have.

We'll collect information and findings in this ticket.

warner commented 1 year ago

Here's the output of packages/SwingSet/misc-tools/monitor-slog-block-time.py on the blocks immediately before and after the problem:

- block  blockTime    lag  -> cranks(avg) computrons  swingset(avg)  +  cosmos = proc% (avg) [sigs2/3/tot] rounds proposer
  7709155    6( 6.5)   6.494 ->    6(  3.0)     188192  0.047( 0.071)  +   0.007 =   0% ( 1.0) [  0/  0/  0] r0 ()
  7709156    6( 6.5)   6.829 ->    0(  2.6)          0  0.003( 0.069)  +   0.015 =   0% ( 1.0) [  0/  0/  0] r0 ()
  7709157    6( 6.5)   6.730 ->    0(  2.6)          0  0.003( 0.069)  +   0.005 =   0% ( 1.0) [  0/  0/  0] r0 ()
  7709158    6( 6.5)   6.707 ->    0(  2.6)          0  0.002( 0.069)  +   0.003 =   0% ( 1.0) [  0/  0/  0] r0 ()
  7709159    6( 6.5)   6.698 ->    0(  2.4)          0  0.003( 0.067)  +   0.007 =   0% ( 1.0) [  0/  0/  0] r0 ()
  7709160    6( 6.2)   6.680 ->    6(  2.6)     175502  0.047( 0.069)  +   0.007 =   0% ( 1.0) [  0/  0/  0] r0 ()
  7709161    6( 6.2)   6.706 ->    6(  3.0)     188192  0.045( 0.071)  +   0.007 =   0% ( 1.0) [  0/  0/  0] r0 ()
  7709162    6( 6.2)   6.697 ->    0(  3.0)          0  0.003( 0.071)  +   0.003 =   0% ( 1.0) [  0/  0/  0] r0 ()
  7709163    6( 6.2)  11.264 ->    0(  2.6)          0  0.002( 0.052)  +   0.003 =   0% ( 0.9) [  0/  0/  0] r0 ()
  7709164   10( 6.2)   7.137 ->    0(  2.6)          0  0.002( 0.052)  +   0.006 =   0% ( 0.9) [  0/  0/  0] r0 ()
  7709165    6( 6.2)   7.069 ->  950( 50.1)  239910012 20.942( 1.099)  +   2.592 =  79% ( 4.9) [  0/  0/  0] r0 ()
  7709166    6( 6.2)  26.562 -> 1171(108.7)   65050750 31.495( 2.674)  +   0.021 =  94% ( 9.6) [  0/  0/  0] r0 ()
  7709167   25( 7.2)  37.407 -> 1022(159.4)   65101294 19.384( 3.641)  +   0.006 =  81% (13.6) [  0/  0/  0] r0 ()
  7709168   36( 8.7)  22.958 -> 1321(225.5)   65111031 28.176( 5.050)  +   0.003 =  92% (18.2) [  0/  0/  0] r0 ()
  7709169   22( 9.5)  35.419 -> 1235(287.2)   65027410 122.143(11.157)  +   0.033 =  95% (23.0) [  0/  0/  0] r0 ()
  7709170   34(10.9)  143.230 ->  652(319.9)   65181729 11.763(11.745)  +   0.197 =  37% (24.9) [  0/  0/  0] r0 ()
  7709171   142(17.7)  17.811 ->  654(351.1)   65342884 74.501(15.428)  +   0.046 =  94% (29.0) [  0/  0/  0] r0 ()
  7709172   17(18.2)  80.691 ->  654(383.9)   65342884 24.169(16.637)  +   0.023 =  81% (33.1) [  0/  0/  0] r0 ()
  7709173   80(21.9)  29.061 ->  684(418.1)   65276501 29.048(18.089)  +   0.006 =  87% (37.4) [  0/  0/  0] r0 ()
  7709174   28(23.0)  31.720 ->  710(453.6)   65053869 54.125(20.795)  +   0.005 =  97% (42.3) [  0/  0/  0] r0 ()
- block  blockTime    lag  -> cranks(avg) computrons  swingset(avg)  +  cosmos = proc% (avg) [sigs2/3/tot] rounds proposer
  7709175   31(24.2)  66.002 ->  710(488.8)   65024668 29.351(22.260)  +   0.075 =  72% (45.9) [  0/  0/  0] r0 ()
  7709176   65(27.2)  34.031 ->  710(524.2)   65335032 55.021(25.011)  +   0.006 =  93% (50.6) [  0/  0/  0] r0 ()
  7709177   33(28.6)  66.113 ->  740(561.2)   65120336 14.889(25.756)  +   0.008 =  59% (53.5) [  0/  0/  0] r0 ()
  7709178   65(31.5)  22.919 ->  734(598.0)   65324306 20.057(26.758)  +   0.021 =  74% (57.2) [  0/  0/  0] r0 ()
  7709179   22(32.3)  25.699 ->  736(634.8)   65077924 30.291(28.273)  +   0.011 =  86% (61.6) [  0/  0/  0] r0 ()
  7709180   25(33.2)  34.337 ->  651(667.0)   65132172 25.147(29.528)  +   0.004 =  88% (65.9) [  0/  0/  0] r0 ()
  7709181   33(34.6)  33.710 ->  603(696.9)   65353411 27.246(30.888)  +   0.005 =  79% (69.8) [  0/  0/  0] r0 ()
  7709182   33(36.0)  45.166 ->  603(727.0)   65348874 25.336(32.154)  +   0.057 =  59% (72.8) [  0/  0/  0] r0 ()
  7709183   44(37.9)  38.478 ->  603(757.1)   65351449 27.391(33.524)  +   0.029 =  69% (76.3) [  0/  0/  0] r0 ()
  7709184   37(39.2)  33.069 ->  603(787.3)   65358283 29.154(34.981)  +   0.009 =  87% (80.7) [  0/  0/  0] r0 ()
  7709185   32(40.5)  36.877 ->  603(770.0)   65361309 28.131(35.341)  +   0.012 =  80% (80.7) [  0/  0/  0] r0 ()
  7709186   36(42.0)  32.898 -> 1070(764.9)   65035483 24.044(34.968)  +   0.005 =  86% (80.3) [  0/  0/  0] r0 ()
  7709187   32(42.4)  26.881 -> 1671(797.4)   65080484 32.178(35.608)  +   0.016 =  94% (81.0) [  0/  0/  0] r0 ()
  7709188   26(41.9)  39.221 -> 1671(814.9)   65080484 18.501(35.124)  +   0.050 =  75% (80.1) [  0/  0/  0] r0 ()
  7709189   38(42.6)  46.131 ->  499(778.0)   65642797 36.767(30.855)  +   0.011 =  58% (78.2) [  0/  0/  0] r0 ()
  7709190   45(43.2)  56.552 ->  485(769.7)   65598691 18.171(31.176)  +   0.005 =  49% (78.8) [  0/  0/  0] r0 ()
  7709191   55(38.9)  31.413 ->  484(761.2)   65591264 14.048(28.153)  +   0.022 =  54% (76.8) [  0/  0/  0] r0 ()
  7709192   30(39.5)  21.235 ->  484(752.7)   65587795 12.058(27.548)  +   0.006 =  67% (76.1) [  0/  0/  0] r0 ()
  7709193   20(36.5)  14.711 ->  484(742.7)   65589535 12.031(26.697)  +   0.003 =  89% (76.2) [  0/  0/  0] r0 ()
  7709194   14(35.8)  18.415 ->  484(731.4)   65590781 10.169(24.499)  +   0.059 =  64% (74.6) [  0/  0/  0] r0 ()
- block  blockTime    lag  -> cranks(avg) computrons  swingset(avg)  +  cosmos = proc% (avg) [sigs2/3/tot] rounds proposer
  7709195   12(34.9)  27.489 ->  484(720.1)   65592283 20.771(24.070)  +   0.028 =  65% (74.2) [  0/  0/  0] r0 ()
  7709196   14(32.3)  45.895 ->  485(708.9)   65600037 11.029(21.870)  +   0.022 =  48% (72.0) [  0/  0/  0] r0 ()

blocks continue to be full (about 65M computrons each) until block 7709257, where the queue is finally drained and we're able to catch up:

- block  blockTime    lag  -> cranks(avg) computrons  swingset(avg)  +  cosmos = proc% (avg) [sigs2/3/tot] rounds proposer
  7709197   39(19.8)  22.831 ->  484(363.2)   65596431  9.008(12.744)  +   0.005 =  64% (60.9) [ 68/  0/ 68] r3 (BB5E2C)
  7709198   22(20.2)  15.073 ->  561(402.8)   65563866 21.843(14.564)  +   0.003 =  80% (64.9) [ 72/  0/ 72] r2 (BB5E2C)
  7709199   14(19.2)  24.778 ->  619(438.8)   65005698  8.583(13.567)  +   0.003 =  82% (67.8) [ 65/  0/ 65] r1 (BB5E2C)
  7709200   19(19.1)  16.110 ->  617(464.3)   65285790  8.785(12.884)  +   0.006 =  83% (70.0) [ 73/  0/ 73] r0 (2A3AC9)
  7709201   10(18.0)  20.463 ->  619(483.6)   65005545  8.849(12.380)  +   0.027 =  61% (68.9) [ 71/  0/ 71] r0 (CFC45F)
  7709202   15(17.7)  15.666 ->  618(498.6)   65290598  8.964(12.000)  +   0.003 =  87% (71.0) [ 71/  0/ 71] r1 (CFC45F)
  7709203   10(16.9)  15.298 ->  619(510.6)   65005683  8.846(11.685)  +   0.022 =  93% (73.2) [ 73/  0/ 73] r0 (E29E61)
  7709204   10(16.3)  20.500 ->  617(520.3)   65284344  8.575(11.402)  +   0.015 =  57% (71.7) [ 71/  0/ 71] r0 (23E1B0)
  7709205   14(16.1)  15.608 ->  619(528.5)   65006237  9.291(11.226)  +   0.003 =  94% (73.7) [ 71/  0/ 71] r1 (23E1B0)
  7709206   10(15.6)  29.852 ->  818(550.8)   65119238 25.071(12.291)  +   0.015 =  62% (72.8) [ 72/  0/ 72] r0 (47C2F8)
  7709207   15(15.6)  49.541 -> 1002(583.0)   65108717 28.765(13.468)  +   0.049 =  75% (73.0) [ 67/  0/ 67] r1 (BF9BF3)
  7709208   32(16.7)  46.817 -> 1005(611.1)   65016753 18.499(13.803)  +   0.047 =  97% (74.6) [ 65/  0/ 65] r2 (BF9BF3)
  7709209   33(17.7)  48.654 -> 1006(635.8)   65012879 35.034(15.130)  +   0.011 =  68% (74.2) [ 72/  0/ 72] r1 (996453)
  7709210   28(18.3)  56.596 ->  966(655.2)   65178791 27.960(15.885)  +   0.020 =  96% (75.5) [ 70/  0/ 70] r2 (996453)
  7709211   40(19.5)  45.159 ->  924(670.2)   65002751 10.241(15.571)  +   0.004 =  94% (76.6) [ 70/  0/ 70] r1 (BB5E2C)
  7709212   35(20.3)  21.085 ->  924(683.5)   65004071 10.268(15.292)  +   0.003 =  93% (77.5) [ 71/  0/ 71] r1 (BB5E2C)
  7709213   12(19.9)  40.445 ->  542(676.5)  250284328 44.655(16.760)  +   0.006 =  67% (77.0) [ 73/  0/ 73] r0 (9A4CCB)
- block  blockTime    lag  -> cranks(avg) computrons  swingset(avg)  +  cosmos = proc% (avg) [sigs2/3/tot] rounds proposer
  7709214   15(19.9)  71.403 ->  416(697.2)   65365580  8.509(16.677)  +   0.059 =  86% (78.2) [ 71/  0/ 71] r1 (026653)
  7709215   49(21.8)  39.396 ->  415(693.8)   65358871  7.060(15.992)  +   0.003 =  45% (77.2) [ 76/  0/ 76] r1 (026653)
  7709216    9(21.6)  37.957 ->  415(690.3)   65357798  7.450(15.813)  +   0.004 =  93% (79.4) [ 72/  0/ 72] r0 (3B6773)
  7709217    8(20.0)  37.968 ->  415(686.9)   65358636 20.962(16.410)  +   0.045 =  97% (81.0) [ 76/  0/ 76] r0 (CFC45F)
  7709218    9(19.4)  52.596 ->  408(679.2)   65059232  5.153(15.576)  +   0.008 =  66% (80.3) [ 67/  0/ 67] r0 (E34C31)
  7709219   24(19.9)  42.313 ->  405(668.5)   65404941  4.542(15.374)  +   0.006 =  34% (77.9) [ 75/  0/ 75] r1 (E34C31)
  7709220    6(19.2)  41.316 ->  405(657.9)   65404691  4.611(15.165)  +   0.009 =  91% (78.3) [ 77/  0/ 77] r0 (E14F81)
  7709221    7(19.1)  43.266 ->  405(647.2)   65403452  4.622(14.954)  +   0.005 =  51% (77.8) [ 77/  0/ 77] r0 (27C9BF)
  7709222    6(18.6)  42.273 ->  405(636.5)   65404398  4.466(14.729)  +   0.005 =  92% (78.1) [ 76/  0/ 76] r0 (E29E61)
  7709223    6(18.4)  41.145 ->  940(652.6)   65072377 19.573(15.265)  +   0.003 =  97% (78.3) [ 77/  0/ 77] r0 (2A3AC9)
  7709224    6(18.2)  57.550 -> 1238(683.6)   65004024 12.381(15.456)  +   0.003 =  81% (79.5) [ 71/  0/ 71] r0 (BB5E2C)
  7709225   18(18.4)  56.115 -> 1405(723.0)   65005301 11.064(15.544)  +   0.021 =  72% (78.4) [ 70/  0/ 70] r0 (D97CBE)
  7709226   18(18.8)  52.699 ->  551(709.6)   65016601  6.953(14.638)  +   0.006 =  66% (78.6) [ 78/  0/ 78] r1 (D97CBE)
  7709227   10(18.6)  50.483 ->  336(676.3)   65316156  5.489(13.475)  +   0.017 =  86% (79.2) [ 74/  0/ 74] r0 (A21688)
  7709228    7(17.3)  49.547 ->  336(642.9)   65315525 15.434(13.321)  +   0.005 =  96% (79.2) [ 77/  0/ 77] r0 (A177B0)
  7709229   12(16.2)  55.343 ->  337(609.4)   65321205  5.984(11.869)  +   0.015 =  71% (79.3) [ 61/  0/ 61] r1 (765A78)
  7709230   19(15.8)  42.847 ->  336(577.9)   65315624  5.864(10.764)  +   0.004 =  92% (79.1) [ 69/  0/ 69] r1 (765A78)
  7709231    8(14.2)  41.466 ->  385(551.0)   65191168  4.897(10.497)  +   0.009 =  86% (78.7) [ 82/  0/ 82] r0 (BB89FD)
  7709232    6(12.8)  43.377 ->  390(524.2)   65708481  4.795(10.223)  +   0.005 =  61% (77.1) [ 79/  0/ 79] r0 (0DB33A)
  7709233    6(12.4)  55.678 ->  391(516.7)   65714431  5.134( 8.247)  +   0.029 =  27% (75.1) [ 80/  0/ 80] r0 (CFC45F)
- block  blockTime    lag  -> cranks(avg) computrons  swingset(avg)  +  cosmos = proc% (avg) [sigs2/3/tot] rounds proposer
  7709234    7(12.1)  54.316 ->  391(515.5)   65713231  4.657( 8.055)  +   0.003 =  90% (75.3) [ 71/  0/ 71] r0 (5F329F)
  7709235    6( 9.9)  53.643 ->  390(514.2)   65708086  4.610( 7.932)  +   0.012 =  87% (77.3) [ 80/  1/ 81] r0 (843872)
  7709236   11(10.0)  47.731 ->  943(540.6)   65126968 10.045( 8.062)  +   0.006 =  95% (77.4) [ 67/  0/ 67] r1 (843872)
  7709237    8(10.0)  50.380 -> 1287(584.2)   65398089 22.792( 8.153)  +   0.004 =  97% (77.4) [ 69/  0/ 69] r0 (BB5E2C)
  7709238    8( 9.9)  73.646 -> 1642(645.9)   65050770 10.633( 8.427)  +   0.016 =  55% (76.9) [ 75/  0/ 75] r0 (C22CB0)
  7709239   20( 9.8)  65.480 ->  336(642.5)   65445662  6.789( 8.540)  +   0.057 =  85% (79.4) [ 73/  0/ 73] r0 (06AB35)
  7709240   15(10.2)  58.000 ->  336(639.0)   65314320  5.479( 8.583)  +   0.004 =  89% (79.3) [ 75/  0/ 75] r1 (06AB35)
  7709241    7(10.2)  57.175 ->  336(635.5)   65313635  6.510( 8.678)  +   0.030 =  90% (81.3) [ 75/  0/ 75] r0 (9E6A8A)
  7709242    6(10.2)  58.190 ->  336(632.1)   65313202  6.688( 8.789)  +   0.007 =  93% (81.3) [ 75/  0/ 75] r0 (E29E61)
  7709243    6(10.2)  61.644 ->  342(602.2)  278147375 21.960( 8.908)  +   0.005 =  88% (80.9) [ 75/  0/ 75] r0 (A9F209)
  7709244    7(10.2)  77.019 ->  386(559.6)   65570782  4.479( 8.513)  +   0.029 =  91% (81.4) [ 74/  0/ 74] r0 (2A3AC9)
  7709245   21(10.4)  71.150 ->  390(508.9)   65708256  4.443( 8.182)  +   0.023 =  29% (79.2) [ 74/  0/ 74] r1 (2A3AC9)
  7709246    6( 9.8)  70.097 ->  390(500.8)   65707574  5.156( 8.092)  +   0.016 =  91% (80.5) [ 75/  1/ 76] r0 (BCF291)
  7709247    7( 9.7)  73.149 ->  390(503.5)   65707249  4.486( 8.042)  +   0.005 =  47% (78.5) [ 76/  0/ 76] r0 (CFC45F)
  7709248   11( 9.8)  67.156 ->  391(506.2)   65714344  4.719( 7.506)  +   0.005 =  90% (78.2) [ 79/  0/ 79] r1 (CFC45F)
  7709249    6( 9.6)  78.223 -> 1131(546.0)   65078323 10.731( 7.743)  +   0.003 =  46% (77.0) [ 78/  0/ 78] r0 (BB5E2C)
  7709250    6( 8.9)  86.226 -> 1339(596.1)   65038299 11.131( 8.007)  +   0.003 =  77% (76.2) [ 70/  0/ 70] r0 (8BA38C)
  7709251   11( 9.1)  87.041 -> 1554(654.5)   65029239 22.834( 8.904)  +   0.005 =  97% (76.7) [ 73/  0/ 73] r0 (0DB33A)
  7709252   10( 9.2)  100.462 -> 1399(705.0)   65051279 12.786( 9.303)  +   0.034 =  95% (78.4) [ 73/  0/ 73] r0 (0FED4E)
  7709253   22(10.1)  91.695 -> 1114(741.1)   65069571 20.106(10.052)  +   0.017 =  97% (82.0) [ 73/  0/ 73] r1 (DC4CCA)
- block  blockTime    lag  -> cranks(avg) computrons  swingset(avg)  +  cosmos = proc% (avg) [sigs2/3/tot] rounds proposer
  7709254   16(10.5)  96.330 -> 1409(792.0)   65086102 11.202(10.379)  +   0.017 =  95% (82.2) [ 73/  0/ 73] r1 (1CE894)
  7709255   21(11.2)  87.210 -> 1485(846.8)   65039781 12.390(10.768)  +   0.027 =  94% (82.6) [ 74/  0/ 74] r1 (1CE894)
  7709256    9(11.2)  91.243 ->  941(846.7)   61174777 22.553(11.393)  +   0.003 =  97% (82.7) [ 74/  0/ 74] r0 (4EBC59)
  7709257   11(11.3)  103.351 ->    6(782.6)     219917  0.040(10.256)  +   0.025 =  10% (78.3) [ 73/  0/ 73] r0 (E29E61)
  7709258   39(12.8)  64.821 ->    0(700.5)          0  0.002( 9.724)  +   0.005 =   1% (75.6) [ 80/  2/ 82] r3 (32B4F1)
  7709259    6(12.2)  59.245 ->    0(683.8)          0  0.002( 9.385)  +   0.005 =   1% (71.4) [ 82/  0/ 82] r0 (E29E61)
  7709260   11(11.9)  48.725 ->    6(667.2)     188192  0.037( 9.113)  +   0.008 =   8% (67.4) [ 82/  0/ 82] r1 (E29E61)
  7709261    6(11.9)  43.447 ->    0(650.5)          0  0.002( 8.787)  +   0.003 =   0% (62.9) [ 82/  1/ 83] r0 (BB5E2C)
  7709262   11(12.2)  32.892 ->    0(633.6)          0  0.002( 8.453)  +   0.003 =   1% (58.3) [ 83/  1/ 84] r1 (BB5E2C)
  7709263    6(12.2)  27.301 ->    0(616.5)          0  0.002( 7.355)  +   0.009 =   2% (54.0) [ 84/  0/ 84] r0 (CFC45F)
  7709264    6(12.1)  21.721 ->    0(597.2)          0  0.002( 7.131)  +   0.003 =   1% (49.5) [ 85/  0/ 85] r0 (843872)
  7709265    6(11.3)  16.118 ->    6(578.0)     175502  0.034( 6.911)  +   0.015 =  11% (48.6) [ 85/  0/ 85] r0 (A21688)
  7709266    6(11.3)  10.625 ->    0(558.5)          0  0.002( 6.653)  +   0.003 =   1% (44.1) [ 85/  1/ 86] r0 (026653)
  7709267    7(11.3)  11.318 ->    6(539.4)     194537  0.337( 6.446)  +   0.019 =   4% (41.9) [ 86/  0/ 86] r0 (2A3AC9)
  7709268   10(11.3)   7.278 ->    6(520.1)     175502  0.042( 6.212)  +   0.004 =   0% (37.4) [ 86/  0/ 86] r1 (2A3AC9)
  7709269    6(11.3)  11.994 ->    6(463.9)     188192  0.043( 5.678)  +   0.023 =   0% (35.1) [ 87/  0/ 87] r0 (0CCD5C)
  7709270   11(11.6)   6.819 ->    6(397.2)     175502  0.040( 5.123)  +   0.007 =   0% (31.3) [ 87/  0/ 87] r1 (0CCD5C)
  7709271    6(11.3)   6.994 ->    6(319.8)     175502  0.043( 3.983)  +   0.005 =   0% (26.5) [ 87/  0/ 87] r0 (0DB33A)
  7709272    6(11.1)   7.102 ->    0(249.8)          0  0.003( 3.344)  +   0.004 =   0% (21.7) [ 87/  0/ 87] r0 (5F329F)
  7709273    6(10.3)   6.982 ->    0(194.2)          0  0.003( 2.339)  +   0.006 =   0% (16.8) [ 87/  1/ 88] r0 (23E1B0)
- block  blockTime    lag  -> cranks(avg) computrons  swingset(avg)  +  cosmos = proc% (avg) [sigs2/3/tot] rounds proposer
  7709274    6( 9.8)   7.113 ->    0(123.7)          0  0.006( 1.779)  +   0.003 =   0% (12.0) [ 88/  1/ 89] r0 (BB5E2C)
  7709275    6( 9.1)   7.395 ->    0( 49.5)          0  0.002( 1.160)  +   0.003 =   0% ( 7.3) [ 89/  0/ 89] r0 (3B4946)
  7709276    6( 8.9)   7.408 ->    6(  2.7)     181847  0.045( 0.035)  +   0.006 =   0% ( 2.5) [ 89/  0/ 89] r0 (A177B0)
  7709277    6( 8.7)  12.279 ->    6(  2.7)     175502  0.076( 0.036)  +   0.023 =   0% ( 2.0) [ 89/  0/ 89] r0 (CFC45F)

We'll start by examining block 7709165 (which recorded 240M computrons), and 7709213 (250Mc), and 7709243 (280Mc). It's possible that the Big JS Hammer evaluation included a single delivery that took a very long time to complete (not yielding it's time slice until well beyond the per-block limit of 65Mc), followed by a long run-out of additional work (which was properly segmented into separate blocks). And maybe this happened three times.

warner commented 1 year ago

We've identified three potential issues:

dckc commented 1 year ago
  • I'm worried that we're somehow iterating through all registered/provisioned vat-wallet accounts and notifying everybody that they have new zero-balance Purses available

Yes... each smart wallet is waiting for new vbank assets such as DAI_axl and DAI_grv:

https://github.com/Agoric/agoric-sdk/blob/2c812d22161cd297587979b262eab6e2cc76e23d/packages/smart-wallet/src/smartWallet.js#L523-L524

notifiers / subscriptions adds a new wrinkle to the "only do work proportional to the size of the incoming message" design constraint.

cc @michaelfig @turadg @erights

dckc commented 1 year ago

voting period ended 2022-12-06T18:31:37Z

proposal 17 on NG has a pretty clean view of the code that started running. Or try agd --node=https://main.rpc.agoric.net:443 query gov proposal 17.

6 new vats: 3 ea. for DAI_axl, DAI_grv

  1. startPSM.makeAnchorAsset(noMinting, { options: { anchorOptions } }) starts a mintHolder
  2. startPSM.startPSM(permittedPowers, config) starts a contractGovernor
  3. " starts a psm

2 new purses for each of ~400 wallets

analytics.inter.trade shows over 400 smart wallets provisioned. Each of them was waiting to hear about new vbank assets:

https://github.com/Agoric/agoric-sdk/blob/2c812d22161cd297587979b262eab6e2cc76e23d/packages/smart-wallet/src/smartWallet.js#L523-L524

block times spike but eventually recover

nifty chart of block times and missing validators (ack: polkachu)

image

warner commented 1 year ago

The Munin graphs on a follower node show the state-size growth when the proposal was executed:

6 new XS heap snapshots, one per new vat:

agd_swingset_xs_snapshots-day

but the new vats are small, so the total size consumed by heap snapshots only grew by about 20MB (compressed, stored as plain files):

agd_swingset_xs_size-day

The initial/startup transcripts for those new vats consumed about 150MB (uncompressed, stored in SQLite) (note: the transcript space normally grows by about 40MB per day)

agd_swingset_streamstore_size-day

The kvStore size grew by 12MB: agd_swingset_kvstore_size-day

(for comparison, on a node that prunes rather aggressively, the cosmos-side state directory normally grows by about 780MB/day)

warner commented 1 year ago

The blocks that included a startVat consumed a lot of computrons (which the runPolicy didn't correctly account for), but they didn't actually require a lot of runtime.

The first block was 7709165, which handled the proposal execution. It consumed 240M computrons ("240Mc"), and 20.942s of swingset execution time. The two startVats inside it each took 88Mc and 1.8s. The runPolicy bug meant the startVats only counted 300kc towards the block's 65Mc limit, so the remaining 64.4Mc were spent on other deliveries (which required 17.34s). That's not great but not too bad, leading to a block time of maybe 25s-27s (which, because of the way blockTime is assigned to a later block, appears in my output two blocks later, as the 7709167 25 line)

- block  blockTime    lag  -> cranks(avg) computrons  swingset(avg)  +  cosmos = proc% (avg) [sigs2/3/tot] rounds proposer
  7709165    6( 6.2)   7.069 ->  950( 50.1)  239910012 20.942( 1.099)  +   2.592 =  79% ( 4.9) [  0/  0/  0] r0 ()

  7709166    6( 6.2)  26.562 -> 1171(108.7)   65050750 31.495( 2.674)  +   0.021 =  94% ( 9.6) [  0/  0/  0] r0 ()
  7709167   25( 7.2)  37.407 -> 1022(159.4)   65101294 19.384( 3.641)  +   0.006 =  81% (13.6) [  0/  0/  0] r0 ()

This is followed by a series of blocks which each use the full 65Mc. The amount of wallclock time they take varies, however, from as little as 4s, to a lot of 20-30s blocks, a pair of 55s, and a single 122s outlier.

  7709168   36( 8.7)  22.958 -> 1321(225.5)   65111031 28.176( 5.050)  +   0.003 =  92% (18.2) [  0/  0/  0] r0 ()
  7709169   22( 9.5)  35.419 -> 1235(287.2)   65027410 122.143(11.157)  +   0.033 =  95% (23.0) [  0/  0/  0] r0 ()

Each block is using the expected 65M-ish computrons, but some kinds of deliveries take more wallclock time than others (we always hoped the computrons/second ratio would be roughly constant; this is what happens when it is not, and when our mix of delivery types is not random enough to even things out). So some blocks take a lot more time than others, and of course we can't just end a block after five seconds without losing determinism.

I'm still digging into the longer blocks. I think we're seeing the same "long time to serialize" problem that we observed in an earlier testnet, which we thought we fixed by updating balances less frequently. But if we're updating a whole bunch of wallet balances at the same time, we're going to be performing the same kind of serialization in a row, which exacerbates the uneven computrons/second ratio problem.

p.s. see

warner commented 1 year ago

BTW, the total proposal execution (which introduced two denoms) required:

We're guessing this is proportional to (number of denoms introduced) x (number of provisioned wallet accounts). So one (slight) mitigation is to only add one denom at a time, which would give the chain more time to recover before taking on the second half.

With 410 wallets provisioned, the execution requirements for a single denom will probably be:

per provisioned wallet.

mhofman commented 1 year ago

FYI, I've confirmed a slowdown of XS workers over time, which probably contributed to this issue. This specific degradation should not occur and we're investigating it. That said it's not a cause in this case, and probably just exacerbated the core problems identified above.

dckc commented 1 year ago

We've identified three potential issues:

I recall another from our discussions:

  1. The per-block limit of 65Mc was designed to represent 30 seconds of wall-clock time, but in one case it lasted 122.143 sec, which caused something of a storm in the consensus protocol.
    • While it's somewhat unsurprising that computrons don't faithfully represent wall-clock time, the hope that some random variation would smoothing things out didn't pan out.
    • Do we have significant gaps in our cost accounting? (chainStorage writes come to mind)
warner commented 1 year ago

We had a postmortem meeting today, some of the conclusions:

The slowdown was caused by a multiplication of several factors, all of which are worth investigating and fixing:

To address the per-wallet work, we need to make things more lazy, and/or look for ways to just do less work per vat.

To address the XS metering precision, we should sort the cranks by their wallclock / computrons ratio, then look carefully at the biggest vs the smallest, and try to figure out which low-level JS operations were involved in each. Perhaps we could run a lot of the same-sized deliveries in a loop, while we point a profiler at the worker, and build up a flamegraph of which XS .c functions get called during each. If we're lucky there may be a few specific XS functions that account for lots of wallclock but not a lot of computrons, and we can change their metering behavior to increase the computron consumption. Likewise there might be code that increments the meter a lot but actually runs very quickly, which could use adjusting in the opposite direction.

The breadth-first scheduling doesn't have an obvious fix. When we have a more sophisticated kernel scheduler, we might be able to do better, but the small number of vats involved means the kernel wouldn't have a lot of criteria with which it could distinguish between work done for user 1 vs user 2, etc.

We need to replay these vats and add GC instrumentation measure how frequently organic GC is happening, how long it takes, and whether this time was a significant contributor to the wallclock/computron variance. We might address this through XS changes, but my hunch is that finishing virtualization of the vats involved will make the most difference, to shrink the JS heaps and keep GC fast.

Anything we do to speed things up or do less work will help with problem 1 (the chain being unavailable for 35 minutes). Users care more about that.

Anything we can do to make execution more uniform (and make the per-block computron limit more effective) will help with problem 2 (the handful of long blocks, causing voting/consensus/stability problems). Validators care more about that.

mhofman commented 1 year ago

The breadth-first scheduling doesn't have an obvious fix. When we have a more sophisticated kernel scheduler, we might be able to do better, but the small number of vats involved means the kernel wouldn't have a lot of criteria with which it could distinguish between work done for user 1 vs user 2, etc.

I believe @dtribble's suggestion to push new work generated by a vat on the top of the runqueue instead of the bottom would have effectively prevented this breadth-first issue.

dckc commented 1 year ago

historical note: it looks like we knew about #6652 as far back as May 2022

We just didn't schedule that work for the release.