FactomProject / factomd

Factom Daemon
https://www.factomprotocol.org/

[Wax] Node doesn't sync #1058

Closed WhoSoup closed 3 years ago

WhoSoup commented 3 years ago

This has taken me a while to confirm because syncing is so incredibly slow, but it appears that nodes will not fully sync using the FD-1225_wax_rollup branch. This is roughly the third time I've tried: the first time I assumed it was a fluke, the second time I thought it was related to my EntrySync rework, but now it has happened again in the rollup branch.

The node reaches a certain height (first pass was at 99%: 81861 / 82065, exact height varies) and then seems to stall with the console spammed full of p2p overflow messages:

 last message type: {"Message":"EgF0lgDbFgBO8SMtWhJuj12F13mwbL5bxmloH+K4qGKHGHimFb2grQDb0HdoRz6aGb9CbLhBpa8ZZdCAftKi+D3wqq4pOdFchQAAPKhrHLgRWqDRX1zE5Ym5D8tWRwDylz2Rb8CxUP+fm3w3UQSqrLL7crb8fM3+bT/FbJ6o9zSovI/Q8/5PsGcb9AFl9nDCxw3zw+EdepIQEaRRyg8J7Q1c+Gvgxtq2gXVrG/F8c10eyr4GKxQIJ0rzsyO86xYc+qr+wtB2IQWYN8oYg35dFtl2JY/M51HX4u9CFjz8fXbweMbSE6ebajw0vpQbYy7ufObxgNHypEdGHtUQaW01BZgckByoWDcY00V5NsGg6JvOWT1bgh1B3zHIXhRs+YQe2jF9j2Fero3tQLX8hTZnB4SyJ1OCqeiUH2YO/BRvnH4YOXzPI9Otm/AlRmrB+MzFIxdYkDkKdDpOlAFKfV8znbeKvenouUM7E2inALsH9oOl+NxI30VEn3NIvte8UlQu9cIXTznHQAqrbdDsVm+ZfCVO7UgiKqUzr4aDTYL+WLJ2LVSLsmMmKvMXGPaUvWVlPYsySf7dbcbh9usc8H3AYoca39QMOfXlWjzDinAd/s3qC/+XSh5Uus8+30TFXwLSRwRFhEfLezrq3yveknv3Ml2ePUlwW2sxPl6ecVlRT9g6HiSkeZso0rIgQJ+PI8D9kjufdWa2Kf+dazjISoZN00DnyWa/Q9p1lDcgAR7lBfo94PqVdrTerq+k8lwbZMi64lV+7tEziGtrnyG0d+eAkAfDKhLq2v7pfkuub88XbLKYQ6rR3r93y+ZNmNm0BlaIWjH1oUPvGubUYjL4XQjkUDEBuutBSqZY5fQ7TiwPW4f2zoSGf+xWqSZbuQy2etJxDlDp8+kZtf3W8LTlVQgzKcPIMMhTSyjUhIH8Uf/g8y6eDdm/23qdfEictgLDxKXAoRUQD6g1S0r3Yrxq/GGTtbHovKGILo0oLfEZE+oNXZc7k8k10uOxPK833MW9s0eeJi0vYDhAIt9Kvwy16xe+3yvSjNwT1BjV5E96lD9RLQeopg317ZwqaipXdl2IfM6f1Zwx+j8idjv3jJELDOu7K+GzfmhRMB8LtFVTlTahW7dPNQUAK4pHZ988i5wxb25rhIpAvKzpYLO5B9CiPD6MQ1tNwChFnKi9sC+Wee+36z+MTa5iKU4dUPGrJeMxlIB3YU2vx3W/lS8GbRhEzNQzX/0kWPZCfV8h9/aFLZrqT1BvFx8Lbogv+tdSkxX3sl7P3TShSFKyqWWDKqHAQzn6Sl4cUQPW6qYzL0WiFRVNTE4szewx8wReii3zw8IrtuFP9N94jKo69mNZXonMqb7E57b8t6peNG20sELTOG/3g37qaHl7WiP4FPAwiZyExTaiITlU7FDkAAGtQaPM3ua6Er87Znl8/Wxe3VSaafgyV+d3fSb1cep6nsZejw==","PeerHash":"xxx:8110 38bab145","AppHash":"","AppType":""}  component=networking package=engine subpack=p2p-proxy
WARN[166592] nonBlockingChanSend() - DROPPING MESSAGES. Channel is over 90 percent full!
 channel len:
 4751
 90 percent:
 4750
 last message type: {"Message":"EgF0lgDbFwBjDtvIhlD1ZZk+Iw/v2pRPDlTDG2qzR/O+6DMlzMsI5wDellGWrpL6WwEAP82w5b/bRdeUrtOwZpzpRse7g7p8lQAA6GiQqzle40nAjBVO14uoVYB5Cr1sy5A/ovPnd68H6bLJVYKVEeK/O41/LCGbMvdhtZQpr6iqTO909NHuQwpAOUk4L3cN4tbwds9lKhyXwNo6Cto/mWYbLy2v0f04pa2yKnTWwySIP/hfS+NicBrkJF42w9PFx6IO4LvoFuf1nzySNht1xNc5gFZ8NoDRAS6dBcgVRyQxoSTK955HvyO0Tn3fSPV43R9Wse+noionvz4H4hUSAlZ7kUOIJv1K+7VMop3LUu9o1P3vR+ujp1dkG0D8q6G6Yqy+BTuXVrqC+gnymSVQSYAvv3y0SvMQZ2aUN0uFJLYRElaZqDQAMjsaf26114BTcnCc2g1SugKiarlyprMoFdcgCTMCu4BvedjVbkFAjM16tbijvka6KweU6IIWDT+hofmftd4LmkdG8LKDH0V5FT13E1wtriFl00eyYtmncRpjcY3k2/8aIvKgIkb5Rd+5XLqQpk1AlRJt2Fqrvgnokx++ZYNI42DrhUgYBR0KWNjYwCAD6k2RTlKS4QcvmYRMwOsuRy9VAZTx0vHY6Kmn5fnTlHdijYqWHlJPBdDzYyFyMINXLcifNfotkJV6u/qwDT4Ir6cDDE5W31JK+WQC2B5jB4L8aKKyavku+DzXFHqnmsLu4Zxr1ioWafi/FB3xujwmh0HHqRgXELDFxeoXiHI2opPixVclFvnvvlxpMq27Er1cY3Ac7AD49kTPabEZSaMSuo1mVnZyfU0EW1ra8OvMyYGb+/R6RXsdZpwYTpy0OMft0sxpWr94q0eIvvNGUW968cpm3jNtAT8AEH7WPgIKagVFFp6jJAD5D8r1bdIlaGMRiHKcLrTFBPp8Whum8dVTV2NbWus+gOlb3GIW+cM29YW+NPPR7K1qymuM7YKkZbEpmqAx6UtfGNqcE469JkRfhNGSBovTOI5HOf9GixC3+mdH6zCmeQ16wURY/hnO0y2mipfQWyN0Nkis1FW5daif1I0AhoDqvKn1eCNQgmjJ+g76xZm8YgJgxxxVxHf6bcvT8l3AqNgx0YRJf+IN67qGagYc/GKVNFmATtWqTrB5s3wRerD4Rl4yo+x4HaSbtp+7bSajm6KJWn/s9RynYph/iW5BRsVgjB2i0qq7Q65IHQNwi86cFSiDwQkt0J8cxtw1Sb6KGVd7XA/3tZBHmBQFHMvaBvuWuqkW8JEFf6KOCC/Wvg9CLRaLTbhWcbcR+0dff5Jl95N/D43Dv8LxxLxfyHzQdYCn1nLORNclVGEbmk9EXSCckeiaRGbaJaeNaWQbVK17cTm2ieulxGsCdeIKhHvySSHx3sCtJ6o6HUqPB16ALEoPoivOu4pIhOHLP5EscmoOBVN7YQ==","PeerHash":"xxx:8110 38bab145","AppHash":"","AppType":""}  component=networking package=engine subpack=p2p-proxy
WARN[166592] nonBlockingChanSend() - DROPPING MESSAGES. Channel is over 90 percent full!
 channel len:
 4751
 90 percent:
 4750
 last message type: {"Message":"EgF0lgDbRgCp3NkJiVIdy62LkLvH9L6RaX3mkHpJ5PZeORI0nE437ADwwveJ0MNi4e371boZ0szod+pCLlJvfmLJ2soMpjIUDAAACwMyBOQSudyxiggsdfz97o5XUgpR4uFvJ3XRma5hNz5iDX0My2IyH+e+Trw98EQ70O3kEL5lGxGjx5cztupmaeURcT6WHZKVr2Cmeh54NiuKz19PWEIiGqqJokFF1BK6+JF8uXxnmrTQx6quVkP8hap0mZ8Qdh7lN4YRFsrG4oPpy4bLuFVeibe/MTzr36cV6Y8Dr1PyPIM+Q1SushxJ5lDSPQFfbO7RXhKGMI1rkAc4yViy/D4VRirMfBfzvGwGVUZeiW4ggS/VqttmLJUAtCru2TUt37aLbgOmYm96OdBt9QX2Q747RnSLebMGPTop6OO6NkOG4N+e2yudJK9If83ZoNgUQUEWmxdYc3Hg/5Z90F1TniNK8AEfdrugQY9OJfEzGodL0cjxnvZ8ZcSueB2xkdt+8SKUxHPw8io5yBPxlhrMRGecT8YtXOE04hN4oFkZ+5KQr0omTeoNkGZu4799SZsng5vxjRZuoP7kG6PKGDsyoITqdH4UdLqjSwzcawGdL5/IWPPwOM5aD6OBf7GDSD5s/FUpvy2jCqIsedH4t49wAc5BqLw8vmOiV4sSOHfc4JJbXB10a8qSikQzrfdEV4lUhJm4Fq6uhWlLuB3noYfj3mqfWABcesE4xHQqfeKnCCiGQiMHNkXx9mWaMJMio9nR+NbHNCb6FdsfCWIU9QnOfDdNwsxE1OMSBLWq6OV9rytPbTmhlPECWieSV91Rwxb/batU7HDgnJ9Ji4FJnbwpsYqlHWiDH+OsZ1JdApQORXhFlY/X4N4sWvwydYfwlPJgXBvY+tmqC2F3ONFRD33BD98NrGjrPQSHwIy+tBneU7WJnpFfcJ/gJEWsxm+meErvTHneP+XIuYYpyt9oOBerRMav0T6mJ+aF0YP5XvEJy6M2OUV8n9DT/9ixmRC9oqVM+XlOY9Zoq2wYNprZLBe+Wv917O3h9H3RcpB7hC5IZhJ25moWakzqdvsJAQZNYOrKyWmdtOPcbkf4FomONsL7fUio5iwBRwj3qo7lUhMN16qRRXSRE3dLNDMCYp7pid76gkc8geXiBpXgf7AXT8bkRVzJEuHR8P970hnOow8UvONZx1MA9LF3ZW2oJctp+U26Bp8NJRSyY0zMOoBzyDeFLOFIsZoA5VlBW0sFGWx6/UAhuT3uxht65SJ/lvMytMNUhlZ4JCtchFxaz+mHw83XrmkQt6uzYQ4Koh7mCKjmppxBRWo5dgkDgt0jnyG6CGEjvs36ZFCOgLUtgPE87LrQg5LBbfdzIl6S1LrL5lv9mRyisrtTSeAY4whSBCHqJx7FHXiZk5e3+vSq7qGV5Ywt8TkQOxVJ0L5IE6+QwD99wZL+TVXh1IydCXPoTw==","PeerHash":"yyy:8110 779bd5db","AppHash":"","AppType":""}  component=networking package=engine subpack=p2p-proxy

Those are "DataResponse" messages, trying to respond to requests from other nodes for entries from height 81862:
https://testnet.factoid.org/entry?hash=4ef1232d5a126e8f5d85d779b06cbe5bc669681fe2b8a862871878a615bda0ad
https://testnet.factoid.org/entry?hash=630edbc88650f565993e230fefda944f0e54c31b6ab347f3bee83325cccb08e7
https://testnet.factoid.org/entry?hash=a9dcd90989521dcbad8b90bbc7f4be91697de6907a49e4f65e3912349c4e37ec

I'm not sure why it's spamming those replies, particularly for the exact height it is itself trying to sync, but it seems suspect.
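For context, the DROPPING MESSAGES warning above comes from a non-blocking channel send that drops messages once the destination channel passes a fill threshold. A minimal sketch of that pattern in Go (the function name is taken from the log; the body and the exact threshold check are my assumptions, not the actual factomd implementation):

    package p2p

    import "log"

    // nonBlockingChanSend drops a message instead of blocking when the
    // destination channel is over 90 percent full. Hypothetical sketch only.
    func nonBlockingChanSend(ch chan interface{}, msg interface{}) bool {
        if len(ch) >= cap(ch)*9/10 {
            log.Printf("nonBlockingChanSend() - DROPPING MESSAGES. Channel is over 90 percent full! channel len: %d", len(ch))
            return false
        }
        select {
        case ch <- msg:
            return true
        default: // channel filled up between the check and the send
            return false
        }
    }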

After restarting the node, it is now at 100% (82065/82065) and will not sync any further.

WhoSoup commented 3 years ago

Several attempts to restart syncing, on all branches (develop, FD-1225, and dev_wax_merge), failed to sync at all (they never even got to block 1). It seems there was a simultaneous testnet stall (screenshot: https://discordapp.com/channels/419201548372017163/427786712974884874/755741270328082444).

I might have to roll my own local network to be able to debug this better.

WhoSoup commented 3 years ago

Thought I'd try again after the testnet restart with a dev_wax_merge build, but it also stalled out (screenshot).

Once again, the p2p queue was overflowing with DataResponse messages for entries around the height it was trying to sync from a multitude of different peers. I checked out the control panels of some of those nodes and they are fully synced, so they would never send those data requests to us in normal conditions.

This leads me to believe that wax somehow interprets its own requests for messages as requests coming in from other peers, meaning there could be a crossover loop somewhere that inappropriately routes outgoing messages back into the node.
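To illustrate the kind of crossover I mean (a purely hypothetical sketch, not code from the repo): if the broadcast path also fed the node's own inbound queue, every outgoing entry request would immediately show up again as a request from a peer:

    package p2p

    // Hypothetical illustration of the suspected crossover loop: a send helper
    // that also enqueues the message on the node's own inbound channel would
    // make the node treat its own requests as requests arriving from peers.
    func broadcast(outbound, inbound chan<- []byte, msg []byte) {
        outbound <- msg // goes out to the network
        inbound <- msg  // BUG (hypothetical): looped back as if received from a peer
    }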

WhoSoup commented 3 years ago

I found one possible source of this error: p2pproxy is being added to fnode0 twice:

https://github.com/FactomProject/factomd/blob/10afe186954161e3831064b41856cb4385fea3f0/engine/NetStart.go#L343-L344

and

https://github.com/FactomProject/factomd/blob/10afe186954161e3831064b41856cb4385fea3f0/engine/p2pProxy.go#L213-L217

This could be the reason for weird crossover behavior but I'll keep looking while I test this.
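For reference, a duplicate registration matters because anything that fans out over the node's peer/proxy list then handles every message twice. A minimal sketch of that effect (hypothetical types and names, not the actual NetStart/p2pProxy API):

    package engine

    // Hypothetical sketch: if both NetStart and the proxy setup register the
    // same p2pProxy, a fan-out over the registered peers sends every message twice.
    type peer interface{ Send(msg []byte) }

    type node struct{ peers []peer }

    func (n *node) addPeer(p peer) { n.peers = append(n.peers, p) }

    func (n *node) fanOut(msg []byte) {
        for _, p := range n.peers { // duplicate registration => duplicate sends
            p.Send(msg)
        }
    }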

WhoSoup commented 3 years ago
  1. The duplicate p2pproxy was not the source of the error. Still happens with only one p2pproxy.
  2. I also looked into the dual BMVs as the cause but the MsgSplitSelector appears to work as intended and this does not duplicate messages as far as I can tell.
  3. I reimplemented the "DataQ" (where incoming MissingData messages are filtered out early into a separate queue so they don't enter the message execution thread), which also did not solve the problem.

The biggest problem I have so far is that I can't reliably test things, as the error does not happen consistently. I created a LOCAL network over the weekend with ~27k blocks, and the error does occur when syncing against it. It still takes quite a while to manifest, but it's on the order of an hour rather than a day as on the testnet.

WhoSoup commented 3 years ago

I believe I found the cause of the errors: queue sizes and slow processing create a bottleneck that leaves the network layer unable to deliver inbound messages.

In the base version, the flow is roughly:

network (5,000) => inmsgqueue (10,000) => msgqueue (50) => execution

In Wax, it changed to:

network (1,000) => BMV in (100) => BMV out (100) => inmsgqueue (10,000) => msgqueue (50) => execution

The entry sync routine tried to request a total of 1,000 entries at a time, which meant the requests could come in faster than the BMV could clear them. In https://github.com/WhoSoup/factomd/pull/1, I adjusted the queue sizes: network from 1,000 to 2,000 and BMV in/out from 100 to 1,000. Additionally, I re-integrated the separate data queue for data requests and introduced a similar queue for data responses, making the flow:

Network (2,000) => BMV in (1,000) => BMV out (1,000) => Data Queue (10,000) => execution.
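A rough sketch of the early split described above (hypothetical names and types; the actual change lives in the linked PR): data requests and responses get diverted into their own queue before they can crowd out everything else on the execution path:

    package engine

    // Hypothetical message kinds, for illustration only.
    const (
        kindMissingData = iota
        kindDataResponse
        kindOther
    )

    type msg struct{ kind int }

    // route diverts data traffic into a dedicated, larger queue so that bursts
    // of MissingData/DataResponse messages cannot starve the execution queue.
    func route(m msg, dataQ, execQ chan<- msg) {
        switch m.kind {
        case kindMissingData, kindDataResponse:
            dataQ <- m
        default:
            execQ <- m
        }
    }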

WhoSoup commented 3 years ago

The fix proposed above mostly works well, but it still encountered an overflow around height 126487. Once again, the queue was being overloaded with incoming messages, but this time the cause is what prompted the rework in the first place: the entrysync routine just sends too many requests at a time. It has an internal budget for how many entries to request at once (750), but it doesn't split up a dblock that contains more than that.
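The behavior described above, a per-pass budget of 750 that a single large dblock can blow straight past, might look roughly like this (a simplified sketch with hypothetical names, not the actual entrysync code):

    package entrysync

    // Hypothetical sketch of the described behavior: the budget caps how many
    // entries are requested per pass, but a dblock's entries are always
    // requested together, so one block with 12,000 entries still exceeds it.
    func requestPass(blocks [][]string, budget int, request func(hash string)) {
        sent := 0
        for _, entries := range blocks {
            if sent >= budget {
                return // the budget is only checked between blocks...
            }
            for _, h := range entries {
                request(h) // ...so a single oversized block is never split up
                sent++
            }
        }
    }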

The height 126487 isn't arbitrary: it falls during a heavy load test. Before block 126480, the load test ran at around 15 EPS (~9,000 entries per block), which fit just inside the buffer of 10,000. From block 126481 onward, the EPS went up to 20 (~12,000 entries per block), all of which the entry sync routine would try to fetch simultaneously.
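For reference, assuming the testnet's 10-minute blocks: 15 EPS × 600 s ≈ 9,000 entries per block, which still fits inside the 10,000-message buffer, while 20 EPS × 600 s ≈ 12,000 entries per block, which does not.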

This is backed up by a small debug print showing how many messages it tried to send per second:

rate: 4 messages over 51.287739137s = 0.07799134967122515 mps
rate: 541 messages over 1.098852058s = 492.331952589434 mps
rate: 918 messages over 1.000021589s = 917.9801422535628 mps
rate: 3506 messages over 1.019954399s = 3437.4084361675914 mps
rate: 2127 messages over 1.049222221s = 2027.2158464224126 mps
rate: 2551 messages over 1.01068979s = 2524.0186520248944 mps
rate: 2556 messages over 1.018370165s = 2509.8927530878686 mps
rate: 2624 messages over 1.053598198s = 2490.5128780937625 mps
rate: 2732 messages over 1.007259628s = 2712.3095305824263 mps
rate: 1463 messages over 1.002419032s = 1459.4694325230498 mps
rate: 1459 messages over 1.041508923s = 1400.8520813844864 mps
rate: 643 messages over 1.00211344s = 641.6438965326245 mps
rate: 109 messages over 1.000436081s = 108.95248331611872 mps
rate: 333 messages over 1.008625851s = 330.1521429412679 mps
rate: 159 messages over 1.005653642s = 158.10611749887616 mps

(Note: this was a develop build using P2P1)
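The instrumentation behind those lines is just a per-interval counter; a minimal version of that kind of debug print (hypothetical, not the code actually used) could look like:

    package engine

    import (
        "fmt"
        "time"
    )

    // rateLogger counts sends and prints messages-per-second once per interval,
    // matching the "rate: N messages over D = R mps" lines above. Sketch only.
    type rateLogger struct {
        count int
        last  time.Time
    }

    func newRateLogger() *rateLogger { return &rateLogger{last: time.Now()} }

    func (r *rateLogger) tick() {
        r.count++
        if elapsed := time.Since(r.last); elapsed >= time.Second {
            fmt.Printf("rate: %d messages over %v = %v mps\n",
                r.count, elapsed, float64(r.count)/elapsed.Seconds())
            r.count, r.last = 0, time.Now()
        }
    }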

This is a problem addressed in https://github.com/FactomProject/factomd/pull/1044.

It seems there are two problems at play here: 1) the BMV bottleneck and 2) uncapped entrysync for blocks with over a thousand entries. That has made the entire debugging process extraordinarily painful.

I'm going to try and sync from scratch once again using a build that addresses both of these.

WhoSoup commented 3 years ago

This has been fixed in #1056.