Cardinal-Cryptography / aleph-node-issues

Issue tracker for aleph-node related problems.

aleph-node logs failed connections to a bootnode that is *not* on the BOOT_NODES list #4

Open QNODE-CA opened 1 year ago

QNODE-CA commented 1 year ago

Did you read the documentation and guides?

Is there an existing issue?

Description of the problem

my testnet node 1 (QNODE-TEST) started logging events yesterday (Apr-2-2023), beginning at 2023-04-02 13:13:07 PDT. The events continued until the node was restarted at 2023-04-02 14:06:08 PDT; the same pair of events was logged every 5s or so:

2023-04-02 13:13:07 💔 The bootnode you want to connect to at /ip4/84.245.33.122/tcp/30333/p2p/12D3KooWRkGLz4YbVmrsWK75VjFTs8NvaBu42xhAmQaP4KeJpw1L provided a different peer ID 12D3KooWJgniH6kTFxLAZUVaP2nE7ohrMVjFMXBbbyph1uGYFDL6 than the one you expect 12D3KooWRkGLz4YbVmrsWK75VjFTs8NvaBu42xhAmQaP4KeJpw1L.

I posted about this on the discord testnet-validator-general channel: https://discord.com/channels/775720634848116807/1050508965647233055/1092203956324814848

Another validator (Bart | StakeSafe) confirmed that the logged IP address (84.245.33.122) belongs to his ISP (Cambrium IT Services BV). All of the AZF bootnodes appear to be hosted on Amazon, so there was a mismatch of sorts.

the attempted connection used one of my "regular" peers' IP addresses along with a peer_id belonging to an actual AZF bootnode (the first on the bootnode list): bootnode-eu-central-1-0.test.azero.dev

bootnode-eu-central-1-0.test.azero.dev resolved to the correct IP address (18.184.79.200) on the same testnet machine. The problem persisted until I restarted aleph-node; after restarting, the issue disappeared.
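For what it's worth, the mismatch can be checked by hand: the multiaddr in the log carries both the dialed IP and the expected peer ID, so pulling them apart and comparing the IP against what the bootnode's DNS name resolves to shows which side is off. A minimal sketch (field positions assume the `/ip4/<addr>/tcp/<port>/p2p/<peer_id>` multiaddr layout seen in the log; the `dig` step is left as a comment since it depends on live DNS):

```shell
# Split the multiaddr from the log line into its IP and peer-ID components.
ADDR="/ip4/84.245.33.122/tcp/30333/p2p/12D3KooWRkGLz4YbVmrsWK75VjFTs8NvaBu42xhAmQaP4KeJpw1L"

LOGGED_IP=$(echo "$ADDR" | awk -F/ '{print $3}')
EXPECTED_PEER=$(echo "$ADDR" | awk -F/ '{print $7}')
echo "dialed IP: $LOGGED_IP"
echo "expected peer ID: $EXPECTED_PEER"

# Compare against what the bootnode name actually resolves to, e.g.:
#   dig +short bootnode-eu-central-1-0.test.azero.dev
# which returned 18.184.79.200 here, not 84.245.33.122.
```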

Information on your setup.

testnet validator ✌️
version: 0.10.1-9c6a46952d3 (built from source)
node name: QNODE-TEST
peer_id: 12D3KooWMW86EoNYwBar6e5Gcnmaa13N6JsnzYVPPVGsmvjUddWt
OS: Ubuntu 22.04.02
hardware: VM on ESXi hypervisor, AMD EPYC 7302 (8c/16t dedicated), 64 GB ECC RAM, 1×100 GB SSD, 1×1.92 TB NVMe SSD, 1G connectivity, business fibre (static IP)
IP address: 206.116.52.219

Steps to reproduce

not sure how to reproduce

backup-aleph-node-testnet-2023-04-02T1406.zip

Did you attach relevant logs?

Marcin-Radecki commented 1 year ago

Hi, did anything in particular happen with the network from your validator's perspective around 2023-04-02 13:07:13? It seems that your node was not communicating with other peers (no bytes sent/received), which fixed itself 15 minutes later:

2023-04-02 13:07:12 ✨ Imported #27490656 (0xb6f4…46e7)
2023-04-02 13:07:13 Clique Network status: not maintaining any connections;
2023-04-02 13:07:13 Network status report: authentication connected peers - 99; block sync connected peers - 97;
2023-04-02 13:07:14 💤 Idle (33 peers), best: #27490656 (0xb6f4…46e7), finalized #27490653 (0x4bc1…6c85), ⬇ 15.1kiB/s ⬆ 8.4kiB/s
2023-04-02 13:07:19 💤 Idle (33 peers), best: #27490656 (0xb6f4…46e7), finalized #27490653 (0x4bc1…6c85), ⬇ 0 ⬆ 0.9kiB/s
2023-04-02 13:07:24 💤 Idle (33 peers), best: #27490656 (0xb6f4…46e7), finalized #27490653 (0x4bc1…6c85), ⬇ 0 ⬆ 0.8kiB/s
2023-04-02 13:07:29 💤 Idle (33 peers), best: #27490656 (0xb6f4…46e7), finalized #27490653 (0x4bc1…6c85), ⬇ 0 ⬆ 0.5kiB/s
2023-04-02 13:07:31 Justification requester status report: tries - 4; requested block number - 27490655; hash - 0x8004…9965;tries - 4; requested 1 children of finalized block 0x4bc1…6c85;
2023-04-02 13:07:33 Clique Network status: not maintaining any connections;
2023-04-02 13:07:33 Network status report: authentication connected peers - 99; block sync connected peers - 97;
2023-04-02 13:07:34 💤 Idle (26 peers), best: #27490656 (0xb6f4…46e7), finalized #27490653 (0x4bc1…6c85), ⬇ 0 ⬆ 0.9kiB/s
2023-04-02 13:07:39 💤 Idle (25 peers), best: #27490656 (0xb6f4…46e7), finalized #27490653 (0x4bc1…6c85), ⬇ 0 ⬆ 1.9kiB/s
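The dead window in logs like these can be bounded by filtering for the zero-download Idle lines. A minimal sketch, using a two-line stand-in for the real log file (format copied from the excerpt above, middle fields elided):

```shell
# Build a tiny stand-in for the node log, then count zero-download Idle lines.
cat > sample.log <<'EOF'
2023-04-02 13:07:14 💤 Idle (33 peers), best: #27490656, ⬇ 15.1kiB/s ⬆ 8.4kiB/s
2023-04-02 13:07:19 💤 Idle (33 peers), best: #27490656, ⬇ 0 ⬆ 0.9kiB/s
EOF

# '⬇ 0 ' (with a trailing space) matches a zero download rate
# but not a nonzero rate like '⬇ 15.1kiB/s'.
grep -c '⬇ 0 ' sample.log    # prints 1
```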

QNODE-CA commented 1 year ago

hmm looks like a brief network outage, let me have a look :)

QNODE-CA commented 1 year ago

ok, good find! Looks like I can reproduce this by disconnecting the network for a few seconds:

on the same node, NIC disconnected at 2023-04-04 09:31:16

2023-04-04 09:31:14 💤 Idle (40 peers), best: #27649905 (0xa896…d21d), finalized #27649901 (0x62ad…198b), ⬇ 107.8kiB/s ⬆ 96.6kiB/s
2023-04-04 09:31:15 ✨ Imported #27649906 (0x6beb…df36)
2023-04-04 09:31:16 ✨ Imported #27649907 (0x5e30…9559)
2023-04-04 09:31:16 📪 No longer listening on /ip4/10.0.50.60/tcp/30334

NIC reconnected at 2023-04-04 09:32:19

2023-04-04 09:32:14 💤 Idle (0 peers), best: #27649907 (0x5e30…9559), finalized #27649901 (0x62ad…198b), ⬇ 0 ⬆ 0
2023-04-04 09:32:19 Justification requester status report: tries - 15; requested block number - 27649906; hash - 0x6beb…df36;tries - 16; requested 1 children of finalized block 0x62ad…198b;
2023-04-04 09:32:19 💤 Idle (0 peers), best: #27649907 (0x5e30…9559), finalized #27649901 (0x62ad…198b), ⬇ 0 ⬆ 0
2023-04-04 09:32:21 💔 The bootnode you want to connect to at /ip4/84.245.33.122/tcp/30333/p2p/12D3KooWRkGLz4YbVmrsWK75VjFTs8NvaBu42xhAmQaP4KeJpw1L provided a different peer ID 12D3KooWJgniH6kTFxLAZUVaP2nE7ohrMVjFMXBbbyph1uGYFDL6 than the one you expect 12D3KooWRkGLz4YbVmrsWK75VjFTs8NvaBu42xhAmQaP4KeJpw1L.
2023-04-04 09:32:22 ✨ Imported #27649972 (0xdfba…c3ca)

still not sure why the node keeps picking on poor 84.245.33.122 (at least it's consistent) and matching it with a peer_id from the BOOT_NODES list
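For anyone wanting to retry this, the reproduction above boils down to dropping the NIC for about a minute and then watching the logs for the mismatch message. A dry-run sketch (the interface name is an assumption, and the commands are echoed rather than executed, since the real thing needs root and drops the node's connectivity):

```shell
IFACE=eth0   # hypothetical interface name; adjust for your setup

# Echoed only; strip the 'echo's to actually toggle the link (needs root).
echo "ip link set $IFACE down"
echo "sleep 63    # roughly the outage length logged above"
echo "ip link set $IFACE up"

# Afterwards, watch for the mismatch message, e.g. (unit name is an assumption):
#   journalctl -u aleph-node -f | grep 'provided a different peer ID'
```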

azero-testnet-1 libp2p bootnode bug.zip

QNODE-CA commented 1 year ago

this happened again on a different testnet node (QNODE-TEST-2). This node is located in a DC and has suffered no loss of connectivity, AFAICT: 40 peers solid throughout. Also, it self-corrected in about half a minute. Log attached.

azero-testnet-2 - 2023-04-06 112056 - odd bootnode IP-peer_id.zip

Marcin-Radecki commented 1 year ago

Hi, thanks for the reports. As per the original logs, it does not seem to be an issue affecting the chain. As per the last log provided, there is one moment where the number of peers is 39, then it becomes 40 again, so I think there is some network activity happening in the background. It's hard to tell at this point where the issue is, as it needs local reproduction with some log levels set, etc.

Marcin-Radecki commented 1 year ago

Hi, please let me know if this issue can be closed. Have there been any more reproductions so far?

QNODE-CA commented 1 year ago

Hi, I see this happening occasionally (every couple of weeks or so) on my mainnet node. Last instance was just a couple of days ago:

2023-06-25 03:58:12 💔 The bootnode you want to connect to at /ip4/37.75.153.26/tcp/30333/p2p/12D3KooWJnEGVVmnXhVNxV6KWe3EsDPNvPFNcYbQ6amFVGECVAGB provided a different peer ID 12D3KooWSWzDVGzH9UEeQajj1vgWTC1k4Kg31vX65bWiCXJcERK1 than the one you expect 12D3KooWJnEGVVmnXhVNxV6KWe3EsDPNvPFNcYbQ6amFVGECVAGB.

I have not seen any of this on any of my testnet nodes since upgrading to ver. 0.11.x.

QNODE-CA commented 1 year ago

mainnet node logged another couple of these:

2023-07-08 03:12:34 💔 The bootnode you want to connect to at /ip4/37.75.153.26/tcp/30333/p2p/12D3KooWJnEGVVmnXhVNxV6KWe3EsDPNvPFNcYbQ6amFVGECVAGB provided a different peer ID 12D3KooWSWzDVGzH9UEeQajj1vgWTC1k4Kg31vX65bWiCXJcERK1 than the one you expect 12D3KooWJnEGVVmnXhVNxV6KWe3EsDPNvPFNcYbQ6amFVGECVAGB.

I can see no similar entries for testnet nodes. Hopefully this is affecting ver. 0.10.x only.

QNODE-CA commented 1 year ago

today my testnet-2 node also experienced the same issue:

2023-07-18 11:24:32 💔 The bootnode you want to connect to at /dns4/bootnode-us-east-2-0.test.azero.dev/tcp/30333/p2p/12D3KooWRkGLz4YbVmrsWK75VjFTs8NvaBu42xhAmQaP4KeJpw1L provided a different peer ID 12D3KooWD5s2dkifJua69RbLwEREDdJjsNHvavNRGxdCvzhoeaLc than the one you expect 12D3KooWRkGLz4YbVmrsWK75VjFTs8NvaBu42xhAmQaP4KeJpw1L.

I guess ver. 0.11.3 is not immune from this after all.