Chia-Network / chia-blockchain

Chia blockchain python implementation (full node, farmer, harvester, timelord, and wallet)
Apache License 2.0
10.82k stars, 2.03k forks

[Bug] Stale peers can't be removed #9172

Closed bhorvitz closed 1 year ago

bhorvitz commented 2 years ago

What happened?

If a peer has not connected in a long time, "chia show -r peer_id" does not remove it.

Example:

FULL_NODE <removed>                          40698/8444  23cf14b6... Nov 06 15:45:25     13.7|1.2    
                                                 -SB Height:  1104692    -Hash: 9476b861...
FULL_NODE <removed>                          43361/8444  4528e678... Nov 06 11:09:24      0.2|0.0    
                                                 -SB Height:   644118    -Hash: 759b23eb...

The first node is shown as a reference for current height and time. The second node is far behind and last contact was more than 4 hours ago.

If I then execute "chia show -r 4528e678", that peer is not removed from the table. With INFO level logging, nothing is recorded in the log. If I execute a remove via RPC call, the return is {"success": true} however, the peer is still not removed.

Version

1.2.11

What platform are you using?

Linux

What ui mode are you using?

CLI

Relevant log output

N/A

sargonas commented 2 years ago

I just tested and can confirm that manual removal of a healthy peer works just fine, so whatever is affecting this must be unique to the peer being stale. I also see that next to FULL_NODE you have <removed> instead of an ip address, so something about their state is definitely at issue here.

I would have asked you to switch your log level from INFO to DEBUG to see if anything gets written this time, but understandably that won't work: changing the log level requires restarting your node for it to take effect, which would kill all of these connections anyway. That said, if stale peers are something you encounter frequently, changing your log level now would give you the opportunity to catch this issue again and (potentially) capture more useful log output, which may help us.

bhorvitz commented 2 years ago

I didn't want to change the log level for exactly that reason, but being at a dead end, I did. I think I may have caused this on my end rather than it being a problem with a peer. After restarting, I randomly chose a peer far out of sync:

FULL_NODE <removed>                          61147/8444  40cfd0cd... Nov 07 07:12:29      0.3|0.0    
                                                 -SB Height:    23424    -Hash: fe5f2ce3...

I then ran a manual remove from the CLI and it hung (I don't recall whether this happened before). I had to ^C out of it, and upon running it again I got the remove success message, yet the peer remains in the table.

Looking in the debug-level logs, the last entries referencing that node look like normal conversation (I assume). The first removal attempt, the one that hung, came directly after this point. Neither the first nor subsequent removal attempts produce any log output that I can see. This is simply a grep for 40cfd0cd, showing the last entries with that reference:

2021-11-07T07:12:28.704 full_node full_node_server        : DEBUG    -> new_signage_point_or_end_of_sub_slot to peer <removed> 40cfd0cd5f0fc5cbbb0f17494d400caf99c71edd65de1a7d27f0d4b5dc2621e2
2021-11-07T07:12:29.016 full_node full_node_server        : DEBUG    <- request_signage_point_or_end_of_sub_slot from peer 40cfd0cd5f0fc5cbbb0f17494d400caf99c71edd65de1a7d27f0d4b5dc2621e2 <removed>
2021-11-07T07:12:29.016 full_node full_node_server        : DEBUG    Time taken to process request_signage_point_or_end_of_sub_slot from 40cfd0cd5f0fc5cbbb0f17494d400caf99c71edd65de1a7d27f0d4b5dc2621e2 is 0.00043845176696777344 seconds
2021-11-07T07:12:29.017 full_node full_node_server        : DEBUG    -> respond_end_of_sub_slot to peer <removed> 40cfd0cd5f0fc5cbbb0f17494d400caf99c71edd65de1a7d27f0d4b5dc2621e2
2021-11-07T07:12:29.443 full_node full_node_server        : DEBUG    Time taken to process request_proof_of_weight from 40cfd0cd5f0fc5cbbb0f17494d400caf99c71edd65de1a7d27f0d4b5dc2621e2 is 3.1794321537017822 seconds
2021-11-07T07:12:29.473 full_node full_node_server        : DEBUG    <- request_signage_point_or_end_of_sub_slot from peer 40cfd0cd5f0fc5cbbb0f17494d400caf99c71edd65de1a7d27f0d4b5dc2621e2 <removed>
2021-11-07T07:12:29.474 full_node full_node_server        : DEBUG    Time taken to process request_signage_point_or_end_of_sub_slot from 40cfd0cd5f0fc5cbbb0f17494d400caf99c71edd65de1a7d27f0d4b5dc2621e2 is 0.0014405250549316406 seconds

Not really wanting to keep the node at debug level, I set it back to INFO, but I saved off the whole section of log covering this series of events in case anything else there might be of use.

sargonas commented 2 years ago

Thanks! I'm not sure how much more that log info will help, but at least we gave it a shot to see what it shared.

I'll float this around with some folks and also see if we have a way to replicate forcing a stale peer to trigger it ourselves.

bhorvitz commented 2 years ago

So I have been poking at this for the last couple of days and I am able to reliably reproduce the stale peer.

I'm not sure what condition causes it, but if you remove enough peers, you will eventually hit one that doesn't delete right away. I scripted something to randomly drop peers (I know, bad neighbor) and keep track of the execution time. Usually the task completes in ~0.008s, but some removals take much longer.

Here's an example of one that took a long time:

2021-11-08T09:37:48 Removing Node c489b28d... - Success (47.319s)

I also notice that at any logging level, nothing is written to the log until it finishes execution:

2021-11-08T09:37:48.678 full_node full_node_server        : INFO     Connection closed: <removed>, node id: c489b28db3d9b34bdf456ac603c67249fa36d79ecde8d64f6621f029a6e85c4f
2021-11-08T09:37:48.678 full_node chia.full_node.full_node: INFO     peer disconnected {'host': '<removed>', 'port': 8444}

On any of the removals that take a long time, if you kill the task before it finishes, the node becomes a stale entry in the table.

Of course I won't be killing any of those tasks anymore! But as a user sitting at the CLI, it's very tempting to hit ^C when something appears stuck doing nothing.

I didn't find any that wouldn't finish eventually, given enough patience.

emlowe commented 2 years ago

We appreciate this bug report and your diagnostics so far. There is certainly some issue here; however, it is a very low priority for us to fix. We will keep this open for now to see if there are further reports, but we do not expect to get to it in the near or medium term. PRs would of course be accepted.

cybern0id commented 2 years ago

Having the same issue on 1.2.11, but I have also had it in previous builds, with various node IP addresses at various times. Often the "stuck" peer has SB Height = 0, but not always. Checking netstat -tunp shows no connection to the IP address in question. When attempting removal with "chia show -r NodeID", the CLI shows "attempting" with no error, but nothing is added to debug.log at INFO level. I will attempt to capture DEBUG-level output.

jayhohoho2019 commented 2 years ago

I filed #9621 and was reminded in Keybase about this one. Let me know if you want me to close #9621.

skweee commented 2 years ago

If you want to get rid of stale nodes, there is a counter-intuitive workaround: manually add the stale node again and it will get removed:

chia show -a <IP address of the stale node>:8444

jayhohoho2019 commented 2 years ago

This didn't work for me. I got:

Connecting to IP,8444
Failed to connect to IP:8444

The node still shows in "chia show -c", and "chia show -r" still doesn't do anything.

iliasorazov commented 2 years ago

There is a race condition. The ChiaServer class creates a WSChiaConnection, passing its self.connection_closed callback to the init method. It then calls self.connection_added(connection, ...), adding the connection to the list of peers. The connection may have been closed in between, which leaves an already-closed connection in the list.

The WSChiaConnection.close method only calls close_callback once.

https://github.com/Chia-Network/chia-blockchain/blob/659f4f489081939387d6704b77c4477edb278ae1/chia/server/ws_connection.py#L195-L196

This is causing the issue described: even if you try to manually disconnect such a peer, the close method won't call close_callback again, so the connection remains in the list.
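To make the failure mode concrete, here is a minimal sketch of the race as described above. The class and method names (Connection, Server.accept, connection_closed) are hypothetical stand-ins modeled on the report, not the actual chia-blockchain code; the one behavior it mirrors is a close callback that fires at most once.

```python
# Hypothetical sketch of the reported race, not actual chia code.
class Connection:
    def __init__(self, on_close):
        self._on_close = on_close
        self._close_callback_fired = False

    def close(self):
        # Mirrors the reported behavior: close_callback runs only once,
        # so a second close() is a silent no-op.
        if not self._close_callback_fired:
            self._close_callback_fired = True
            self._on_close(self)


class Server:
    def __init__(self):
        self.peers = []

    def connection_closed(self, conn):
        # Removal only works if the connection is in the peer list.
        if conn in self.peers:
            self.peers.remove(conn)

    def accept(self, close_early=False):
        conn = Connection(self.connection_closed)
        if close_early:
            conn.close()          # race: closed before registration...
        self.peers.append(conn)   # ...yet still added to the peer list
        return conn


server = Server()
stale = server.accept(close_early=True)
stale.close()                     # manual removal: callback already consumed
print(len(server.peers))          # prints 1: the stale peer is stuck
```

Under this model, "chia show -r" can report success while the entry stays in the table, because the only code path that removes it (the close callback) has already been consumed.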

hansgfixed commented 2 years ago

Same issue, but I didn't look into it. I remove them from the Chia GUI one by one and they reappear a few minutes later.

alnuweiri commented 2 years ago

I have had the same issue for a long time and was living with it, but after the 1.3.1 release it's becoming a pain in the ass: the whole GUI crashes when trying to disconnect a stale peer. It doesn't crash if I try to disconnect using the CLI, though. I am using Windows 10. I now have 6 stale peers that I can't disconnect, and I am tired of restarting Chia every few days.

jayhohoho2019 commented 2 years ago

Update. v1.3.1 on Ubuntu 20.04 LTS server. Node is fully synced. The following peer shows a last connection date of days ago, and Height: No Info. Unable to manually remove it. Tried to add it first as a workaround and that didn't work either.

FULL_NODE 24.12.7.58 8444/8444 53f769cd... Mar 25 14:42:16 0.0|0.0 -Height: No Info -Hash: No Info

Why is a bug about disconnecting stale peers taking so long to fix, though?

greyslater commented 2 years ago

Update. v1.3.2 on Ubuntu 20.04 LTS server. Same here - peer with -Height: No Info -Hash: No Info

"Re-adding" with "chia show -a IP:Port" worked for me immediately.

gschlager commented 2 years ago

Any chance we can get this fixed in the upcoming release if it's a simple fix? https://github.com/Chia-Network/chia-blockchain/issues/9172#issuecomment-1000281772 seems to have identified the issue.

hansgfixed commented 2 years ago

Still not working.

HVNC1 commented 2 years ago

Still not working.

jayhohoho2019 commented 2 years ago

Today two of my connected peers show Height: No Info, and ~3 hours later they are still the same, so they are not being dropped automatically by the connection manager. When I tried "chia show -r" on them, the command returned almost right away, but the peers still show as connected. v1.6.0 on Ubuntu Server.

jayhohoho2019 commented 2 years ago

Well, when I did "chia show -a ip:8444" to such a peer, it immediately disappeared from the "chia show -c" output.

LudmilaAlex commented 1 year ago

Version

1.6.0

What platform are you using?

Windows 11

What ui mode are you using?

CLI

No Type_node DeltaHeight Height IP ID_Node month day time MiB_UpDwn

0 FULL_NODE 1012 2730941 197.30.219.130 9482fb49... Oct 25 14:56:30 0.9|0.0
1 FULL_NODE 471 2731482 178.32.206.2 e065adb3... Oct 25 17:40:47 20.9|0.0
2 FULL_NODE 2 2731951 178.79.75.177 0ef97871... Oct 25 20:00:51 63.7|1.4
3 FULL_NODE 1 2731952 149.0.16.249 51af2d04... Oct 25 20:00:51 30.4|2.2
4 FULL_NODE 1 2731952 2.176.63.231 cf22500c... Oct 25 20:00:51 13.8|2.8
5 FULL_NODE 1 2731952 5.38.235.180 d86904c5... Oct 25 20:00:51 26.0|3.6
6 FULL_NODE 1 2731952 2.179.221.148 df16d7dc... Oct 25 20:00:51 17.7|1.2
7 FULL_NODE 1 2731952 50.5.142.46 c9958bfe... Oct 25 20:00:51 36.2|5.3
8 FULL_NODE 1 2731952 99.254.47.76 b20c87f7... Oct 25 20:00:51 5.1|7.1
9 FULL_NODE 1 2731952 185.250.252.222 1c1f1f8e... Oct 25 20:00:51 8.0|6.3
10 FULL_NODE 1 2731952 79.46.146.226 8dd926ce... Oct 25 20:00:51 15.6|2.8
11 FULL_NODE 1 2731952 183.195.9.28 c4d737f9... Oct 25 20:00:51 64.4|1.2
12 FULL_NODE 1 2731952 162.233.75.134 9890cce7... Oct 25 20:00:51 5.0|3.5
13 FULL_NODE 1 2731952 85.237.227.34 63725fce... Oct 25 20:00:51 37.9|1.5
14 FULL_NODE 1 2731952 46.114.174.148 8db2da12... Oct 25 20:00:51 27.8|2.0
15 FULL_NODE 1 2731952 120.194.14.114 026b2cc2... Oct 25 20:00:51 56.2|2.8
16 FULL_NODE 1 2731952 185.80.194.231 049ec95a... Oct 25 20:00:52 45.3|3.2
17 FULL_NODE 1 2731952 95.53.10.52 4ea1f0bf... Oct 25 20:00:51 18.1|5.2
18 FULL_NODE 1 2731952 202.188.15.80 4fac1d0e... Oct 25 20:00:51 30.0|1.9
19 FULL_NODE 1 2731952 185.203.216.58 b409d6d0... Oct 25 20:00:51 30.6|3.1
20 FULL_NODE 1 2731952 190.245.192.52 f0a197af... Oct 25 20:00:52 31.2|1.4
21 FULL_NODE 1 2731952 217.114.38.94 d159a032... Oct 25 20:00:51 27.3|1.4
22 FULL_NODE 1 2731952 146.115.119.191 462df0df... Oct 25 20:00:51 7.1|0.6
23 FULL_NODE 1 2731952 181.29.175.159 7a907f5f... Oct 25 20:00:51 5.6|0.2
24 FULL_NODE 1 2731952 183.89.14.48 e4831dc4... Oct 25 20:00:51 24.1|0.2
25 FULL_NODE 1 2731952 195.133.143.205 25ea3643... Oct 25 20:00:51 4.1|0.2
26 FULL_NODE 1 2731952 61.140.24.177 3565b2d1... Oct 25 20:00:51 12.9|0.4
27 FULL_NODE 1 2731952 219.83.188.12 759e51a7... Oct 25 20:00:51 22.7|0.4
28 FULL_NODE 1 2731952 185.42.44.191 65086e7b... Oct 25 20:00:51 10.9|1.0
29 FULL_NODE 1 2731952 72.219.79.9 9989fa18... Oct 25 20:00:51 8.7|0.3
30 FULL_NODE 1 2731952 183.193.169.212 31e6cef0... Oct 25 20:00:51 26.4|1.0
31 FULL_NODE 1 2731952 94.232.7.49 58050ed1... Oct 25 20:00:51 23.5|0.8
32 FULL_NODE 1 2731952 195.18.19.76 af97adc2... Oct 25 20:00:51 15.9|0.8
33 FULL_NODE 1 2731952 83.135.186.136 e85af759... Oct 25 20:00:51 6.7|5.7
34 FULL_NODE 1 2731952 120.226.218.212 665b10a4... Oct 25 20:00:51 12.7|1.8
35 FULL_NODE 1 2731952 110.169.249.52 20d1c763... Oct 25 20:00:51 28.8|1.0
36 FULL_NODE 1 2731952 42.48.97.141 3cf6a69b... Oct 25 20:00:51 20.5|1.1
37 FULL_NODE 1 2731952 118.172.200.157 c340346d... Oct 25 20:00:51 13.0|2.6
38 FULL_NODE 1 2731952 120.230.79.124 24bdb05e... Oct 25 20:00:51 0.5|0.0
39 FULL_NODE 1 2731952 39.112.216.137 2758f336... Oct 25 20:00:51 0.6|0.9
40 FULL_NODE 1 2731952 186.22.14.250 dc3af71e... Oct 25 20:00:51 44.8|2.5
41 FULL_NODE 1 2731952 36.108.110.234 790ed29b... Oct 25 20:00:51 58.0|2.6
42 FULL_NODE 1 2731952 188.234.82.53 856e2c4c... Oct 25 20:00:51 7.1|12.6
43 FULL_NODE 1 2731952 181.84.122.134 b3bfbcd0... Oct 25 20:00:51 5.0|3.3
44 FULL_NODE 1 2731952 171.4.223.247 65e28726... Oct 25 20:00:51 13.8|4.4
45 FULL_NODE 1 2731952 113.91.169.232 9953887e... Oct 25 20:00:51 48.6|3.1
46 FULL_NODE 1 2731952 80.50.233.18 a57a681c... Oct 25 20:00:51 38.9|2.8
47 FULL_NODE 1 2731952 80.147.170.61 64443020... Oct 25 20:00:51 47.7|3.8
48 FULL_NODE 1 2731952 166.70.106.139 49e81082... Oct 25 20:00:51 5.0|46.2
49 FULL_NODE 1 2731952 46.35.227.68 a6484dfd... Oct 25 20:00:51 36.8|4.4
50 FULL_NODE 1 2731952 213.225.9.234 4a9e3959... Oct 25 20:00:51 31.2|3.5
51 FULL_NODE 1 2731952 115.171.198.21 4410eff1... Oct 25 20:00:51 46.6|3.4
52 FULL_NODE 1 2731952 81.254.189.7 5977797e... Oct 25 20:00:51 28.0|5.2
53 FULL_NODE 1 2731952 223.10.231.22 2b335200... Oct 25 20:00:51 41.9|4.2
54 FULL_NODE 1 2731952 88.69.13.221 6b02fbd5... Oct 25 20:00:51 16.7|12.4
55 FULL_NODE 1 2731952 89.28.164.64 0a192d54... Oct 25 20:00:51 31.9|5.6
56 FULL_NODE 1 2731952 173.2.147.80 cc3e5ede... Oct 25 20:00:51 8.0|20.3
57 FULL_NODE 1 2731952 2.155.193.31 e5c33e5b... Oct 25 20:00:51 24.4|9.9
58 FULL_NODE 1 2731952 5.204.122.111 de7222ca... Oct 25 20:00:51 40.5|2.7
59 FULL_NODE 1 2731952 183.226.2.148 2b7cc3dd... Oct 25 20:00:51 45.0|2.1
60 FULL_NODE 1 2731952 78.80.47.235 5412fb24... Oct 25 20:00:51 9.4|16.5
61 FULL_NODE 1 2731952 46.16.214.121 c0870de4... Oct 25 20:00:51 6.4|16.5
62 FULL_NODE 1 2731952 109.165.141.134 603cf65f... Oct 25 20:00:51 29.7|5.7
63 FULL_NODE 1 2731952 124.168.87.205 054de5bd... Oct 25 20:00:51 24.2|2.4
64 FULL_NODE 1 2731952 114.233.112.131 477e7b19... Oct 25 20:00:52 16.7|38.0
65 FULL_NODE 1 2731952 159.146.43.247 4253bffb... Oct 25 20:00:51 30.6|2.4
66 FULL_NODE 1 2731952 85.93.58.245 95d92a8f... Oct 25 20:00:51 31.2|2.2
67 FULL_NODE 1 2731952 120.231.85.209 56312e82... Oct 25 20:00:51 19.4|3.4
68 FULL_NODE 1 2731952 37.15.164.222 a507b440... Oct 25 20:00:51 26.2|5.2
69 FULL_NODE 1 2731952 178.213.231.1 6555c768... Oct 25 20:00:51 18.3|13.5
70 FULL_NODE 1 2731952 205.178.115.193 13f9d400... Oct 25 20:00:51 7.3|23.6
71 FULL_NODE 1 2731952 91.203.176.83 55d92298... Oct 25 20:00:51 29.1|2.7
72 FULL_NODE 1 2731952 149.224.4.131 2a8e35d2... Oct 25 20:00:51 25.5|3.3
73 FULL_NODE 1 2731952 176.215.61.74 9c80f9b9... Oct 25 20:00:51 24.4|5.0
74 FULL_NODE 1 2731952 165.227.219.140 01455b8e... Oct 25 20:00:51 8.2|3.9
75 FULL_NODE 1 2731952 167.61.182.242 745106a1... Oct 25 20:00:51 54.4|3.7
76 FULL_NODE 1 2731952 185.253.160.10 a5fb9601... Oct 25 20:00:51 28.3|3.6
77 FULL_NODE 1 2731952 117.89.20.219 f997e9ab... Oct 25 20:00:52 24.3|9.3
78 FULL_NODE 0 2731953 89.109.48.224 81960e42... Oct 25 20:00:52 10.6|2.7
79 FULL_NODE 0 2731953 195.155.100.195 79af1820... Oct 25 20:00:52 7.9|48.4


Two stale peers cannot be removed:

0 FULL_NODE 1012 2730941 197.30.219.130 9482fb49... Oct 25 14:56:30 0.9|0.0
1 FULL_NODE 471 2731482 178.32.206.2 e065adb3... Oct 25 17:40:47 20.9|0.0

These nodes do not respond to ping:

Ping stats for 197.30.219.130: Packets: Sent = 4, Received = 0, Lost = 4 (100% loss)
Ping stats for 178.32.206.2: Packets: Sent = 4, Received = 0, Lost = 4 (100% loss)

They cannot be disconnected directly (.\chia show -r id); repeated attempts give the same result:

.\chia show -r 9482fb49
Attempting to disconnect NodeID 9482fb49
NodeID 9482fb49... FULL_NODE

.\chia show -r e065adb3
Attempting to disconnect NodeID e065adb3
NodeID e065adb3... FULL_NODE

The re-adding workaround (.\chia show -a) does not work either, even on retry:

.\chia show -a 197.30.219.130:8444
Connecting to 197.30.219.130, 8444
Failed to connect to 197.30.219.130:8444

.\chia show -a 178.32.206.2:8444
Connecting to 178.32.206.2, 8444
Failed to connect to 178.32.206.2:8444

It is very frustrating that a restart (.\chia.exe start node -r) is required to disconnect the old nodes, and it then takes more than a dozen hours for about 80 peers to reconnect. During this time, new stale peers reappear, most likely caused by their sudden shutdowns. So we have an unpleasant old bug, still present in the current chia-blockchain 1.6.0. It is very sad that the developers have not yet fixed it.

emlowe commented 1 year ago

I believe this is fixed in #13727 and should be included with 1.7.0 when released.

emlowe commented 1 year ago

Closing issue - please comment if 1.6.2 does not resolve this