Closed bhorvitz closed 1 year ago
I just tested and can confirm that manual removal of a healthy peer works just fine, so whatever is affecting this must be unique to them being stale. I do see that next to FULL_NODE you have
I would have asked you to switch your log level from INFO to DEBUG and see if it writes anything this time, but understandably that won't work: changing the log level requires restarting your node for it to take effect, which will kill all these connections anyway. That said, if stale peers are something you get frequently, changing your log level now would give you the opportunity to catch this issue again and (potentially) get more solid log output, which may help us.
I didn't want to change the log level for just that reason, but being at a dead end here, I did. I think I may have caused this on my end rather than it being a problem with a peer. After restart, I randomly chose someone far out of sync:
FULL_NODE <removed> 61147/8444 40cfd0cd... Nov 07 07:12:29 0.3|0.0
-SB Height: 23424 -Hash: fe5f2ce3...
I then ran a manual remove from the CLI and it hung (I don't recall whether or not this happened before). I had to ^C out of it, and on running it again I got the remove success, yet the peer remains in the table.
Looking at the debug-level logs, the very last thing referencing that node looks like normal conversation (I assume). The first removal attempt, the one that hung, came directly after this point. Neither the first nor subsequent removal attempts produce any log output that I can see. This is simply a grep for 40cfd0cd, showing the last entries with that reference:
2021-11-07T07:12:28.704 full_node full_node_server : DEBUG -> new_signage_point_or_end_of_sub_slot to peer <removed> 40cfd0cd5f0fc5cbbb0f17494d400caf99c71edd65de1a7d27f0d4b5dc2621e2
2021-11-07T07:12:29.016 full_node full_node_server : DEBUG <- request_signage_point_or_end_of_sub_slot from peer 40cfd0cd5f0fc5cbbb0f17494d400caf99c71edd65de1a7d27f0d4b5dc2621e2 <removed>
2021-11-07T07:12:29.016 full_node full_node_server : DEBUG Time taken to process request_signage_point_or_end_of_sub_slot from 40cfd0cd5f0fc5cbbb0f17494d400caf99c71edd65de1a7d27f0d4b5dc2621e2 is 0.00043845176696777344 seconds
2021-11-07T07:12:29.017 full_node full_node_server : DEBUG -> respond_end_of_sub_slot to peer <removed> 40cfd0cd5f0fc5cbbb0f17494d400caf99c71edd65de1a7d27f0d4b5dc2621e2
2021-11-07T07:12:29.443 full_node full_node_server : DEBUG Time taken to process request_proof_of_weight from 40cfd0cd5f0fc5cbbb0f17494d400caf99c71edd65de1a7d27f0d4b5dc2621e2 is 3.1794321537017822 seconds
2021-11-07T07:12:29.473 full_node full_node_server : DEBUG <- request_signage_point_or_end_of_sub_slot from peer 40cfd0cd5f0fc5cbbb0f17494d400caf99c71edd65de1a7d27f0d4b5dc2621e2 <removed>
2021-11-07T07:12:29.474 full_node full_node_server : DEBUG Time taken to process request_signage_point_or_end_of_sub_slot from 40cfd0cd5f0fc5cbbb0f17494d400caf99c71edd65de1a7d27f0d4b5dc2621e2 is 0.0014405250549316406 seconds
Not really wanting to keep it at DEBUG, I set it back to INFO, but I copied off the whole section of the log covering this series of events in case anything else there might be of use.
Thanks! Not sure how much more that log info will help, but at least we gave it a shot to see what it shared.
I'll float this around with some folks and also see if we have a way to replicate forcing a stale peer to trigger it ourselves.
So I have been poking at this for the last couple of days, and I am able to reliably reproduce the stale peer.
I'm not sure what the condition is that causes it, but if you remove enough peers, you will eventually run into one that doesn't delete right away. I scripted something to randomly drop peers (I know, bad neighbor) and keep track of the execution time. Usually it takes ~0.008s to complete, but some removals take much longer.
Here's an example of one that took a long time:
2021-11-08T09:37:48 Removing Node c489b28d... - Success (47.319s)
I also notice that at any logging level, nothing is written to the log until it finishes execution:
2021-11-08T09:37:48.678 full_node full_node_server : INFO Connection closed: <removed>, node id: c489b28db3d9b34bdf456ac603c67249fa36d79ecde8d64f6621f029a6e85c4f
2021-11-08T09:37:48.678 full_node chia.full_node.full_node: INFO peer disconnected {'host': '<removed>', 'port': 8444}
On any of the removals that take a long time, if you kill the task before it finishes, the node becomes a stale entry in the table.
Of course I won't be killing any of those tasks anymore! But as a user sitting on the CLI, it's real tempting to hit ^C when something is apparently stuck doing nothing.
I didn't find any that wouldn't finish eventually, if patient enough.
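The kind of harness described above could be sketched like this. The function names and the slow-removal threshold are my own; it simply shells out to the same `chia show -r <NodeID>` command used throughout this thread and times it:

```python
# Hypothetical reproduction harness (names and flow are mine): repeatedly
# pick a random connected peer, time how long "chia show -r <NodeID>" takes,
# and flag removals that run far longer than the usual ~0.008s.
import random
import subprocess
import time

def timed_remove(node_id, chia_cmd=("chia",)):
    """Run `chia show -r <node_id>` and return the elapsed wall time in seconds."""
    start = time.monotonic()
    subprocess.run([*chia_cmd, "show", "-r", node_id], check=False)
    return time.monotonic() - start

def drop_random_peer(peer_ids, threshold=1.0):
    """Remove one random peer; report it if removal was suspiciously slow."""
    node_id = random.choice(peer_ids)
    elapsed = timed_remove(node_id)
    if elapsed > threshold:
        print(f"Removing Node {node_id}... took {elapsed:.3f}s (slow!)")
    return node_id, elapsed
```

Feeding it the node IDs from `chia show -c` and letting it loop should surface the occasional multi-second removal like the 47s example above.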
We appreciate this bug report and your diagnostics so far. There is certainly some issue here; however, it has a very low priority for us to fix. We will keep this open for now to see if there are further reports, but we don't expect to get to it in the near or medium term. PRs would be accepted, of course.
Having the same issue using 1.2.11, but I have also had it in previous builds, with various node IP addresses at various times. Often the peer that is "stuck" has SB Height = 0, but not always. Checking netstat -tunp shows no connection to the IP address in question. When attempting removal using chia show -r NodeID, the CLI shows "attempting" with no error, but nothing is added to debug.log in INFO mode. Will attempt to capture DEBUG level.
I filed #9621 and was reminded in keybase about this one. Let me know if you want me to close #9621
If you want to get rid of the stale nodes, there is a counter-intuitive workaround: just manually add the stale node again and it will get removed.
chia show -a <IP address of the stale node>:8444
This didn't work for me. I got:
Connecting to IP, 8444
Failed to connect to IP:8444
The node still shows up in chia show -c, and chia show -r still doesn't remove it.
There is a race condition. The ChiaServer class creates a WSChiaConnection, passing its self.connection_closed callback to the connection's init method. It then calls self.connection_added(connection, ...), which adds the connection to the list of peers. The connection may have been closed in between, leaving an already-closed connection in the list. Because the WSChiaConnection.close method only calls close_callback once, this causes the issue described: even if you try to manually disconnect the peer, close won't invoke close_callback again, and the connection remains in the list.
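The race described in the comment above can be modeled with a small asyncio sketch. These are simplified stand-in classes of my own, not the actual ChiaServer/WSChiaConnection code, but they show how a one-shot close callback plus late registration leaves a dead entry behind:

```python
# Simplified model of the described race (NOT the actual chia-blockchain code):
# the server registers a connection after constructing it, but the connection
# may already have closed and fired its one-shot close callback in between.
import asyncio

class Connection:
    def __init__(self, close_callback):
        self._close_callback = close_callback
        self._closed_fired = False  # the callback is only ever called once

    async def close(self):
        if self._closed_fired:       # a second close() is a no-op...
            return
        self._closed_fired = True
        self._close_callback(self)   # ...so the callback fires at most once

class Server:
    def __init__(self):
        self.peers = {}

    def connection_closed(self, conn):
        self.peers.pop(id(conn), None)   # removes nothing if not yet registered

    async def accept(self):
        conn = Connection(self.connection_closed)
        # RACE WINDOW: if the connection closes here, connection_closed fires
        # before registration, so the pop() above finds nothing to remove.
        await conn.close()               # simulate the early close
        self.peers[id(conn)] = conn      # stale entry: already-closed connection
        return conn

async def main():
    server = Server()
    conn = await server.accept()
    await conn.close()                   # manual disconnect: no-op, flag is set
    return len(server.peers)

stale = asyncio.run(main())
print(stale)                             # the stale entry survives
```

A fix along these lines would re-check the connection's closed state immediately after registering it, or make the close callback safe to invoke again on a manual disconnect.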
Same issue, but I didn't look into it. I remove them from the Chia GUI one by one, and they reappear a few minutes later.
I have had the same issue for a long time and was living with it, but after the 1.3.1 release it's becoming a pain in the ass. The whole GUI crashes when trying to disconnect a stale peer; it doesn't crash if I disconnect using the CLI. I am using Windows 10. I now have 6 stale peers that I can't disconnect, and I am tired of restarting Chia every few days.
Update. v1.3.1 on Ubuntu 20.04 LTS server. Node is fully synced. The following peer shows a last connection date of days ago, and Height: No Info. Unable to manually remove it. Tried to add it first as a workaround and that didn't work either.
FULL_NODE 24.12.7.58 8444/8444 53f769cd... Mar 25 14:42:16 0.0|0.0 -Height: No Info -Hash: No Info
Why would a bug about disconnecting stale peers take so long to fix though?
Update. v1.3.2 on Ubuntu 20.04 LTS server. Same here - peer with -Height: No Info -Hash: No Info
"re-adding" with chia show -a IP:Port did work for me immediately.
Any chance we can get this fixed in the upcoming release if it's a simple fix? https://github.com/Chia-Network/chia-blockchain/issues/9172#issuecomment-1000281772 seems to have identified the issue.
Still not working.
Still not working.
Today two of my connected peers show Height: No Info, and ~3 hours later it's still the same, so they are not being dropped automatically by the connection manager. When I tried chia show -r on them, the command returned almost right away, but the peers still show as connected. v1.6.0 on Ubuntu Server.
Well, when I did chia show -a ip:8444 to such a peer, it immediately disappeared from chia show -c result.
Version 1.6.0
What platform are you using? Windows 11
What ui mode are you using? CLI
No Type_node DeltaHeight Height IP ID_Node month day time MiB_UpDwn
0 FULL_NODE 1012 2730941 197.30.219.130 9482fb49... Oct 25 14:56:30 0.9|0.0
1 FULL_NODE 471 2731482 178.32.206.2 e065adb3... Oct 25 17:40:47 20.9|0.0
2 FULL_NODE 2 2731951 178.79.75.177 0ef97871... Oct 25 20:00:51 63.7|1.4
3 FULL_NODE 1 2731952 149.0.16.249 51af2d04... Oct 25 20:00:51 30.4|2.2
4 FULL_NODE 1 2731952 2.176.63.231 cf22500c... Oct 25 20:00:51 13.8|2.8
5 FULL_NODE 1 2731952 5.38.235.180 d86904c5... Oct 25 20:00:51 26.0|3.6
6 FULL_NODE 1 2731952 2.179.221.148 df16d7dc... Oct 25 20:00:51 17.7|1.2
7 FULL_NODE 1 2731952 50.5.142.46 c9958bfe... Oct 25 20:00:51 36.2|5.3
8 FULL_NODE 1 2731952 99.254.47.76 b20c87f7... Oct 25 20:00:51 5.1|7.1
9 FULL_NODE 1 2731952 185.250.252.222 1c1f1f8e... Oct 25 20:00:51 8.0|6.3
10 FULL_NODE 1 2731952 79.46.146.226 8dd926ce... Oct 25 20:00:51 15.6|2.8
11 FULL_NODE 1 2731952 183.195.9.28 c4d737f9... Oct 25 20:00:51 64.4|1.2
12 FULL_NODE 1 2731952 162.233.75.134 9890cce7... Oct 25 20:00:51 5.0|3.5
13 FULL_NODE 1 2731952 85.237.227.34 63725fce... Oct 25 20:00:51 37.9|1.5
14 FULL_NODE 1 2731952 46.114.174.148 8db2da12... Oct 25 20:00:51 27.8|2.0
15 FULL_NODE 1 2731952 120.194.14.114 026b2cc2... Oct 25 20:00:51 56.2|2.8
16 FULL_NODE 1 2731952 185.80.194.231 049ec95a... Oct 25 20:00:52 45.3|3.2
17 FULL_NODE 1 2731952 95.53.10.52 4ea1f0bf... Oct 25 20:00:51 18.1|5.2
18 FULL_NODE 1 2731952 202.188.15.80 4fac1d0e... Oct 25 20:00:51 30.0|1.9
19 FULL_NODE 1 2731952 185.203.216.58 b409d6d0... Oct 25 20:00:51 30.6|3.1
20 FULL_NODE 1 2731952 190.245.192.52 f0a197af... Oct 25 20:00:52 31.2|1.4
21 FULL_NODE 1 2731952 217.114.38.94 d159a032... Oct 25 20:00:51 27.3|1.4
22 FULL_NODE 1 2731952 146.115.119.191 462df0df... Oct 25 20:00:51 7.1|0.6
23 FULL_NODE 1 2731952 181.29.175.159 7a907f5f... Oct 25 20:00:51 5.6|0.2
24 FULL_NODE 1 2731952 183.89.14.48 e4831dc4... Oct 25 20:00:51 24.1|0.2
25 FULL_NODE 1 2731952 195.133.143.205 25ea3643... Oct 25 20:00:51 4.1|0.2
26 FULL_NODE 1 2731952 61.140.24.177 3565b2d1... Oct 25 20:00:51 12.9|0.4
27 FULL_NODE 1 2731952 219.83.188.12 759e51a7... Oct 25 20:00:51 22.7|0.4
28 FULL_NODE 1 2731952 185.42.44.191 65086e7b... Oct 25 20:00:51 10.9|1.0
29 FULL_NODE 1 2731952 72.219.79.9 9989fa18... Oct 25 20:00:51 8.7|0.3
30 FULL_NODE 1 2731952 183.193.169.212 31e6cef0... Oct 25 20:00:51 26.4|1.0
31 FULL_NODE 1 2731952 94.232.7.49 58050ed1... Oct 25 20:00:51 23.5|0.8
32 FULL_NODE 1 2731952 195.18.19.76 af97adc2... Oct 25 20:00:51 15.9|0.8
33 FULL_NODE 1 2731952 83.135.186.136 e85af759... Oct 25 20:00:51 6.7|5.7
34 FULL_NODE 1 2731952 120.226.218.212 665b10a4... Oct 25 20:00:51 12.7|1.8
35 FULL_NODE 1 2731952 110.169.249.52 20d1c763... Oct 25 20:00:51 28.8|1.0
36 FULL_NODE 1 2731952 42.48.97.141 3cf6a69b... Oct 25 20:00:51 20.5|1.1
37 FULL_NODE 1 2731952 118.172.200.157 c340346d... Oct 25 20:00:51 13.0|2.6
38 FULL_NODE 1 2731952 120.230.79.124 24bdb05e... Oct 25 20:00:51 0.5|0.0
39 FULL_NODE 1 2731952 39.112.216.137 2758f336... Oct 25 20:00:51 0.6|0.9
40 FULL_NODE 1 2731952 186.22.14.250 dc3af71e... Oct 25 20:00:51 44.8|2.5
41 FULL_NODE 1 2731952 36.108.110.234 790ed29b... Oct 25 20:00:51 58.0|2.6
42 FULL_NODE 1 2731952 188.234.82.53 856e2c4c... Oct 25 20:00:51 7.1|12.6
43 FULL_NODE 1 2731952 181.84.122.134 b3bfbcd0... Oct 25 20:00:51 5.0|3.3
44 FULL_NODE 1 2731952 171.4.223.247 65e28726... Oct 25 20:00:51 13.8|4.4
45 FULL_NODE 1 2731952 113.91.169.232 9953887e... Oct 25 20:00:51 48.6|3.1
46 FULL_NODE 1 2731952 80.50.233.18 a57a681c... Oct 25 20:00:51 38.9|2.8
47 FULL_NODE 1 2731952 80.147.170.61 64443020... Oct 25 20:00:51 47.7|3.8
48 FULL_NODE 1 2731952 166.70.106.139 49e81082... Oct 25 20:00:51 5.0|46.2
49 FULL_NODE 1 2731952 46.35.227.68 a6484dfd... Oct 25 20:00:51 36.8|4.4
50 FULL_NODE 1 2731952 213.225.9.234 4a9e3959... Oct 25 20:00:51 31.2|3.5
51 FULL_NODE 1 2731952 115.171.198.21 4410eff1... Oct 25 20:00:51 46.6|3.4
52 FULL_NODE 1 2731952 81.254.189.7 5977797e... Oct 25 20:00:51 28.0|5.2
53 FULL_NODE 1 2731952 223.10.231.22 2b335200... Oct 25 20:00:51 41.9|4.2
54 FULL_NODE 1 2731952 88.69.13.221 6b02fbd5... Oct 25 20:00:51 16.7|12.4
55 FULL_NODE 1 2731952 89.28.164.64 0a192d54... Oct 25 20:00:51 31.9|5.6
56 FULL_NODE 1 2731952 173.2.147.80 cc3e5ede... Oct 25 20:00:51 8.0|20.3
57 FULL_NODE 1 2731952 2.155.193.31 e5c33e5b... Oct 25 20:00:51 24.4|9.9
58 FULL_NODE 1 2731952 5.204.122.111 de7222ca... Oct 25 20:00:51 40.5|2.7
59 FULL_NODE 1 2731952 183.226.2.148 2b7cc3dd... Oct 25 20:00:51 45.0|2.1
60 FULL_NODE 1 2731952 78.80.47.235 5412fb24... Oct 25 20:00:51 9.4|16.5
61 FULL_NODE 1 2731952 46.16.214.121 c0870de4... Oct 25 20:00:51 6.4|16.5
62 FULL_NODE 1 2731952 109.165.141.134 603cf65f... Oct 25 20:00:51 29.7|5.7
63 FULL_NODE 1 2731952 124.168.87.205 054de5bd... Oct 25 20:00:51 24.2|2.4
64 FULL_NODE 1 2731952 114.233.112.131 477e7b19... Oct 25 20:00:52 16.7|38.0
65 FULL_NODE 1 2731952 159.146.43.247 4253bffb... Oct 25 20:00:51 30.6|2.4
66 FULL_NODE 1 2731952 85.93.58.245 95d92a8f... Oct 25 20:00:51 31.2|2.2
67 FULL_NODE 1 2731952 120.231.85.209 56312e82... Oct 25 20:00:51 19.4|3.4
68 FULL_NODE 1 2731952 37.15.164.222 a507b440... Oct 25 20:00:51 26.2|5.2
69 FULL_NODE 1 2731952 178.213.231.1 6555c768... Oct 25 20:00:51 18.3|13.5
70 FULL_NODE 1 2731952 205.178.115.193 13f9d400... Oct 25 20:00:51 7.3|23.6
71 FULL_NODE 1 2731952 91.203.176.83 55d92298... Oct 25 20:00:51 29.1|2.7
72 FULL_NODE 1 2731952 149.224.4.131 2a8e35d2... Oct 25 20:00:51 25.5|3.3
73 FULL_NODE 1 2731952 176.215.61.74 9c80f9b9... Oct 25 20:00:51 24.4|5.0
74 FULL_NODE 1 2731952 165.227.219.140 01455b8e... Oct 25 20:00:51 8.2|3.9
75 FULL_NODE 1 2731952 167.61.182.242 745106a1... Oct 25 20:00:51 54.4|3.7
76 FULL_NODE 1 2731952 185.253.160.10 a5fb9601... Oct 25 20:00:51 28.3|3.6
77 FULL_NODE 1 2731952 117.89.20.219 f997e9ab... Oct 25 20:00:52 24.3|9.3
78 FULL_NODE 0 2731953 89.109.48.224 81960e42... Oct 25 20:00:52 10.6|2.7
79 FULL_NODE 0 2731953 195.155.100.195 79af1820... Oct 25 20:00:52 7.9|48.4
Two stale peers cannot be removed:
0 FULL_NODE 1012 2730941 197.30.219.130 9482fb49... Oct 25 14:56:30 0.9|0.0
1 FULL_NODE 471 2731482 178.32.206.2 e065adb3... Oct 25 17:40:47 20.9|0.0
These nodes do not respond to ping:

Ping 197.30.219.130
Ping stats for 197.30.219.130:
    Packets: Sent = 4, Received = 0, Lost = 4 (100% loss)

Ping 178.32.206.2
Ping stats for 178.32.206.2:
    Packets: Sent = 4, Received = 0, Lost = 4 (100% loss)
They cannot be detached directly (.\chia show -r <NodeID>):

.\chia show -r 9482fb49
Attempting to disconnect NodeID 9482fb49
NodeID 9482fb49... FULL_NODE

Tried again, same result:

.\chia show -r 9482fb49
Attempting to disconnect NodeID 9482fb49
NodeID 9482fb49... FULL_NODE

.\chia show -r e065adb3
Attempting to disconnect NodeID e065adb3
NodeID e065adb3... FULL_NODE

Tried again:

.\chia show -r e065adb3
Attempting to disconnect NodeID e065adb3
NodeID e065adb3... FULL_NODE
They also cannot be disconnected via the workaround (.\chia show -a):

.\chia show -a 197.30.219.130:8444
Connecting to 197.30.219.130, 8444
Failed to connect to 197.30.219.130:844

.\chia show -a 178.32.206.2:8444
Connecting to 178.32.206.2, 8444
Failed to connect to 178.32.206.2:8444

Tried again:

.\chia show -a 178.32.206.2:8444
Connecting to 178.32.206.2, 8444
Failed to connect to 178.32.206.2:8444
It is very frustrating that a restart is required to disconnect the old nodes (.\chia.exe start node -r), and that it then takes more than a dozen hours for the ~80 peers to reconnect. During that time, new stale peers reappear, most likely caused by their sudden shutdowns. So this unpleasant old bug is still present in the current chia-blockchain 1.6.0, and it is very sad that the developers have not yet fixed it.
I believe this is fixed in #13727 and should be included with 1.7.0 when released.
Closing issue - please comment if 1.6.2 does not resolve this
What happened?
If a peer has not connected in a long time, "chia show -r peer_id" does not remove it.
Example:
First node is shown for reference of current height and time. Second node is far behind and last contact was more than 4 hours ago.
If I then execute "chia show -r 4528e678", that peer is not removed from the table. With INFO level logging, nothing is recorded in the log. If I execute a remove via RPC call, the return is {"success": true} however, the peer is still not removed.
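The RPC path mentioned above can be exercised directly. This is a sketch assuming the usual chia RPC convention of a JSON POST to the full node RPC port (default 8555), authenticated with the node's client certificate; the helper function names are my own:

```python
# Hedged sketch: build and send a close_connection RPC request to a local
# full node. The endpoint/field names follow the chia RPC convention; the
# certificate paths below are the typical defaults, adjust as needed.
import json
import ssl
import urllib.request

def close_connection_request(node_id, host="localhost", port=8555):
    """Build (but do not send) the close_connection RPC request."""
    url = f"https://{host}:{port}/close_connection"
    body = json.dumps({"node_id": node_id}).encode()
    return urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )

def send(req, cert, key):
    """Send an RPC request using the node's client cert; returns parsed JSON."""
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE      # chia uses its own private CA
    ctx.load_cert_chain(cert, key)
    with urllib.request.urlopen(req, context=ctx) as resp:
        return json.loads(resp.read())   # e.g. {"success": true}
```

As reported above, this call can return {"success": true} while the stale peer still appears in the connection table, which is consistent with the close callback never firing for an already-closed connection.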
Version
1.2.11
What platform are you using?
Linux
What ui mode are you using?
CLI
Relevant log output