Closed varasev closed 1 year ago
If I remove a non-validator node using kill -15 $(lsof -t -i:8545)
(after the block 15), I see that the latest (6-th) node starts to consume 50% of CPU.
Hi @varasev. I am not able to currently reproduce this on our VM. When we release the new version with required fixes, please do check if this is still an issue.
Hi @varasev. I am not able to currently reproduce this on our VM. When we release the new version with required fixes, please do check if this is still an issue.
Ok, will do
Unfortunately, it still occurs. Here is the screen video recorded on Ubuntu 16.04: screen.mp4.zip
@tkstanczak it tries to connect to that peer on localhost in a loop. We probably need some circut breaker for it. Logs:
2020-09-08 14:18:12.9215||NetworkTrace| Initializing [Session|Out|New|30316->127.0.0.1:30305] channel 2020-09-08 14:18:12.9215||NetworkTrace| [Session|Out|New|30316->127.0.0.1:30305] created in peer manager 2020-09-08 14:18:12.9215|PROCESS OUTGOING 0xc1cc59ddcee5d758f736e52f6fc232a21db7db98fd14e694bfcc39e8c480169c0d9ce2f011128c0fc03d210c40312bd1d23a656907fc6a8be00e7b8014ebbacd 2020-09-08 14:18:12.9215|ADDING [Session|Out|New|30316->127.0.0.1:30305] [Peer|127.0.0.1:30305||] 2020-09-08 14:18:12.9215||NetworkTrace| 127.0.0.1:30305 already in active peers 2020-09-08 14:18:12.9215|Channel Registered 2020-09-08 14:18:12.9215||NetworkTrace| 127.0.0.1:30305 error when OUT connecting System.AggregateException: One or more errors occurred. (LogError connecting to 127.0.0.1:30305) ---> DotNetty.Transport.Channels.ConnectException: LogError connecting to 127.0.0.1:30305 ---> System.Net.Sockets.SocketException (111): Connection refused at DotNetty.Transport.Channels.Sockets.SocketChannelAsyncOperation.Validate() at DotNetty.Transport.Channels.Sockets.TcpSocketChannel.DoFinishConnect(SocketChannelAsyncOperation operation) at DotNetty.Transport.Channels.Sockets.AbstractSocketChannel.AbstractSocketUnsafe.FinishConnect(SocketChannelAsyncOperation operation) --- End of inner exception stack trace --- at DotNetty.Transport.Bootstrapping.Bootstrap.DoResolveAndConnectAsync(EndPoint remoteAddress, EndPoint localAddress) --- End of inner exception stack trace --- 2020-09-08 14:18:12.9215|Cannot connect to peer [TargetUnreachable]: 127.0.0.1:30305 2020-09-08 14:18:12.9215|Connecting to 4100 rep node - False, ACTIVE: 6, CAND: 6 2020-09-08 14:18:12.9215|Timeout, doing additional disconnect: 0xc1cc59ddcee5d758f736e52f6fc232a21db7db98fd14e694bfcc39e8c480169c0d9ce2f011128c0fc03d210c40312bd1d23a656907fc6a8be00e7b8014ebbacd 2020-09-08 14:18:12.9215|[Session|Out|Disconnecting|30316->127.0.0.1:30305] invoking 'Disconnecting' event ReceiveMessageTimeout Local timeout 2020-09-08 14:18:12.9215||NetworkTrace| [Session|Out|Disconnecting|30316->127.0.0.1:30305] channel disconnected 2020-09-08 14:18:12.9215|[Session|Out|Disconnecting|30316->127.0.0.1:30305] already disconnected ClientQuitting Remote 2020-09-08 14:18:12.9215|Channel Unregistered 2020-09-08 14:18:12.9215|Handshake with 0xc1cc59ddcee5d758f736e52f6fc232a21db7db98fd14e694bfcc39e8c480169c0d9ce2f011128c0fc03d210c40312bd1d23a656907fc6a8be00e7b8014ebbacd @ finished. Removing NettyHandshakeHandler from the pipeline 2020-09-08 14:18:12.9215||NetworkTrace| [Session|Out|Disconnected|30316->127.0.0.1:30305] disconnected event ReceiveMessageTimeout Local 2020-09-08 14:18:12.9215||NetworkTrace| [Session|Out|Disconnected|30316->127.0.0.1:30305] closing 2020-09-08 14:18:12.9215||NetworkTrace| peer disconnected event in PeerManager - [Session|Out|Disconnected|30316->127.0.0.1:30305] ReceiveMessageTimeout Local 2020-09-08 14:18:12.9215|Adding a stored candidate peer 127.0.0.1:30305 2020-09-08 14:18:12.9215|DEACTIVATING IF DISCONNECTED [Peer|127.0.0.1:30305||[Session|Out|Disconnected|30316->127.0.0.1:30305]] 2020-09-08 14:18:12.9215|DEACTIVATING IF DISCONNECTED [Peer|127.0.0.1:30305||] 2020-09-08 14:18:12.9215|Error while disconnecting on channel on [Session|Out|Disconnected|30316->127.0.0.1:30305] : System.AggregateException: One or more errors occurred. (Transport endpoint is not connected) ---> System.Net.Sockets.SocketException (107): Transport endpoint is not connected at System.Net.Sockets.Socket.UpdateStatusAfterSocketErrorAndThrowException(SocketError error, String callerName) at System.Net.Sockets.Socket.Shutdown(SocketShutdown how) at DotNetty.Transport.Channels.Sockets.TcpSocketChannel.DoClose() at DotNetty.Transport.Channels.AbstractChannel.AbstractUnsafe.DoClose0(TaskCompletionSource promise) --- End of inner exception stack trace --- 2020-09-08 14:18:12.9215|Running peer update loop 76600 - active: 5 | candidates : 6 2020-09-08 14:18:12.9215||NetworkTrace| 127.0.0.1:30305 added to active peers - upgrading candidate 2020-09-08 14:18:12.9215|CONNECTING TO [Peer|127.0.0.1:30305||] 2020-09-08 14:18:12.9215||NetworkTrace| 127.0.0.1:30305 initiating OUT connection 2020-09-08 14:18:12.9215||NetworkTrace| Initializing [Session|Out|New|30316->127.0.0.1:30305] channel 2020-09-08 14:18:12.9215||NetworkTrace| [Session|Out|New|30316->127.0.0.1:30305] created in peer manager 2020-09-08 14:18:12.9215|PROCESS OUTGOING 0xc1cc59ddcee5d758f736e52f6fc232a21db7db98fd14e694bfcc39e8c480169c0d9ce2f011128c0fc03d210c40312bd1d23a656907fc6a8be00e7b8014ebbacd 2020-09-08 14:18:12.9215|ADDING [Session|Out|New|30316->127.0.0.1:30305] [Peer|127.0.0.1:30305||] 2020-09-08 14:18:12.9215||NetworkTrace| 127.0.0.1:30305 already in active peers 2020-09-08 14:18:12.9215|Channel Registered 2020-09-08 14:18:12.9215||NetworkTrace| 127.0.0.1:30305 error when OUT connecting System.AggregateException: One or more errors occurred. (LogError connecting to 127.0.0.1:30305) ---> DotNetty.Transport.Channels.ConnectException: LogError connecting to 127.0.0.1:30305 ---> System.Net.Sockets.SocketException (111): Connection refused at DotNetty.Transport.Channels.Sockets.SocketChannelAsyncOperation.Validate() at DotNetty.Transport.Channels.Sockets.TcpSocketChannel.DoFinishConnect(SocketChannelAsyncOperation operation) at DotNetty.Transport.Channels.Sockets.AbstractSocketChannel.AbstractSocketUnsafe.FinishConnect(SocketChannelAsyncOperation operation) --- End of inner exception stack trace --- at DotNetty.Transport.Bootstrapping.Bootstrap.DoResolveAndConnectAsync(EndPoint remoteAddress, EndPoint localAddress) --- End of inner exception stack trace --- 2020-09-08 14:18:12.9215|Cannot connect to peer [TargetUnreachable]: 127.0.0.1:30305
@varasev I wouldn't say its that important for now. It only causes high CPU due to both nodes being on same machine.
is it in the list of static nodes?
@tkstanczak yes
I tried to launch https://github.com/poanetwork/posdao-test-setup with Nethermind (see instructions in the README).
If I turn off one of the validators, the rest validators begin to heavily load CPU for some reason.
Steps to reproduce:
Run
posdao-test-setup
withnpm run all-nethermind
.Between blocks 63-70 perform the command
kill -15 $(lsof -t -i:8543)
- it will stop the third node.After block 80 (when a new staking epoch starts), the third validator is out of consensus due to the test scenario as expected, but all CPU cores continue to be fully loaded (by 100%).
After I turn on the third node again (using
../nethermind/bin/Nethermind.Runner --config "./config/node3.nethermind.json" >> "./data/node3/log" 2>&1 &
), the CPU load is gone.After that, if I try to
kill -15 $(lsof -t -i:8543)
again, the CPU begins to be fully loaded again.