NethermindEth / nethermind

A robust execution client for Ethereum node operators.
https://nethermind.io/nethermind-client
GNU General Public License v3.0
1.26k stars 433 forks source link

heavily load CPU when static node not available #2191

Closed varasev closed 1 year ago

varasev commented 4 years ago

I tried to launch https://github.com/poanetwork/posdao-test-setup with Nethermind (see instructions in the README).

If I turn off one of the validators, the rest validators begin to heavily load CPU for some reason.

Steps to reproduce:

  1. Run posdao-test-setup with npm run all-nethermind.

  2. Between blocks 63-70 perform the command kill -15 $(lsof -t -i:8543) - it will stop the third node.

After block 80 (when a new staking epoch starts), the third validator is out of consensus due to the test scenario as expected, but all CPU cores continue to be fully loaded (by 100%).

After I turn on the third node again (using ../nethermind/bin/Nethermind.Runner --config "./config/node3.nethermind.json" >> "./data/node3/log" 2>&1 &), the CPU load is gone.

After that, if I try to kill -15 $(lsof -t -i:8543) again, the CPU begins to be fully loaded again.

varasev commented 4 years ago

If I remove a non-validator node using kill -15 $(lsof -t -i:8545) (after the block 15), I see that the latest (6-th) node starts to consume 50% of CPU.

LukaszRozmej commented 4 years ago

Hi @varasev. I am not able to currently reproduce this on our VM. When we release the new version with required fixes, please do check if this is still an issue.

varasev commented 4 years ago

Hi @varasev. I am not able to currently reproduce this on our VM. When we release the new version with required fixes, please do check if this is still an issue.

Ok, will do

varasev commented 4 years ago

Unfortunately, it still occurs. Here is the screen video recorded on Ubuntu 16.04: screen.mp4.zip

LukaszRozmej commented 4 years ago

@tkstanczak it tries to connect to that peer on localhost in a loop. We probably need some circut breaker for it. Logs:

2020-09-08 14:18:12.9215||NetworkTrace| Initializing [Session|Out|New|30316->127.0.0.1:30305] channel 2020-09-08 14:18:12.9215||NetworkTrace| [Session|Out|New|30316->127.0.0.1:30305] created in peer manager 2020-09-08 14:18:12.9215|PROCESS OUTGOING 0xc1cc59ddcee5d758f736e52f6fc232a21db7db98fd14e694bfcc39e8c480169c0d9ce2f011128c0fc03d210c40312bd1d23a656907fc6a8be00e7b8014ebbacd 2020-09-08 14:18:12.9215|ADDING [Session|Out|New|30316->127.0.0.1:30305] [Peer|127.0.0.1:30305||] 2020-09-08 14:18:12.9215||NetworkTrace| 127.0.0.1:30305 already in active peers 2020-09-08 14:18:12.9215|Channel Registered 2020-09-08 14:18:12.9215||NetworkTrace| 127.0.0.1:30305 error when OUT connecting System.AggregateException: One or more errors occurred. (LogError connecting to 127.0.0.1:30305) ---> DotNetty.Transport.Channels.ConnectException: LogError connecting to 127.0.0.1:30305 ---> System.Net.Sockets.SocketException (111): Connection refused at DotNetty.Transport.Channels.Sockets.SocketChannelAsyncOperation.Validate() at DotNetty.Transport.Channels.Sockets.TcpSocketChannel.DoFinishConnect(SocketChannelAsyncOperation operation) at DotNetty.Transport.Channels.Sockets.AbstractSocketChannel.AbstractSocketUnsafe.FinishConnect(SocketChannelAsyncOperation operation) --- End of inner exception stack trace --- at DotNetty.Transport.Bootstrapping.Bootstrap.DoResolveAndConnectAsync(EndPoint remoteAddress, EndPoint localAddress) --- End of inner exception stack trace --- 2020-09-08 14:18:12.9215|Cannot connect to peer [TargetUnreachable]: 127.0.0.1:30305 2020-09-08 14:18:12.9215|Connecting to 4100 rep node - False, ACTIVE: 6, CAND: 6 2020-09-08 14:18:12.9215|Timeout, doing additional disconnect: 0xc1cc59ddcee5d758f736e52f6fc232a21db7db98fd14e694bfcc39e8c480169c0d9ce2f011128c0fc03d210c40312bd1d23a656907fc6a8be00e7b8014ebbacd 2020-09-08 14:18:12.9215|[Session|Out|Disconnecting|30316->127.0.0.1:30305] invoking 'Disconnecting' event ReceiveMessageTimeout Local timeout 2020-09-08 14:18:12.9215||NetworkTrace| [Session|Out|Disconnecting|30316->127.0.0.1:30305] channel disconnected 2020-09-08 14:18:12.9215|[Session|Out|Disconnecting|30316->127.0.0.1:30305] already disconnected ClientQuitting Remote 2020-09-08 14:18:12.9215|Channel Unregistered 2020-09-08 14:18:12.9215|Handshake with 0xc1cc59ddcee5d758f736e52f6fc232a21db7db98fd14e694bfcc39e8c480169c0d9ce2f011128c0fc03d210c40312bd1d23a656907fc6a8be00e7b8014ebbacd @ finished. Removing NettyHandshakeHandler from the pipeline 2020-09-08 14:18:12.9215||NetworkTrace| [Session|Out|Disconnected|30316->127.0.0.1:30305] disconnected event ReceiveMessageTimeout Local 2020-09-08 14:18:12.9215||NetworkTrace| [Session|Out|Disconnected|30316->127.0.0.1:30305] closing 2020-09-08 14:18:12.9215||NetworkTrace| peer disconnected event in PeerManager - [Session|Out|Disconnected|30316->127.0.0.1:30305] ReceiveMessageTimeout Local 2020-09-08 14:18:12.9215|Adding a stored candidate peer 127.0.0.1:30305 2020-09-08 14:18:12.9215|DEACTIVATING IF DISCONNECTED [Peer|127.0.0.1:30305||[Session|Out|Disconnected|30316->127.0.0.1:30305]] 2020-09-08 14:18:12.9215|DEACTIVATING IF DISCONNECTED [Peer|127.0.0.1:30305||] 2020-09-08 14:18:12.9215|Error while disconnecting on channel on [Session|Out|Disconnected|30316->127.0.0.1:30305] : System.AggregateException: One or more errors occurred. (Transport endpoint is not connected) ---> System.Net.Sockets.SocketException (107): Transport endpoint is not connected at System.Net.Sockets.Socket.UpdateStatusAfterSocketErrorAndThrowException(SocketError error, String callerName) at System.Net.Sockets.Socket.Shutdown(SocketShutdown how) at DotNetty.Transport.Channels.Sockets.TcpSocketChannel.DoClose() at DotNetty.Transport.Channels.AbstractChannel.AbstractUnsafe.DoClose0(TaskCompletionSource promise) --- End of inner exception stack trace --- 2020-09-08 14:18:12.9215|Running peer update loop 76600 - active: 5 | candidates : 6 2020-09-08 14:18:12.9215||NetworkTrace| 127.0.0.1:30305 added to active peers - upgrading candidate 2020-09-08 14:18:12.9215|CONNECTING TO [Peer|127.0.0.1:30305||] 2020-09-08 14:18:12.9215||NetworkTrace| 127.0.0.1:30305 initiating OUT connection 2020-09-08 14:18:12.9215||NetworkTrace| Initializing [Session|Out|New|30316->127.0.0.1:30305] channel 2020-09-08 14:18:12.9215||NetworkTrace| [Session|Out|New|30316->127.0.0.1:30305] created in peer manager 2020-09-08 14:18:12.9215|PROCESS OUTGOING 0xc1cc59ddcee5d758f736e52f6fc232a21db7db98fd14e694bfcc39e8c480169c0d9ce2f011128c0fc03d210c40312bd1d23a656907fc6a8be00e7b8014ebbacd 2020-09-08 14:18:12.9215|ADDING [Session|Out|New|30316->127.0.0.1:30305] [Peer|127.0.0.1:30305||] 2020-09-08 14:18:12.9215||NetworkTrace| 127.0.0.1:30305 already in active peers 2020-09-08 14:18:12.9215|Channel Registered 2020-09-08 14:18:12.9215||NetworkTrace| 127.0.0.1:30305 error when OUT connecting System.AggregateException: One or more errors occurred. (LogError connecting to 127.0.0.1:30305) ---> DotNetty.Transport.Channels.ConnectException: LogError connecting to 127.0.0.1:30305 ---> System.Net.Sockets.SocketException (111): Connection refused at DotNetty.Transport.Channels.Sockets.SocketChannelAsyncOperation.Validate() at DotNetty.Transport.Channels.Sockets.TcpSocketChannel.DoFinishConnect(SocketChannelAsyncOperation operation) at DotNetty.Transport.Channels.Sockets.AbstractSocketChannel.AbstractSocketUnsafe.FinishConnect(SocketChannelAsyncOperation operation) --- End of inner exception stack trace --- at DotNetty.Transport.Bootstrapping.Bootstrap.DoResolveAndConnectAsync(EndPoint remoteAddress, EndPoint localAddress) --- End of inner exception stack trace --- 2020-09-08 14:18:12.9215|Cannot connect to peer [TargetUnreachable]: 127.0.0.1:30305

LukaszRozmej commented 4 years ago

@varasev I wouldn't say its that important for now. It only causes high CPU due to both nodes being on same machine.

tkstanczak commented 4 years ago

is it in the list of static nodes?

LukaszRozmej commented 4 years ago

@tkstanczak yes