NethermindEth / nethermind

A robust execution client for Ethereum node operators.
https://nethermind.io/nethermind-client
GNU General Public License v3.0

"Failed to notify enode" - root cause analysis of NullReferenceException #4722

Open kamilchodola opened 2 years ago

kamilchodola commented 2 years ago

Describe the bug: From time to time, SmokeTests runs show multiple failures with "Failed to notify enode" and a NullReferenceException. We need to investigate and fix the root cause, since this most probably did not appear on pre-merge versions.

This is a minor issue; syncing is not affected by it.

image

kamilchodola commented 2 years ago

@smartprogrammer93 @MarekM25 Any info on this? It still appears on nodes from time to time, and I just wanted to check whether it can cause any issues on a node.

smartprogrammer93 commented 2 years ago

I will check on it tomorrow @kamilchodola

kamilchodola commented 2 years ago

@smartprogrammer93 Great! Thanks

kamilchodola commented 2 years ago

https://seq.nethermind.io/#/events?filter=Contains(NodeName,%20'Smoke-Tests-Snap-144-FastSync-goerli')%20and%20not%20%22Big%20Snappy%20messag%22%20and%20not%20%22block%20producer%20%26%20sealer%22&from=2022-10-21T23:50:00.000Z&to=2022-10-22T00:00:00.000Z

It appeared there on goerli in the current smoke tests.

smartprogrammer93 commented 1 year ago

Hey @kamilchodola,

Let me know if you still see this issue on a version that includes https://github.com/NethermindEth/nethermind/pull/4874.

kamilchodola commented 1 year ago

@smartprogrammer93 I can see it again, but only once, on one node. https://seq.nethermind.io/#/events?filter=Contains(NodeName,%20'Smoke-Tests-master125v2-goerli-lighthouse')%20and%20%22failed%20to%22

image

kamilchodola commented 1 year ago

@smartprogrammer93 Got spammed with those messages again.

image

Unfortunately, the Debug logs were so noisy that the app removed the log file covering this specific occurrence, but it seems the problem is getting a bit more frequent now.

smartprogrammer93 commented 1 year ago

Unfortunately, I am still not sure why this exception is being thrown. I have already investigated it intensively but found no solution. I will try to dig into it more tomorrow.

kamilchodola commented 1 year ago

@smartprogrammer93 I will try to reproduce it, or at least be more careful with those logs and try to capture debug logs for you - maybe that would help.

smartprogrammer93 commented 1 year ago

@kamilchodola are we still seeing these?

smartprogrammer93 commented 1 year ago

@kamilchodola let me know whether you are still seeing these exceptions. If not, I will close this one.

kamilchodola commented 1 year ago

It still happens from time to time, especially on goerli nodes, but very rarely... So apart from the normal logs, it is hard to capture debug logs because those may already have been overwritten.

kamilchodola commented 1 year ago

@smartprogrammer93 Just happened again... I am wondering what problems it may cause for us or for the network in such a case. This time it was on a mainnet-lighthouse pair.

image

smartprogrammer93 commented 1 year ago

@kamilchodola same stack trace (exception details)?

kamilchodola commented 1 year ago

@smartprogrammer93 Yeah - looks slightly different:

image

MarekM25 commented 10 months ago

@smartprogrammer93 still happening

System.NullReferenceException: Object reference not set to an instance of an object.
   at Nethermind.Network.P2P.ProtocolHandlers.SyncPeerProtocolHandlerBase.TxsToSendAndMarkAsNotified(IEnumerable`1 txs, Boolean sendFullTx)+MoveNext() in /_/src/Nethermind/Nethermind.Network/P2P/ProtocolHandlers/SyncPeerProtocolHandlerBase.cs:line 229
   at Nethermind.Network.P2P.Subprotocols.Eth.V65.Eth65ProtocolHandler.SendNewTransactionsCore(IEnumerable`1 txs, Boolean sendFullTx) in /_/src/Nethermind/Nethermind.Network/P2P/Subprotocols/Eth/V65/Eth65ProtocolHandler.cs:line 167
   at Nethermind.TxPool.TxBroadcaster.Notify(ITxPoolPeer peer, IEnumerable`1 txs, Boolean sendFullTx) in /_/src/Nethermind/Nethermind.TxPool/TxBroadcaster.cs:line 310

smartprogrammer93 commented 10 months ago

Yup @MarekM25, I can't figure out the reason. Will try to investigate again when possible.

smartprogrammer93 commented 9 months ago

I spent a couple more hours on this and can't find a way for a null Tx reference to reach this point in the code. I traced it all the way back to TxPool. The only solution (workaround) I see is to check whether Tx is null before trying to read its hash. @MarekM25 let me know if you want me to follow this approach or take any other action.
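
For illustration, the null-check workaround could look roughly like the sketch below. It is a minimal sketch under stated assumptions, not the actual body of TxsToSendAndMarkAsNotified: the `Tx` record, the `NullGuardSketch` class, the `TxsToSend` method, and the `notified` set are all hypothetical stand-ins. The `+MoveNext()` frame in the stack trace indicates that the failing method is a C# iterator, so its body runs lazily and a null element only fails once the caller enumerates the result. The sketch keeps that shape.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a transaction; NOT Nethermind's Transaction type.
public sealed record Tx(string Hash);

public static class NullGuardSketch
{
    // Iterator method: the body executes inside the compiler-generated
    // MoveNext(), i.e. only while the caller enumerates the result.
    // 'notified' is an assumed per-peer set of already-announced hashes.
    public static IEnumerable<Tx> TxsToSend(IEnumerable<Tx?> txs, ISet<string> notified)
    {
        foreach (Tx? tx in txs)
        {
            // The proposed workaround: skip null entries before reading Hash.
            if (tx is null) continue;

            // ISet.Add returns false when the hash is already present,
            // so each transaction is yielded at most once.
            if (notified.Add(tx.Hash)) yield return tx;
        }
    }
}
```

Called with a batch such as `new Tx?[] { new Tx("0xaa"), null, new Tx("0xbb"), new Tx("0xaa") }` and an empty set, enumerating the result yields only the `0xaa` and `0xbb` entries. A guard like this hides the symptom only: it does not explain how a null reference enters the collection in the first place.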