IntersectMBO / cardano-node

The core component used to participate in the Cardano decentralised blockchain.
https://cardano.org
Apache License 2.0

[BUG] - private test network running mostly latest 'master' hung overnight #1215

Closed bergr01 closed 3 years ago

bergr01 commented 4 years ago

I will attach a zip containing the last 24 hours' worth of logs from my relay node.

The network had been running fairly smoothly for a few days, with about ten operators, using the config and genesis settings referenced here: https://github.com/cardano-community/guild-operators/tree/master/files/ptn0/files

Woke up this morning to find reports of things being stuck on block 36570; these were the last references in the logs to any block being added to the chain:

{"at":"2020-06-08T15:04:38.63Z","env":"1.13.0:9925e","ns":["cardano.node.ChainDB"],"data":{"kind":"TraceAddBlockEvent.AddedToCurrentChain","newtip":"HashHeader {unHashHeader = f02b9a6aa694304f56200e83399bd6eb1ade1d1e8b020fc83f8810769594b314}@146689","headers":[{"hash":"HashHeader {unHashHeader = f02b9a6aa694304f56200e83399bd6eb1ade1d1e8b020fc83f8810769594b314}","kind":"ShelleyBlock","blockNo":"36570","slotNo":"146689","prevhash":"HashHeader {unHashHeader = 7a487956c1b0d7a18e8973ae957d67238216417a9d2cee7b5d7eff3a916fc7f2}"}]},"app":[],"msg":"","pid":"14199","loc":null,"host":"myserver","sev":"Notice","thread":"51"}

{"at":"2020-06-08T15:04:49.63Z","env":"1.13.0:9925e","ns":["cardano.node.ChainDB"],"data":{"kind":"TraceAddBlockEvent.SwitchedToAFork","newtip":"HashHeader {unHashHeader = 1d03ccd7496854f297bcf5e39d84c68a9950bb17781a5aadaef31e0bcf29b822}@146689","headers":[{"hash":"HashHeader {unHashHeader = 1d03ccd7496854f297bcf5e39d84c68a9950bb17781a5aadaef31e0bcf29b822}","kind":"ShelleyBlock","blockNo":"36570","slotNo":"146689","prevhash":"HashHeader {unHashHeader = 7a487956c1b0d7a18e8973ae957d67238216417a9d2cee7b5d7eff3a916fc7f2}"}]},"app":[],"msg":"","pid":"14199","loc":null,"host":"myserver","sev":"Notice","thread":"51"}

After a restart of the relay node, I spotted the following in my new node instance logs:

{"at":"2020-06-08T22:11:10.42Z","env":"1.13.0:9925e","ns":["cardano.node.ChainSyncClient"],"data":{"exception":"ForkTooDeep Origin (Tip (SlotNo {unSlotNo = 146689}) (ShelleyHash {unShelleyHash = HashHeader {unHashHeader = 1d03ccd7496854f297bcf5e39d84c68a9950bb17781a5aadaef31e0bcf29b822}}) (BlockNo {unBlockNo = 36570})) (Tip (SlotNo {unSlotNo = 20904}) (ShelleyHash {unShelleyHash = HashHeader {unHashHeader = 23cacb5dceba1093d026a88fe51fe46560c82c5a6c98ab11508307331a6d4e71}}) (BlockNo {unBlockNo = 5306}))","kind":"ChainSyncClientEvent.TraceException"},"app":[],"msg":"","pid":"24513","loc":null,"host":"myserver","sev":"Warning","thread":"109"}

Please yell out if you need any more information to help with troubleshooting this.

bergr01 commented 4 years ago

hungNetwork.zip

bergr01 commented 4 years ago

Attached the logs from a relay on a hung network.

rdlrt commented 4 years ago

This should be a high priority, as we may hit this obstacle on the public network. A blockchain should not come to a halt under any circumstances.

mrBliss commented 4 years ago

@bergr01: Near the end of the logs (node-0.json#L12596), I see the following ForkTooDeep exception (similar to the one you mention in the description).

{"at":"2020-06-08T22:09:55.44Z","env":"1.13.0:9925e","ns":["cardano.node.ChainSyncClient"],"data":{"exception":"ForkTooDeep Origin (Tip (SlotNo {unSlotNo = 146689}) (ShelleyHash {unShelleyHash = HashHeader {unHashHeader = 1d03ccd7496854f297bcf5e39d84c68a9950bb17781a5aadaef31e0bcf29b822}}) (BlockNo {unBlockNo = 36570})) (Tip (SlotNo {unSlotNo = 20244}) (ShelleyHash {unShelleyHash = HashHeader {unHashHeader = a563fca9c4ed3b553eb039c2b79b09193a7ac0ca721333f070b92bf2c240eb14}}) (BlockNo {unBlockNo = 5115}))","kind":"ChainSyncClientEvent.TraceException"},"app":[],"msg":"","pid":"14199","loc":null,"host":"myserver","sev":"Warning","thread":"79276"}

Let's zoom in on the exception and what it means:

ForkTooDeep
  Origin
  (Tip
    (SlotNo {unSlotNo = 146689})
    (ShelleyHash {unShelleyHash = HashHeader {unHashHeader = 1d03ccd7496854f297bcf5e39d84c68a9950bb17781a5aadaef31e0bcf29b822}})
    (BlockNo {unBlockNo = 36570}))
  (Tip
    (SlotNo {unSlotNo = 20244})
    (ShelleyHash {unShelleyHash = HashHeader {unHashHeader = a563fca9c4ed3b553eb039c2b79b09193a7ac0ca721333f070b92bf2c240eb14}})
    (BlockNo {unBlockNo = 5115}))

This exception is thrown when your node's ChainSyncClient connects to another node's ChainSyncServer and sees that the other node's chain forks off too far in the past, i.e., more than securityParam (= 72 according to genesis.json) blocks back from the tip of our current chain.

Concretely, the first Tip in the exception is the tip of your chain, i.e., block number 36,570, slot number 146,689, hash 1d03. The second Tip in the exception is the tip of the chain of the node you're connecting to, i.e., block number 5,115, slot number 20,244, hash a563. This tip is 36,570 - 5,115 = 31,455 blocks back from the tip of your chain, which is far more than the maximum of 72 blocks. So the node you're connecting to is so far behind that its chain is not interesting; we wouldn't be able to adopt any blocks from its chain anyway, so we disconnect. This is normal: it is very likely that this node is still syncing and hasn't yet caught up with the rest of the network.
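To make the rule concrete, here is a minimal, self-contained sketch of the check. This is only an illustration, not the actual ouroboros-consensus code, which works on the intersection point negotiated by the ChainSync protocol rather than directly on the two tips:

-- Simplified, hypothetical model of the check behind ForkTooDeep.
securityParam :: Integer
securityParam = 72  -- k from the genesis.json used on this testnet

-- True when the other chain forks off more than securityParam blocks behind
-- our tip, in which case the ChainSyncClient disconnects from that peer.
forkTooDeep :: Integer -> Integer -> Bool
forkTooDeep ourTipBlockNo forkPointBlockNo =
  ourTipBlockNo - forkPointBlockNo > securityParam

main :: IO ()
main =
  -- Numbers from the exception above: block 36570 (our tip) vs block 5115
  -- (the peer's tip), a distance of 31455 blocks, far more than 72.
  print (forkTooDeep 36570 5115)  -- True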

This does not necessarily mean the network is stuck; it may appear so because you're simply not connecting to nodes that are fully synced. Maybe that's the cause? Can you check which nodes you're connecting to?

mrBliss commented 4 years ago

I have removed the bug label because this isn't a bug. We need more info about the nodes you're connecting to. We can add the label again if it turns out to be a bug after all.

rdlrt commented 4 years ago

it may appear so because you're simply not connecting to nodes that are fully synced

When the issue occurred, all (12-15) nodes on that chain appeared to be stuck at the same time. Of course, the network no longer exists because we moved on to a newer genesis and thus a new network - but this occurred twice, and it will probably remain a mystery to us why such a black swan event occurred back then unless we can replicate it again.

Perhaps if we can put together some instructions on what information would be needed the next time it occurs, we can archive this issue and reopen a new one linking back to this one?

mrBliss commented 4 years ago

Perhaps if we can put together some instructions on what information would be needed the next time it occurs, we can archive this issue and reopen a new one linking back to this one?

Logs for all nodes that appear to be stuck, in particular the ForkTooDeep exceptions occurring at the end.

You linked to https://github.com/cardano-community/guild-operators/tree/master/files/ptn0/files, which is very helpful, but that's a link to master (HEAD in fact); it would be better to link to the exact revision, because master changes all the time.
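For example, a permalink that pins the tree to a specific commit looks like the following, where <commit-sha> is just a placeholder for the actual revision you were running:

https://github.com/cardano-community/guild-operators/tree/<commit-sha>/files/ptn0/files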

Maybe somebody from the network team (I'm from the consensus team) can have a look at node-0.json to figure out what's going on exactly. cc: @coot @karknu.

The first message in the excerpt from node-0.json below indicates that the node tried to connect to another node with a different NetworkMagic, i.e., one running a different version of the (test)net. Did the testnet switch to a different genesis at some point? The messages below say "Skipping peer", but I don't know why; somebody from the network team might. The last message is a DNS failure for relay1.ahlnet.nu ("AAAA lookup failed with ServerFailure"). They all seem a bit suspicious to me, but I'm sure somebody from the network team will be able to clarify.

{"at":"2020-06-08T22:09:53.28Z","env":"1.13.0:9925e","ns":["cardano.node.Handshake"],"data":{"event":"Send MsgRefuse (Refused NodeToNodeV_1 \"version data mismatch: NodeToNodeVersionData {networkMagic = NetworkMagic {unNetworkMagic = 4004}} /= NodeToNodeVersionData {networkMagic = NetworkMagic {unNetworkMagic = 4001}}\")","kind":"HandshakeTrace","bearer":"ConnectionId {localAddress = 0.0.0.0:6001, remoteAddress = 133.206.97.96:60648}"},"app":[],"msg":"","pid":"14199","loc":null,"host":"myserver","sev":"Info","thread":"79265"}
...
{"at":"2020-06-08T22:09:53.86Z","env":"1.13.0:9925e","ns":["cardano.node.IpSubscription"],"data":{"event":"Connection Attempt End, destination 88.99.83.86:6000 outcome: ConnectSuccess","dests":"[88.99.83.86:6000,127.0.0.1:6002,127.0.0.1:6003,127.0.0.1:6004,202.168.43.72:4091]","kind":"WithIPList SubscriptionTrace","localAddresses":"LocalAddresses {laIpv4 = Just 0.0.0.0:0, laIpv6 = Nothing, laUnix = Nothing}"},"app":[],"msg":"","pid":"14199","loc":null,"host":"myserver","sev":"Info","thread":"79266"}
{"at":"2020-06-08T22:09:53.86Z","env":"1.13.0:9925e","ns":["cardano.node.IpSubscription"],"data":{"event":"Skipping peer 127.0.0.1:6002","dests":"[88.99.83.86:6000,127.0.0.1:6002,127.0.0.1:6003,127.0.0.1:6004,202.168.43.72:4091]","kind":"WithIPList SubscriptionTrace","localAddresses":"LocalAddresses {laIpv4 = Just 0.0.0.0:0, laIpv6 = Nothing, laUnix = Nothing}"},"app":[],"msg":"","pid":"14199","loc":null,"host":"myserver","sev":"Info","thread":"69"}
{"at":"2020-06-08T22:09:53.86Z","env":"1.13.0:9925e","ns":["cardano.node.IpSubscription"],"data":{"event":"Skipping peer 127.0.0.1:6003","dests":"[88.99.83.86:6000,127.0.0.1:6002,127.0.0.1:6003,127.0.0.1:6004,202.168.43.72:4091]","kind":"WithIPList SubscriptionTrace","localAddresses":"LocalAddresses {laIpv4 = Just 0.0.0.0:0, laIpv6 = Nothing, laUnix = Nothing}"},"app":[],"msg":"","pid":"14199","loc":null,"host":"myserver","sev":"Info","thread":"69"}
{"at":"2020-06-08T22:09:53.86Z","env":"1.13.0:9925e","ns":["cardano.node.IpSubscription"],"data":{"event":"Skipping peer 127.0.0.1:6004","dests":"[88.99.83.86:6000,127.0.0.1:6002,127.0.0.1:6003,127.0.0.1:6004,202.168.43.72:4091]","kind":"WithIPList SubscriptionTrace","localAddresses":"LocalAddresses {laIpv4 = Just 0.0.0.0:0, laIpv6 = Nothing, laUnix = Nothing}"},"app":[],"msg":"","pid":"14199","loc":null,"host":"myserver","sev":"Info","thread":"69"}
{"at":"2020-06-08T22:09:53.86Z","env":"1.13.0:9925e","ns":["cardano.node.IpSubscription"],"data":{"event":"Skipping peer 202.168.43.72:4091","dests":"[88.99.83.86:6000,127.0.0.1:6002,127.0.0.1:6003,127.0.0.1:6004,202.168.43.72:4091]","kind":"WithIPList SubscriptionTrace","localAddresses":"LocalAddresses {laIpv4 = Just 0.0.0.0:0, laIpv6 = Nothing, laUnix = Nothing}"},"app":[],"msg":"","pid":"14199","loc":null,"host":"myserver","sev":"Info","thread":"69"}
...
{"at":"2020-06-08T22:09:58.99Z","env":"1.13.0:9925e","ns":["cardano.node.DnsResolver"],"data":{"event":"AAAA lookup failed with ServerFailure","kind":"DnsTrace","domain":"\"relay1.ahlnet.nu\""},"app":[],"msg":"","pid":"14199","loc":null,"host":"myserver","sev":"Error","thread":"79283"}
{"at":"2020-06-08T22:09:58.99Z","env":"1.13.0:9925e","ns":["cardano.node.DnsSubscription"],"data":{"event":"Failed to start all required subscriptions","kind":"SubscriptionTrace","domain":"\"relay1.ahlnet.nu\""},"app":[],"msg":"","pid":"14199","loc":null,"host":"myserver","sev":"Warning","thread":"76"}
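For what it's worth, here is a rough sketch of what that handshake refusal boils down to. It is a simplified illustration, not the actual ouroboros-network handshake (which also negotiates protocol versions): the two sides compare their NetworkMagic values and refuse the connection when they differ, so a peer configured for a different genesis cannot exchange blocks with this network.

newtype NetworkMagic = NetworkMagic Word
  deriving (Show, Eq)

-- Simplified model of the version-data comparison behind the MsgRefuse above.
checkVersionData :: NetworkMagic -> NetworkMagic -> Either String ()
checkVersionData ours theirs
  | ours == theirs = Right ()
  | otherwise      = Left ("version data mismatch: " ++ show theirs
                             ++ " /= " ++ show ours)

main :: IO ()
main =
  -- The refused connection above involved magics 4004 and 4001.
  print (checkVersionData (NetworkMagic 4001) (NetworkMagic 4004))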
bergr01 commented 4 years ago

Not sure why you're focusing on node-0.json - that one mostly contains logs from the restart after the hang. As I mentioned in my original post, the last block added to the chain was referenced in node-0-20200608144223.json at around 2020-06-08T15:04:38.63Z, so I am guessing that anything useful will be around that area. I can see a bunch of 'TraceRolledBack', 'TraceDownloadedHeader' and 'TraceChainSyncServerReadBlocked.RollBack' entries, for example... interestingly, the latter stop after 15:04:49, around the time there is a SwitchedToAFork followed by a flurry of 'ChainSyncServerEvent.TraceChainSyncServerRead.AddBlock' messages that reference the same Point, hash header and pid but a different thread... that looks rather strange to my uneducated eye.

In that entire log file, spanning 14:42:23 to 17:38:48 (so just under 3 hours), we have:

   2 TraceLedgerEvent
 160 TraceAddBlockEvent
 272 TraceFoundIntersection
 316 TraceRolledBack
 536 TraceDownloadedHeader
1411 TraceReaderEvent
2074 TraceChainSyncServerReadBlocked
7960 TraceChainSyncServerRead

but in the subset of that log file from the last added block to the end of the file:

   1 TraceDownloadedHeader
 272 TraceFoundIntersection
 273 TraceRolledBack
1409 TraceReaderEvent
6981 TraceChainSyncServerRead

So the TraceReaderEvent and TraceFoundIntersection log messages essentially started at that point, while TraceChainSyncServerReadBlocked, TraceDownloadedHeader and TraceAddBlockEvent stopped...

The presence of some peers with the wrong networkMagic has been an ongoing thing on this particular network; I presume it shouldn't cause problems unless it's a majority of peers. The same goes for DNS resolution issues involving one or two peers out of a dozen.

bergr01 commented 4 years ago

Taking it a bit further, the first 20 minutes of that log file (i.e. before the last block was added) feature ErrorPolicySuspendPeer just 6 times and zero "event: AAAA" entries, while after the hang there are 404 instances of the AAAA event and 1694 instances of ErrorPolicySuspendPeer. So yeah, let's hope it wasn't ahlnet's DNS issues that caused the mass extinction.

Also, looking at the 'before hang' chunk:

grep -o ".event\":\"Application[^,]" bef.txt | sort
{"at":"2020-06-08T14:59:55.32Z","env":"1.13.0:9925e","ns":["cardano.node.IpSubscription"],"data":{"event":"Application Exception: 88.99.83.86:6000 ExceededTimeLimit"
{"at":"2020-06-08T15:00:09.06Z","env":"1.13.0:9925e","ns":["cardano.node.IpSubscription"],"data":{"event":"Application Exception: 88.99.83.86:6000 ExceededTimeLimit"
{"at":"2020-06-08T15:02:20.62Z","env":"1.13.0:9925e","ns":["cardano.node.IpSubscription"],"data":{"event":"Application Exception: 88.99.83.86:6000 ExceededTimeLimit"
{"at":"2020-06-08T15:02:40.63Z","env":"1.13.0:9925e","ns":["cardano.node.IpSubscription"],"data":{"event":"Application Exception: 88.99.83.86:6000 ExceededTimeLimit"

Looking at the 'after hang' chunk:

grep -o ".event\":\"Application[^,]" aft.txt | sort
{"at":"2020-06-08T15:04:52.17Z","env":"1.13.0:9925e","ns":["cardano.node.IpSubscription"],"data":{"event":"Application Exception: 88.99.83.86:6000 ExceededTimeLimit"
{"at":"2020-06-08T15:05:20.00Z","env":"1.13.0:9925e","ns":["cardano.node.IpSubscription"],"data":{"event":"Application Exception: 88.99.83.86:6000 ExceededTimeLimit"
{"at":"2020-06-08T15:05:54.19Z","env":"1.13.0:9925e","ns":["cardano.node.DnsSubscription"],"data":{"event":"Application Exception: 178.174.235.99:2101 ExceededTimeLimit"
{"at":"2020-06-08T15:05:57.41Z","env":"1.13.0:9925e","ns":["cardano.node.DnsSubscription"],"data":{"event":"Application Exception: 54.189.61.80:3009 ExceededTimeLimit"
{"at":"2020-06-08T15:06:05.28Z","env":"1.13.0:9925e","ns":["cardano.node.IpSubscription"],"data":{"event":"Application Exception: 127.0.0.1:6003 ExceededTimeLimit"
{"at":"2020-06-08T15:07:09.41Z","env":"1.13.0:9925e","ns":["cardano.node.DnsSubscription"],"data":{"event":"Application Exception: 54.189.61.80:3009 ExceededTimeLimit"
{"at":"2020-06-08T15:07:23.72Z","env":"1.13.0:9925e","ns":["cardano.node.IpSubscription"],"data":{"event":"Application Exception: 88.99.83.86:6000 ExceededTimeLimit"
{"at":"2020-06-08T15:08:33.72Z","env":"1.13.0:9925e","ns":["cardano.node.IpSubscription"],"data":{"event":"Application Exception: 127.0.0.1:6003 ExceededTimeLimit"

What causes the 'ExceededTimeLimit' exception, what is the limit, and is it configurable?

Interestingly enough, the entire node-0-20200608130700.json (which spans the hour and 35 minutes immediately before the final 20 minutes leading up to the hang) features zero 'event: Application' exceptions.

mrBliss commented 4 years ago

Not sure why you're focusing on node-0.json - that one mostly contains logs from the restart after the hang. As I mentioned in my original post, the last block added to the chain was referenced in node-0-20200608144223.json at around 2020-06-08T15:04:38.63Z, so I am guessing that anything useful will be around that area. I can see a bunch of 'TraceRolledBack', 'TraceDownloadedHeader' and 'TraceChainSyncServerReadBlocked.RollBack' entries, for example... interestingly, the latter stop after 15:04:49, around the time there is a SwitchedToAFork followed by a flurry of 'ChainSyncServerEvent.TraceChainSyncServerRead.AddBlock' messages that reference the same Point, hash header and pid but a different thread... that looks rather strange to my uneducated eye.

In that entire log file, spanning 14:42:23 to 17:38:48 (so just under 3 hours), we have:

   2 TraceLedgerEvent
 160 TraceAddBlockEvent
 272 TraceFoundIntersection
 316 TraceRolledBack
 536 TraceDownloadedHeader
1411 TraceReaderEvent
2074 TraceChainSyncServerReadBlocked
7960 TraceChainSyncServerRead

but in the subset of that log file from the last added block to the end of the file:

   1 TraceDownloadedHeader
 272 TraceFoundIntersection
 273 TraceRolledBack
1409 TraceReaderEvent
6981 TraceChainSyncServerRead

So the TraceReaderEvent and TraceFoundIntersection log messages essentially started at that point, while TraceChainSyncServerReadBlocked, TraceDownloadedHeader and TraceAddBlockEvent stopped...

The presence of some peers with the wrong networkMagic has been an ongoing thing on this particular network; I presume it shouldn't cause problems unless it's a majority of peers. The same goes for DNS resolution issues involving one or two peers out of a dozen.

TraceChainSyncServerReadBlocked, TraceChainSyncServerRead, and TraceReaderEvent are traced by the ChainSyncServer, which sends the headers of your node's chain to other nodes.

TraceFoundIntersection, TraceRolledBack, and TraceDownloadedHeader are traced by the ChainSyncClient, which receives headers from other nodes' chains. Based on these headers, we choose which blocks to download, which are then added to the ChainDB, which will perform chain selection and validation on them and adopt blocks (TraceAddBlockEvent).

So before the hang, the node was receiving headers and blocks and sending headers, but afterwards it was only sending headers to other nodes.

This matches exactly with what you're saying.
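If it helps when slicing the logs this way, here is a small sketch of that mapping from trace kind to the component emitting it. The grouping simply follows the explanation above; any other kind is left unclassified:

data Component = ChainSyncServer | ChainSyncClient | ChainDB
  deriving (Show, Eq)

-- Maps the "kind" of a trace message to the component that emits it,
-- per the explanation above; anything else is left unclassified (Nothing).
component :: String -> Maybe Component
component kind
  | kind `elem` serverKinds      = Just ChainSyncServer
  | kind `elem` clientKinds      = Just ChainSyncClient
  | kind == "TraceAddBlockEvent" = Just ChainDB
  | otherwise                    = Nothing
  where
    serverKinds = ["TraceChainSyncServerRead", "TraceChainSyncServerReadBlocked", "TraceReaderEvent"]
    clientKinds = ["TraceFoundIntersection", "TraceRolledBack", "TraceDownloadedHeader"]

main :: IO ()
main = mapM_ (print . component)
  ["TraceDownloadedHeader", "TraceChainSyncServerRead", "TraceAddBlockEvent"]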

Taking it a bit further, the first 20 minutes of that log file (i.e. before the last block was added) feature ErrorPolicySuspendPeer just 6 times and zero "event: AAAA" entries, while after the hang there are 404 instances of the AAAA event and 1694 instances of ErrorPolicySuspendPeer. So yeah, let's hope it wasn't ahlnet's DNS issues that caused the mass extinction.

Also, looking at the 'before hang' chunk:

grep -o ".event\":\"Application[^,]" bef.txt | sort
{"at":"2020-06-08T14:59:55.32Z","env":"1.13.0:9925e","ns":["cardano.node.IpSubscription"],"data":{"event":"Application Exception: 88.99.83.86:6000 ExceededTimeLimit"
{"at":"2020-06-08T15:00:09.06Z","env":"1.13.0:9925e","ns":["cardano.node.IpSubscription"],"data":{"event":"Application Exception: 88.99.83.86:6000 ExceededTimeLimit"
{"at":"2020-06-08T15:02:20.62Z","env":"1.13.0:9925e","ns":["cardano.node.IpSubscription"],"data":{"event":"Application Exception: 88.99.83.86:6000 ExceededTimeLimit"
{"at":"2020-06-08T15:02:40.63Z","env":"1.13.0:9925e","ns":["cardano.node.IpSubscription"],"data":{"event":"Application Exception: 88.99.83.86:6000 ExceededTimeLimit"

Looking at the 'after hang' chunk:

grep -o ".event\":\"Application[^,]" aft.txt | sort
{"at":"2020-06-08T15:04:52.17Z","env":"1.13.0:9925e","ns":["cardano.node.IpSubscription"],"data":{"event":"Application Exception: 88.99.83.86:6000 ExceededTimeLimit"
{"at":"2020-06-08T15:05:20.00Z","env":"1.13.0:9925e","ns":["cardano.node.IpSubscription"],"data":{"event":"Application Exception: 88.99.83.86:6000 ExceededTimeLimit"
{"at":"2020-06-08T15:05:54.19Z","env":"1.13.0:9925e","ns":["cardano.node.DnsSubscription"],"data":{"event":"Application Exception: 178.174.235.99:2101 ExceededTimeLimit"
{"at":"2020-06-08T15:05:57.41Z","env":"1.13.0:9925e","ns":["cardano.node.DnsSubscription"],"data":{"event":"Application Exception: 54.189.61.80:3009 ExceededTimeLimit"
{"at":"2020-06-08T15:06:05.28Z","env":"1.13.0:9925e","ns":["cardano.node.IpSubscription"],"data":{"event":"Application Exception: 127.0.0.1:6003 ExceededTimeLimit"
{"at":"2020-06-08T15:07:09.41Z","env":"1.13.0:9925e","ns":["cardano.node.DnsSubscription"],"data":{"event":"Application Exception: 54.189.61.80:3009 ExceededTimeLimit"
{"at":"2020-06-08T15:07:23.72Z","env":"1.13.0:9925e","ns":["cardano.node.IpSubscription"],"data":{"event":"Application Exception: 88.99.83.86:6000 ExceededTimeLimit"
{"at":"2020-06-08T15:08:33.72Z","env":"1.13.0:9925e","ns":["cardano.node.IpSubscription"],"data":{"event":"Application Exception: 127.0.0.1:6003 ExceededTimeLimit"

What causes the 'ExceededTimeLimit' exception, what is the limit, and is it configurable?

Interestingly enough, the entire node-0-20200608130700.json (which spans the hour and 35 minutes immediately before the final 20 minutes leading up to the hang) features zero 'event: Application' exceptions.

This is interesting; I recall the network team merging something related to this recently: https://github.com/input-output-hk/ouroboros-network/pull/2249. @karknu do you think that PR should fix the problem?

karknu commented 4 years ago

This is interesting; I recall the network team merging something related to this recently: input-output-hk/ouroboros-network#2249. @karknu do you think that PR should fix the problem?

It does look like it could be the issue fixed by #2249.

coot commented 4 years ago

Is this still live?

rdlrt commented 4 years ago

@coot The old network itself is down; we're trying to see if the issue reoccurs on 1.15 now.

vix-io commented 3 years ago

hi @bergr01 can you still reproduce this?

Jimbo4350 commented 3 years ago

Closing. Re-open if necessary. cc: @rdlrt @bergr01