IntersectMBO / cardano-node

The core component that is used to participate in a Cardano decentralised blockchain.
https://cardano.org
Apache License 2.0

[BUG] - BP running 1.34.1 drops connection to relay running 1.35.1 and never restores it w/o bouncing relay #4226

Closed nemo83 closed 1 year ago

nemo83 commented 2 years ago

External

Area Stake pool: network connectivity issues between BP and Relay running different versions of the node, 1.35.1 vs 1.34.1

Summary Hello, this is Giovanni and I operate EASY1 Stakepool.

I've recently replaced one of the three relays I run with the latest 1.35.1.

In the past few days I've noticed the BP repeatedly losing its connection to the relay and failing to re-establish it. The only way to connect the two again is to restart the relay.

I can't tell whether a similar issue exists relay-to-relay, but if it does, it could affect network connectivity between stake pools that have upgraded to 1.35.x and others still on a previous version, and dramatically affect block propagation.

I've found at least two more SPOs w/ same issues.

Steps to reproduce Hard to reproduce, it just happens randomly

Expected behavior BP and relay shouldn't lose connection to each other

System info (please complete the following information):
BP running 1.34.1 (official docker image)
Relay running 1.35.1 (official docker image)

[Screenshot 2022-07-23 at 14 58 22]

jmalcolea commented 2 years ago

ALTZ pool - same issue. Additional data: the metric cardano_node_metrics_connectedPeers_int reflects the loss of connection, but netstat or gLiveView show symmetric outgoing and incoming connections, including the one to the 1.35.1 relay. The BP node log shows a different message for the 1.34.1 relay than for the 1.35.1 one:

{"host":"ip-xxx-x","pid":"16xxx","loc":null,"at":"2022-07-22T17:34:00.81Z","ns":["cardano.node.BlockFetchDecision"],"sev":"Info","env":"1.34.1:a357c","data":{"kind":"PeersFetch","peers":[{"peer":{"remote":{"port":"xxxx","addr":"<relay 1.35.1 node IP>"},"local":{"port":"xxxxx","addr":"<BP 1.34.1 node IP>"}},"kind":"FetchDecision declined","declined":"FetchDeclineChainNotPlausible"},{"peer":{"remote":{"port":"xxxx","addr":"<relay 1.34.1 node IP>"},"local":{"port":"xxxxx","addr":"<BP 1.34.1 node IP>"}},"kind":"FetchDecision results","length":"1"}]},"msg":"","thread":"xxx","app":[]}

I don't know if it is related to this issue.
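For anyone wanting to scan their own BP logs for the same pattern, here is a minimal sketch that pulls the declined peers out of one JSON log line. It assumes the log is written one JSON object per line with the same shape as the entry above; the function name `declined_peers` and the shortened sample line are made up for illustration.

```python
import json


def declined_peers(log_line: str) -> list:
    """Return (remote address, decline reason) pairs found in one JSON log line."""
    entry = json.loads(log_line)
    data = entry.get("data")
    # Only BlockFetchDecision entries of kind "PeersFetch" carry a peer list.
    if not isinstance(data, dict) or data.get("kind") != "PeersFetch":
        return []
    found = []
    for item in data.get("peers", []):
        if "declined" in item:
            found.append((item["peer"]["remote"]["addr"], item["declined"]))
    return found


# Shortened copy of the log entry quoted in this thread (placeholders kept as-is).
sample = (
    '{"ns":["cardano.node.BlockFetchDecision"],"sev":"Info",'
    '"data":{"kind":"PeersFetch","peers":['
    '{"peer":{"remote":{"port":"xxxx","addr":"<relay 1.35.1 node IP>"},'
    '"local":{"port":"xxxxx","addr":"<BP 1.34.1 node IP>"}},'
    '"kind":"FetchDecision declined","declined":"FetchDeclineChainNotPlausible"},'
    '{"peer":{"remote":{"port":"xxxx","addr":"<relay 1.34.1 node IP>"},'
    '"local":{"port":"xxxxx","addr":"<BP 1.34.1 node IP>"}},'
    '"kind":"FetchDecision results","length":"1"}]}}'
)

print(declined_peers(sample))
```

Running this over every line of a log file and counting the results per remote address would show whether only the 1.35.1 relay is being declined.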

Dark345 commented 2 years ago

same issue here


reqlez commented 2 years ago

I have all these issues fixed finally. But yes I have seen them...

What I had to do is make sure every relay and the BP are each behind a separate public IP, and I also had to play around with NAT settings.

I also enabled firewall rules between my relays, so my own relays never connect to each other.

reqlez commented 2 years ago

Oh, sorry... this was between 1.34.1 and 1.35.1 ... I had this happening between 1.35.1 relays...

jmalcolea commented 2 years ago

Anyway, it has been recommended to interconnect one's own relays. I don't understand why we should have to avoid connecting our 1.35.x relays to each other.

reqlez commented 2 years ago

> Anyway, it has been recommended to interconnect one's own relays. I don't understand why we should have to avoid connecting our 1.35.x relays to each other.

Well, in my tests I had a choice: either have interconnected relays, or have relays missing from the block producer, so I decided my block producer is more important ;-)

stefiix92 commented 2 years ago

Slightly off-topic question, but how can I enable cardano_node_metrics_connectedPeers_int? I don't have such a metric in my Prometheus metrics.
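A quick way to confirm whether the metric is exposed at all is to read the node's Prometheus output and look for the name. The sketch below parses Prometheus text-format output from a string; the function name `read_metric` and the sample text (including `some_other_metric`) are made up for illustration, and the commented-out fetch assumes the endpoint is served at `/metrics` on the `hasPrometheus` port of your own config.

```python
def read_metric(exposition_text: str, name: str):
    """Return the value of an unlabelled metric as a float, or None if it is absent."""
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        parts = line.split()
        if len(parts) >= 2 and parts[0] == name:
            return float(parts[1])
    return None


# Hypothetical sample output; real output would come from the node, for example:
#   import urllib.request
#   text = urllib.request.urlopen("http://127.0.0.1:12798/metrics").read().decode()
sample = """\
# TYPE some_other_metric gauge
some_other_metric 1
cardano_node_metrics_connectedPeers_int 3
"""

print(read_metric(sample, "cardano_node_metrics_connectedPeers_int"))
```

If the function returns `None` for the real output, the metric is not being emitted and a config flag is probably the cause.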

nemo83 commented 2 years ago

> Slightly off-topic question, but how can I enable cardano_node_metrics_connectedPeers_int? I don't have such a metric in my Prometheus metrics.

Hey, good question. You need to update your config.json: one of these flags needs to be set to true.

{
  "AlonzoGenesisFile": "mainnet-alonzo-genesis.json",
  "AlonzoGenesisHash": "7e94a15f55d1e82d10f09203fa1d40f8eede58fd8066542cf6566008068ed874",
  "ApplicationName": "cardano-sl",
  "ApplicationVersion": 1,
  "ByronGenesisFile": "mainnet-byron-genesis.json",
  "ByronGenesisHash": "5f20df933584822601f9e3f8c024eb5eb252fe8cefb24d1317dc3d432e940ebb",
  "LastKnownBlockVersion-Alt": 0,
  "LastKnownBlockVersion-Major": 3,
  "LastKnownBlockVersion-Minor": 0,
  "MaxKnownMajorProtocolVersion": 2,
  "Protocol": "Cardano",
  "RequiresNetworkMagic": "RequiresNoMagic",
  "ShelleyGenesisFile": "mainnet-shelley-genesis.json",
  "ShelleyGenesisHash": "1a3be38bcbb7911969283716ad7aa550250226b76a61fc51cc9a9a35d9276d81",
  "TraceAcceptPolicy": false,
  "TraceBlockFetchClient": false,
  "TraceBlockFetchDecisions": true,
  "TraceBlockFetchProtocol": false,
  "TraceBlockFetchProtocolSerialised": false,
  "TraceBlockFetchServer": false,
  "TraceChainDb": false,
  "TraceChainSyncBlockServer": false,
  "TraceChainSyncClient": false,
  "TraceChainSyncHeaderServer": false,
  "TraceChainSyncProtocol": false,
  "TraceConnectionManager": false,
  "TraceDNSResolver": false,
  "TraceDNSSubscription": false,
  "TraceDiffusionInitialization": false,
  "TraceErrorPolicy": false,
  "TraceForge": true,
  "TraceHandshake": false,
  "TraceInboundGovernor": false,
  "TraceIpSubscription": false,
  "TraceLedgerPeers": false,
  "TraceLocalChainSyncProtocol": false,
  "TraceLocalErrorPolicy": false,
  "TraceLocalHandshake": false,
  "TraceLocalRootPeers": false,
  "TraceLocalTxSubmissionProtocol": false,
  "TraceLocalTxSubmissionServer": false,
  "TraceMempool": false,
  "TraceMux": false,
  "TracePeerSelection": false,
  "TracePeerSelectionActions": false,
  "TracePublicRootPeers": false,
  "TraceServer": false,
  "TraceTxInbound": false,
  "TraceTxOutbound": false,
  "TraceTxSubmissionProtocol": false,
  "TracingVerbosity": "NormalVerbosity",
  "TurnOnLogMetrics": false,
  "TurnOnLogging": true,
  "MaxConcurrencyBulkSync": 2,
  "MaxConcurrencyDeadline": 3,
  "defaultBackends": [
    "KatipBK"
  ],
  "defaultScribes": [
    [
      "StdoutSK",
      "stdout"
    ]
  ],
  "hasEKG": 12788,
  "hasPrometheus": [
    "0.0.0.0",
    12798
  ],
  "minSeverity": "Info",
  "options": {
    "mapBackends": {
      "cardano.node.metrics": [
        "EKGViewBK"
      ],
      "cardano.node.resources": [
        "EKGViewBK"
      ]
    },
    "mapSubtrace": {
      "cardano.node.metrics": {
        "subtrace": "Neutral"
      }
    }
  },
  "rotation": {
    "rpKeepFilesNum": 10,
    "rpLogLimitBytes": 5000000,
    "rpMaxAgeHours": 24
  },
  "setupBackends": [
    "KatipBK"
  ],
  "setupScribes": [
    {
      "scFormat": "ScText",
      "scKind": "StdoutSK",
      "scName": "stdout",
      "scRotation": null
    }
  ]
}

This is the config from one of my relays that traces that metric. I can't remember exactly which flag enables it, but I seem to remember it being "TraceBlockFetchDecisions": true.

Anyway, diff it against your config and you'll easily find the one or two flags to change.
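If a plain text diff is too noisy, a small script can list only the tracing flags that differ. This is a rough sketch, not an official tool: the function name `differing_flags`, the choice of the `Trace` and `TurnOn` key prefixes, and the file names in the comments are all assumptions for illustration.

```python
import json


def differing_flags(mine: dict, reference: dict, prefixes=("Trace", "TurnOn")) -> dict:
    """Map each flag whose name starts with one of the prefixes to
    (my value, reference value), keeping only flags whose values differ.
    A flag missing from one side shows up as None on that side."""
    keys = {k for k in mine.keys() | reference.keys() if k.startswith(prefixes)}
    return {
        k: (mine.get(k), reference.get(k))
        for k in sorted(keys)
        if mine.get(k) != reference.get(k)
    }


# Tiny in-memory example; with real files you would load them first, e.g.
#   with open("my-config.json") as f:          # hypothetical file name
#       mine = json.load(f)
#   with open("reference-config.json") as f:   # hypothetical file name
#       reference = json.load(f)
mine = {"TraceBlockFetchDecisions": False, "TraceForge": True, "minSeverity": "Info"}
reference = {"TraceBlockFetchDecisions": True, "TraceForge": True, "minSeverity": "Debug"}

print(differing_flags(mine, reference))
```

Keys outside the chosen prefixes (such as `minSeverity` here) are ignored on purpose, so the output stays focused on the flags discussed in this thread.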

dorin100 commented 1 year ago

This should be fixed in 1.35.3 release. Is it ok to close it?

nemo83 commented 1 year ago

> This should be fixed in 1.35.3 release. Is it ok to close it?

Doing it now. Thanks