op-node p2p issue on node restarts

sbvegan commented 2 months ago

Bug Description

Replica nodes are failing to reconnect when the sequencer restarts.

Steps to Reproduce

We have a customer that has reported:

On v1.9.1 of the stack, whenever the sequencer node restarts, our own validator nodes are unable to reconnect to it via p2p, with the following error:

info dialing static peer addrs: [/ip4/<ip>/tcp/9003]
warn error reconnecting to static peer err: failed to dial <enode>: no addresses

Restarting the validator node allows it to reconnect again, which implies that the problem is not connectivity but the internal state of the validator node. This issue is reproducible every time the sequencer node is restarted. It prevents propagation of unsafe blocks to validator nodes and disrupts the whole chain.

Expected behavior

The nodes should reconnect.

Environment Information:

Operating System: not sure
Package Version (or commit hash): op-node v1.9.1

Configurations:

Validator (Replica) Node

op_node_l1_eth_rpc: "wss://<l1_rpc_url>"
op_node_l1_rpc_kind: "standard"
op_node_l2_engine_auth: "/etc/secret-volume/jwt"
op_node_rollup_load_protocol_versions: "true"
op_node_rollup_halt: "major"
op_node_rollup_config: "/persistent/config/rollup.json"
op_node_sequencer_enabled: "false"
op_node_sequencer_l1_confs: "4"
op_node_verifier_l1_confs: "5"
op_node_log_format: "json"
op_node_log_level: "info"
op_node_p2p_disable: "false"
op_node_p2p_listen_ip: "0.0.0.0"
op_node_p2p_listen_tcp_port: "9003"
op_node_p2p_listen_udp_port: "9003"
op_node_p2p_peer_scoring: "none"
op_node_p2p_peer_banning: "false"
op_node_p2p_peer_banning_duration: "0h1m0s"
op_node_p2p_bootnodes: "enr:<bootnode_enr>"
op_node_p2p_advertise_tcp: "9003"
op_node_p2p_advertise_udp: "9003"
op_node_p2p_sync_req_resp: "true"
op_node_p2p_static: "<sequencer_multiaddr>"
op_node_rpc_addr: "0.0.0.0"
op_node_rpc_port: "8545"
op_node_rpc_enable_admin: "true"
op_node_snapshot_log: "/persistent/snapshot.log"
op_node_metrics_enabled: "true"
op_node_metrics_addr: "0.0.0.0"
op_node_metrics_port: "7300"
op_node_pprof_enabled: "true"
op_node_altda_enabled: "false"
op_node_altda_da_service: "true"
op_node_altda_da_server: "<altda_server_url>"
op_node_l1_beacon: "https://<beacon_api_url>"
op_node_p2p_priv_raw: "<p2p_pk>"
op_node_l2_engine_rpc: "http://<geth_url>"
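
Since op_node_p2p_static is redacted above, it may be worth ruling out a malformed entry before digging into node state. The following is a minimal sketch, not op-node's own parsing: it assumes the static peer is written in the usual textual multiaddr form /ip4/<ip>/tcp/<port>/p2p/<peer-id>, and it assumes every protocol in the address takes exactly one value (true for the protocols used here, not for multiaddrs in general). The function name parse_static_multiaddr is hypothetical.

```python
# Protocols that can carry the host part of the address (assumed set for this sketch).
HOST_PROTOCOLS = ("ip4", "ip6", "dns", "dns4", "dns6")


def parse_static_multiaddr(addr):
    """Split a textual multiaddr into {protocol: value} and check the basics.

    Simplification: assumes each protocol is followed by exactly one value.
    """
    parts = addr.strip().split("/")
    # A leading "/" produces an empty first element; anything else is not a multiaddr.
    if len(parts) < 3 or parts[0] != "":
        raise ValueError(f"not a multiaddr: {addr!r}")
    fields = parts[1:]
    if len(fields) % 2 != 0:
        raise ValueError(f"protocol without a value in: {addr!r}")
    components = dict(zip(fields[0::2], fields[1::2]))
    # An empty host, port, or peer ID leaves the dialer nothing usable.
    if not any(components.get(proto) for proto in HOST_PROTOCOLS):
        raise ValueError(f"missing host in: {addr!r}")
    for required in ("tcp", "p2p"):
        if not components.get(required):
            raise ValueError(f"missing /{required}/ component in: {addr!r}")
    return components
```

For example, parse_static_multiaddr("/ip4/10.0.0.5/tcp/9003/p2p/16Uiu2HAmExamplePeerId") returns the three components (the IP and peer ID here are made up), while an entry with an empty host such as "/ip4//tcp/9003" raises ValueError.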

Logs:

info dialing static peer addrs: [<ip>/tcp/9003]
warn error reconnecting to static peer err: failed to dial <enode>: no addresses
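
As a possible stopgap that avoids restarting the validator, the node could be asked to dial the sequencer again over JSON-RPC. This is an untested sketch under several assumptions: that the op-node build in use exposes the opp2p namespace with an opp2p_connectPeer method taking a multiaddr string, that the namespace is served on the RPC port from the config above (8545), and that the manual dial does not hit the same "no addresses" state. The helper names are hypothetical.

```python
import json
import urllib.request


def connect_peer_request(multiaddr, request_id=1):
    """Build the JSON-RPC payload asking the node to dial one peer.

    Assumption: the method name opp2p_connectPeer is available on this build.
    """
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "opp2p_connectPeer",
        "params": [multiaddr],
    }


def send_rpc(rpc_url, payload, timeout=5.0):
    """POST a JSON-RPC payload and return the decoded response."""
    request = urllib.request.Request(
        rpc_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Usage would look like send_rpc("http://<validator_host>:8545", connect_peer_request("<sequencer_multiaddr>")). If the response carries an "error" field, the manual dial failed as well, which would itself be useful to add to this report.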


sbvegan commented 2 months ago

A second report:

As it happens, we actually experienced this (or some version of it) twice over the last few days. Based on our latest investigation, it appears that part of the problem is due to the following behaviour:

The Sequencers have the OP_NODE_P2P_STATIC env var set to the relay peer IDs.
At Sequencer startup time, this flag is respected and the sequencers appear to connect to the listed static peers as expected.
However, if one of the static peers restarts while the Sequencer is running, the Sequencer attempts to reconnect to the peer via a different IP and not the one listed in OP_NODE_P2P_STATIC. The different IP is not accessible from the Sequencer, so the reconnection attempt fails. A Sequencer restart generally fixes the issue, presumably because it connects via the IP specified in OP_NODE_P2P_STATIC rather than the other one. However, this makes Relay operations much riskier and operationally heavier, since they always require a Sequencer restart as well.
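
The behaviour in the three points above can be illustrated with a toy model. This is not op-node or libp2p code, and it does not claim to be the confirmed root cause. It assumes that the node keeps one dial address per peer, and that a restarted peer can overwrite the configured address with one it advertises about itself. All names, IPs, and the pin_static switch are hypothetical.

```python
def pick_dial_addr(peer_id, peerstore, static_addrs, pin_static):
    """Choose which address to dial for a peer.

    With pin_static=True the operator-configured address always wins;
    otherwise whatever the peerstore currently holds is used.
    """
    if pin_static and peer_id in static_addrs:
        return static_addrs[peer_id]
    return peerstore.get(peer_id)


def try_reconnect(peer_id, peerstore, static_addrs, reachable, pin_static):
    """Return True if the chosen address is one the network can actually reach."""
    addr = pick_dial_addr(peer_id, peerstore, static_addrs, pin_static)
    return addr is not None and addr in reachable


# Startup: the peerstore is seeded from the static peer configuration.
static_addrs = {"relay-1": "/ip4/10.0.0.5/tcp/9003"}
peerstore = dict(static_addrs)

# The relay restarts and advertises an internal IP the Sequencer cannot route to.
peerstore["relay-1"] = "/ip4/172.16.0.9/tcp/9003"
reachable = {"/ip4/10.0.0.5/tcp/9003"}

unpinned_ok = try_reconnect("relay-1", peerstore, static_addrs, reachable, pin_static=False)
pinned_ok = try_reconnect("relay-1", peerstore, static_addrs, reachable, pin_static=True)
```

In this model the unpinned reconnect fails and the pinned one succeeds, which matches the observation that a Sequencer restart (which re-reads OP_NODE_P2P_STATIC) restores the connection. Whether the real node behaves this way would need to be confirmed against the op-node reconnect code.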

sbvegan commented 1 month ago

ethereum-optimism / optimism

op-node p2p issue on node restarts #12113

Validator (Replica) Node